Hypothetical Document Embeddings (HyDE)
Hypothetical Document Embeddings (HyDE) is a method for improving vector retrieval and retrieval-augmented generation (RAG), in which a large language model (LLM) generates a "hypothetical document" based on an initial query; this text is then vectorized by an encoder, and the search for real documents is performed based on proximity to the resulting vector. The approach allows leveraging "relevance patterns" encoded by the LLM and "grounding" them in a corpus using dense embeddings[1].
Definition and Intuition
HyDE decomposes the search task into two stages:
(1) The LLM creates an "example of a relevant answer" (hypothetical document) for the query, thereby modeling the features of relevance;
(2) A contrastive encoder (e.g., Contriever) translates this text into a vector, which is then used to retrieve real documents from the index. The generated text may contain factual errors, but what is important are the thematic and terminological patterns captured by the encoder[2].
History and Sources
The idea of enhancing search with synthetic texts dates back to work on query expansion and pseudo-relevance feedback (PRF): the Rocchio algorithm and relevance language models[3][4]. For dense retrieval, contrastively trained encoders like Contriever[5] and Dense Passage Retrieval (DPR)[6] were used. The BEIR benchmark standardized zero-shot evaluation[7]. Against this backdrop, HyDE was proposed as a way to "inject" relevance knowledge from an LLM into the zero-shot setting without fine-tuning the encoder[8].
Method and Formalization
Let the document corpus be $\mathcal{D} = \{d_1, \dots, d_N\}$, and let a text encoder $E$ define the vector representations of documents $\mathbf{v}_d = E(d)$. To measure proximity, either cosine similarity or the dot product is used; an important note: **the dot product is equivalent to cosine similarity only when both vectors have a unit L2-norm** ($\lVert\mathbf{v}\rVert_2 = 1$)[9].
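This equivalence is easy to verify numerically; a minimal numpy check (the two vectors are arbitrary examples):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity of the raw vectors.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, the plain dot product gives the same value.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

assert abs(cos - a_n @ b_n) < 1e-12
```

This is why vector databases often accept inner product (IP) as the metric and simply require normalized embeddings.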
HyDE redefines the query representation through a "hypothetical document" generated by an LLM. Formally:
$$
\begin{aligned}
&\text{(1) Generation of the hypothetical document:} && \tilde d = G\!\big(q;\,\mathrm{inst}\big),\\
&\text{(2) Embedding of the hypothetical document:} && \mathbf{v}_h = E(\tilde d),\\
&\text{(3) Nearest-neighbor search:} && \mathcal{R}_k(q) = \operatorname{TopK}_{d\in\mathcal{D}}\; S\!\big(\mathbf{v}_h,\mathbf{v}_{d}\big),
\end{aligned}
$$
where $G$ is an LLM with an instruction $\mathrm{inst}$ (e.g., "Write a paragraph that answers the question..."), $S$ is the similarity measure (cosine or inner product with normalization), and $\mathcal{R}_k(q)$ is the set of $k$ documents with the highest similarity[10][11].
In engineering practice, it is common to generate **multiple** hypothetical documents and aggregate their representations to increase robustness:

$$
\mathbf{v}_h = \frac{1}{n}\sum_{i=1}^{n} E(\tilde d_i), \qquad \tilde d_i = G\!\big(q;\,\mathrm{inst},\,\theta_i\big),
$$

where $\theta_i$ are stochastic decoding parameters (e.g., temperature/top-p); the original paper also includes the query embedding $E(q)$ itself in the average. This ensembling improves Recall with a moderate increase in latency[12].
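The averaging above can be sketched in Python; `generate` and `embed` are hypothetical stand-ins for a stochastic LLM call and a Contriever-style encoder:

```python
import numpy as np

def hyde_query_vector(query, generate, embed, n_docs=4):
    """Average the embeddings of several hypothetical documents
    (plus the query itself) into a single HyDE query vector.

    generate(query) -> str   : one sampled hypothetical document
    embed(text)     -> 1-D numpy array
    """
    hypo_docs = [generate(query) for _ in range(n_docs)]
    vectors = [embed(d) for d in hypo_docs] + [embed(query)]
    v = np.mean(vectors, axis=0)
    return v / np.linalg.norm(v)  # L2-normalize so IP == cosine
```

Increasing `n_docs` trades latency and token cost for robustness of the averaged vector.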
Basic HyDE Pipeline
```
# 1) prompt(query) -> hypothetical_doc
# 2) embed(hypothetical_doc) -> v_h
# 3) retrieve(index, v_h, k) -> candidates
# 4) (optional) rerank(query, candidates) -> topN
# 5) (for RAG) stuff / map-reduce / refine on topN
```
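The first three steps can be sketched end to end; `generate_doc`, `embed`, and the brute-force matrix scan below are hypothetical stand-ins for an LLM call, an encoder, and an ANN index:

```python
import numpy as np

def hyde_retrieve(query, generate_doc, embed, doc_vectors, k=5):
    """Basic HyDE pipeline: generate -> embed -> nearest-neighbor search.

    doc_vectors: (N, dim) matrix of L2-normalized corpus embeddings.
    Returns indices of the top-k documents by inner product.
    """
    hypo = generate_doc(query)            # 1) LLM writes a hypothetical answer
    v_h = embed(hypo)                     # 2) encode the hypothetical document
    v_h = v_h / np.linalg.norm(v_h)       #    normalize so IP == cosine
    scores = doc_vectors @ v_h            # 3) brute-force "index" scan
    return np.argsort(-scores)[:k]        #    ids of the top-k candidates
```

In production the matrix multiplication would be replaced by an ANN library query, and a reranking stage (step 4) would follow on the returned candidates.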
Relation to Other Methods (QE, doc2query, PRF)
- QE (Query Expansion) adds terms to the query; HyDE, instead, generates an entire "quasi-document," which aligns better with dense encoders[13].
- doc2query / docTTTTTquery expand documents with synthetic queries before indexing[14][15]; HyDE expands the query on the fly, without requiring re-indexing.
- PRF (Rocchio, Relevance LM) updates the query vector based on the top results; HyDE extracts the "relevance pattern" directly from the LLM and then "grounds" it via retrieval from the corpus[16].
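For comparison, the classic Rocchio update (standard textbook form; $\alpha, \beta, \gamma$ are interpolation weights, $D_r$ and $D_{nr}$ the relevant and non-relevant feedback sets) moves the query vector toward the centroid of relevant documents:

$$
\mathbf{q}' = \alpha\,\mathbf{q} \;+\; \frac{\beta}{|D_r|}\sum_{d\in D_r}\mathbf{v}_d \;-\; \frac{\gamma}{|D_{nr}|}\sum_{d\in D_{nr}}\mathbf{v}_d .
$$

HyDE replaces the feedback sets with a single LLM-generated pseudo-document, so no initial retrieval pass is needed.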
Integration into RAG and Reranking
In RAG, HyDE is applied as the first retrieval stage: hypothetical document → embedding → k candidates. This is followed by reranking using BERT-class cross-encoders[17] or a late interaction model like ColBERT[18]. For merging result lists (e.g., a BM25+vector hybrid), RRF (reciprocal rank fusion) is typically used:

$$
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)},
$$

where $R$ is the set of rankings, $\mathrm{rank}_r(d)$ is the position of document $d$ in ranking $r$, and $k$ is a smoothing constant (typically $k = 60$). RRF consistently improves the aggregate quality of combined rankings[19].
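A minimal RRF sketch following Cormack et al., with the conventional smoothing constant `k = 60`:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with RRF.

    rankings: iterable of lists, each ordered best-first.
    Returns ids sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For a BM25+vector hybrid, the two ranked id lists are simply passed as the two elements of `rankings`.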
Evaluation on Benchmarks (BEIR et al.)
The original paper evaluates HyDE in a zero-shot setting on TREC DL’19/20 (web search) and on a subset of the BEIR collections (SciFact, ArguAna, TREC-COVID, FiQA, DBpedia, TREC-NEWS, Climate-FEVER). A fragment of the results, as of July 2023, is shown below; in the first table each cell reports mAP / nDCG@10 / Recall@1k, in the second nDCG@10 / Recall@100:
| Method | DL19 | DL20 | Source |
|---|---|---|---|
| BM25 | 30.1 / 50.6 / 75.0 | 28.6 / 48.0 / 78.6 | [20] |
| Contriever (unsup.) | 24.0 / 44.5 / 74.6 | 24.0 / 42.1 / 75.4 | [21] |
| HyDE (Contriever+LLM) | 41.8 / 61.3 / 88.0 | 38.2 / 57.9 / 84.4 | [22] |
| DPR (ft) | 36.5 / 62.2 / 76.9 | 41.8 / 65.3 / 81.4 | [23] |
| ANCE (ft) | 37.1 / 64.5 / 75.5 | 40.8 / 64.6 / 77.6 | [24] |
| Method | Scifact | ArguAna | TREC‑COVID | FiQA | DBPedia | TREC‑NEWS | Climate‑FEVER | Source |
|---|---|---|---|---|---|---|---|---|
| BM25 | 67.9 / 92.5 | 39.7 / 93.2 | 59.5 / 49.8 | 23.6 / 54.0 | 31.8 / 46.8 | 39.5 / 44.7 | 16.5 / 42.5 | [25] |
| Contriever | 64.9 / 92.6 | 37.9 / 90.1 | 27.3 / 17.2 | 24.5 / 56.2 | 29.2 / 45.3 | 34.8 / 42.3 | 15.5 / 44.1 | [26] |
| HyDE | 69.1 / 96.4 | 46.6 / 97.9 | 59.3 / 41.4 | 27.3 / 62.1 | 36.8 / 47.2 | 44.0 / 50.9 | 22.3 / 53.0 | [27] |
HyDE also improves MRR@100 on the multilingual Mr.TyDi datasets (sw/ko/ja/bn) relative to mContriever[28].
Practical Recommendations
- When to use HyDE
- Zero-shot/transfer settings (no relevance labels; domain dissimilarity from training corpora)[29].
- When higher Recall@k is needed with acceptable precision—HyDE often "unlocks" relevant areas of the vector space[30].
- Typical Settings
- LLM and prompt: An instruction like "Write a paragraph that answers the question..."; moderate stochasticity (e.g., temperature≈0.7)[31].
- Number of hypothetical documents: 1–5; averaging embeddings improves robustness[32].
- Embedder: (m)Contriever without fine-tuning; fine-tuned encoders can also be used (the HyDE effect persists)[33].
- Embedding normalization: L2-norm; with normalized vectors, the inner product is equivalent to cosine similarity[34].
- Hybrid retrieval: BM25+vector followed by reranking[35].
- Reranker: Cross-Encoder (BERT re-ranker)[36] or ColBERT[37].
- Merging results from different strategies: RRF (k≈60)[38].
- Quality/Cost Monitoring
- Retrieval: nDCG@k, Recall@k, MRR; end-to-end RAG: EM/F1 or groundedness metrics (RAGAS/TruLens)[39][40].
- Cost/latency: Dominated by LLM generation and (if applicable) reranking; optimized by the number of hypothetical documents and response length[41].
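As an illustration of the retrieval metrics listed above, minimal Recall@k and MRR helpers (hypothetical names, not tied to any library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit (0.0 if none is found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

In practice these are averaged over a query set; nDCG additionally weights hits by graded relevance and position.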
Limitations and Open Questions
- Hallucinations in the hypothetical document: The LLM can introduce factual errors; "grounding" via the encoder and corpus reduces the risk but does not eliminate it entirely[42].
- Domain/Language limitations: The benefit of HyDE diminishes in highly specialized domains and for low-resource languages[43].
- Latency and cost: LLM generation adds delay and token costs; critical for online scenarios and long hypothetical documents[44].
- Ethics and biases: Biases in the LLM can propagate through the hypothetical document into retrieval; safety-tuned LLMs and content filtering are preferable[45].
Comparative Table of Methods
| Method | Class | Where text is generated | Encoder/Index | Reranker (2nd stage) | Typical Metrics (Example) | Cost/Latency | Sources |
|---|---|---|---|---|---|---|---|
| HyDE | Query→hypo‑doc | On the query side (LLM → paragraph) | (m)Contriever; ANN | BERT re‑rank / ColBERT / RRF | DL19 nDCG@10≈61.3; DL20≈57.9; ArguAna nDCG@10≈46.6 | + LLM generation; + reranking (opt.) | [46] |
| BM25 | Lexical | — | Inverted index | Optional | see table (above) | Low (lexical) | [47] |
| DPR / ANCE | Dense (ft) | — | Bi‑encoder; ANN | Optional | DL19 nDCG@10≈62–65 | Medium (no LLM) | [48][49] |
| doc2query / docTTTTTquery | Doc. expansion | On the collection side (before indexing) | BM25/sparse+expanded | Optional | Improvements over BM25 on MS MARCO | High offline generation cost; fast online | [50][51] |
| PRF (Rocchio, RLM) | QE via feedback | Query (from top results) | Any | Optional | Increased Recall / risk of drift | + additional retrieval pass | [52] |
See Also
- BM25
- Vector search
- RAG
- Pseudo-relevance feedback
- BEIR
Literature
- Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978‑0521865715.
- Robertson, S.; Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR, 3(4), 333–389. DOI:10.1561/1500000019.
Links
- HyDE Repository: github.com/texttron/hyde.
- Documentation: Haystack — HyDE: docs.haystack.deepset.ai.
- Documentation: LangChain — HyDE Retriever: docs.langchain.com.
Notes
- ↑ Gao, L.; Ma, X.; Lin, J.; Callan, J. (2023). ‘‘Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE)’’. ACL 2023. pp. 1762–1777. DOI:10.18653/v1/2023.acl-long.99. arXiv:2212.10496
- ↑ Gao, L. et al. (2023). ACL 2023, §3.2. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Rocchio, J. (1971). ‘‘Relevance Feedback in Information Retrieval’’. In: Salton, G. (ed.) The SMART Retrieval System. Prentice‑Hall, pp. 313–323. ISBN 978‑0138145255.
- ↑ Lavrenko, V.; Croft, W. B. (2001). ‘‘Relevance‑Based Language Models’’. SIGIR. DOI:10.1145/383952.383972.
- ↑ Izacard, G. et al. (2021/2022). ‘‘Unsupervised Dense Information Retrieval with Contrastive Learning’’. arXiv:2112.09118.
- ↑ Karpukhin, V. et al. (2020). ‘‘Dense Passage Retrieval for Open‑Domain QA’’. EMNLP. DOI:10.18653/v1/2020.emnlp-main.550.
- ↑ Thakur, N. et al. (2021). ‘‘BEIR: A Heterogeneous Benchmark for Zero‑shot Evaluation of Information Retrieval Models’’. NeurIPS Datasets Track. arXiv:2104.08663.
- ↑ Gao, L. et al. (2023). DOI:10.18653/v1/2023.acl-long.99.
- ↑ Milvus Docs. ‘‘Similarity Metrics’’ — With L2-normalized vectors, the inner product is equivalent to cosine similarity. URL: https://milvus.io/docs/v2.2.x/metric.md
- ↑ Gao, L.; Ma, X.; Lin, J.; Callan, J. (2023). ‘‘Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE)’’. ACL 2023, §3–4. arXiv:2212.10496. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Izacard, G. et al. (2021/2022). ‘‘Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever)’’. arXiv:2112.09118.
- ↑ Gao, L. et al. (2023). Appendix (ablation): impact of the number of hypothetical documents and generation parameters. arXiv:2212.10496.
- ↑ Gao, L. et al. (2023). DOI:10.18653/v1/2023.acl-long.99.
- ↑ Nogueira, R. et al. (2019). ‘‘Document Expansion by Query Prediction’’ (doc2query). arXiv:1904.08375.
- ↑ Nogueira, R.; Lin, J. (2019). ‘‘From doc2query to docTTTTTquery’’ (tech report).
- ↑ Rocchio, J. (1971); Lavrenko & Croft (2001), see above.
- ↑ Nogueira, R.; Cho, K. (2019). ‘‘Passage Re‑ranking with BERT’’. arXiv:1901.04085.
- ↑ Khattab, O.; Zaharia, M. (2020). ‘‘ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT’’. SIGIR. DOI:10.1145/3397271.3401075; arXiv:2004.12832.
- ↑ Cormack, G. V.; Clarke, C. L. A.; Büttcher, S. (2009). ‘‘Reciprocal Rank Fusion Outperforms Condorcet and Nearly Optimally Combines Rankings’’. SIGIR. DOI:10.1145/1571941.1572114.
- ↑ Gao, L. et al. (2023). Table 1. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Izacard, G. et al. (2022); summary metrics in Gao et al., 2023, Table 1. arXiv:2112.09118.
- ↑ Gao, L. et al. (2023). Table 1.
- ↑ Karpukhin, V. et al. (2020); summary in Gao et al., 2023.
- ↑ Xiong, L. et al. (2021). ICLR. arXiv:2007.00808.
- ↑ Thakur, N. et al. (2021); summary in Gao et al., 2023, Table 2. arXiv:2104.08663.
- ↑ Izacard, G. et al. (2022); summary in Gao et al., 2023, Table 2.
- ↑ Gao, L. et al. (2023). Table 2.
- ↑ Gao, L. et al. (2023). Table 3. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Gao, L. et al. (2023). §4–5.
- ↑ Gao, L. et al. (2023). §4.2–4.3.
- ↑ Gao, L. et al. (2023). §4.1.
- ↑ Haystack Docs. ‘‘Hypothetical Document Embeddings (HyDE)’’ (engineering reference). docs.haystack.deepset.ai
- ↑ Gao, L. et al. (2023). Table 6.
- ↑ Milvus Docs. ‘‘Similarity Metrics’’.
- ↑ Haystack × Milvus Integration (official docs). haystack.deepset.ai
- ↑ Nogueira, R.; Cho, K. (2019). arXiv:1901.04085.
- ↑ Khattab, O.; Zaharia, M. (2020). DOI:10.1145/3397271.3401075.
- ↑ Cormack, G. V. et al. (2009). DOI:10.1145/1571941.1572114.
- ↑ Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge Univ. Press. ISBN 978‑0521865715.
- ↑ Es, S. et al. (2023). ‘‘RAGAS: Automated Evaluation of Retrieval‑Augmented Generation’’. arXiv:2309.15217.
- ↑ Gao, L. et al. (2023). §5.
- ↑ Gao, L. et al. (2023). §3.2; §4.1. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Gao, L. et al. (2023). Table 3; §4.4.
- ↑ Gao, L. et al. (2023). §4–5.
- ↑ Ouyang, L. et al. (2022). ‘‘Training language models to follow instructions with human feedback (InstructGPT)’’. NeurIPS. arXiv:2203.02155.
- ↑ Gao, L. et al. (2023). Tables 1–2.
- ↑ Robertson, S.; Zaragoza, H. (2009). ‘‘The Probabilistic Relevance Framework: BM25 and Beyond’’. Found. Trends IR. DOI:10.1561/1500000019.
- ↑ Karpukhin, V. et al. (2020). DOI:10.18653/v1/2020.emnlp-main.550.
- ↑ Xiong, L. et al. (2021). arXiv:2007.00808.
- ↑ Nogueira, R. et al. (2019). arXiv:1904.08375.
- ↑ Nogueira, R.; Lin, J. (2019). tech report.
- ↑ Rocchio, J. (1971). SMART; Lavrenko & Croft (2001) SIGIR.