Hypothetical Document Embeddings (HyDE)
Hypothetical Document Embeddings (HyDE) is a method for improving vector retrieval and retrieval-augmented generation (RAG), in which a large language model (LLM) generates a "hypothetical document" based on an initial query; this text is then vectorized by an encoder, and the search for real documents is performed based on proximity to the resulting vector. The approach allows leveraging "relevance patterns" encoded by the LLM and "grounding" them in a corpus using dense embeddings[1].
Definition and Intuition
HyDE decomposes the search task into two stages:
(1) The LLM creates an "example of a relevant answer" (hypothetical document) for the query, thereby modeling the features of relevance;
(2) A contrastive encoder (e.g., Contriever) translates this text into a vector, which is then used to retrieve real documents from the index. The generated text may contain factual errors, but what is important are the thematic and terminological patterns captured by the encoder[2].
History and Sources
The idea of enhancing search with synthetic texts dates back to work on query expansion and pseudo-relevance feedback (PRF): the Rocchio algorithm and relevance language models[3][4]. For dense retrieval, contrastively trained encoders like Contriever[5] and Dense Passage Retrieval (DPR)[6] were used. The BEIR benchmark standardized zero-shot evaluation[7]. Against this backdrop, HyDE was proposed as a way to "inject" relevance knowledge from an LLM into the zero-shot setting without fine-tuning the encoder[8].
Method and Formalization
Let the document corpus be $\mathcal{D} = \{d_1, \dots, d_N\}$, and let a text encoder $E$ define the vector representations of documents $\mathbf{v}_d = E(d)$. To measure proximity, either cosine similarity or the dot product is used; an important note: **the dot product is equivalent to cosine similarity only when both vectors have a unit L2-norm** ($\lVert\mathbf{v}\rVert_2 = 1$)[9].
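This equivalence is easy to verify numerically; a minimal numpy check (the two vectors are arbitrary examples):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity of the raw vectors.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, the plain dot product gives the same value.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

assert abs(cos - a_n @ b_n) < 1e-12
```

This is why vector databases often accept inner product (IP) as the metric and simply require normalized embeddings.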
HyDE redefines the query representation through a "hypothetical document" generated by an LLM. Formally:
$$
\begin{aligned}
&\text{(1) Generation of the hypothetical document:} && \tilde d = G\!\big(q;\,\mathrm{inst}\big),\\
&\text{(2) Embedding of the hypothetical document:} && \mathbf{v}_h = E(\tilde d),\\
&\text{(3) Nearest-neighbor search:} && \mathcal{R}_k(q) = \operatorname{TopK}_{d\in\mathcal{D}}\; S\!\big(\mathbf{v}_h,\mathbf{v}_{d}\big),
\end{aligned}
$$
where $G$ is an LLM with an instruction $\mathrm{inst}$ (e.g., "Write a paragraph that answers the question..."), $S$ is the similarity measure (cosine or inner product with normalization), and $\mathcal{R}_k(q)$ is the set of $k$ documents with the highest similarity[10][11].
In engineering practice, it is common to generate **multiple** hypothetical documents and aggregate their representations to increase robustness:

$$
\mathbf{v}_h = \frac{1}{n}\sum_{i=1}^{n} E(\tilde d_i), \qquad \tilde d_i = G\!\big(q;\,\mathrm{inst},\,\theta_i\big),
$$

where $\theta_i$ are stochastic decoding parameters (e.g., temperature/top-p); the original paper also includes the query embedding $E(q)$ itself in the average. This ensembling improves Recall with a moderate increase in latency[12].
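The averaging above can be sketched in Python; `generate` and `embed` are hypothetical stand-ins for a stochastic LLM call and a Contriever-style encoder:

```python
import numpy as np

def hyde_query_vector(query, generate, embed, n_docs=4):
    """Average the embeddings of several hypothetical documents
    (plus the query itself) into a single HyDE query vector.

    generate(query) -> str   : one sampled hypothetical document
    embed(text)     -> 1-D numpy array
    """
    hypo_docs = [generate(query) for _ in range(n_docs)]
    vectors = [embed(d) for d in hypo_docs] + [embed(query)]
    v = np.mean(vectors, axis=0)
    return v / np.linalg.norm(v)  # L2-normalize so IP == cosine
```

Increasing `n_docs` trades latency and token cost for robustness of the averaged vector.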
Basic HyDE Pipeline
```
# 1) prompt(query) -> hypothetical_doc
# 2) embed(hypothetical_doc) -> v_h
# 3) retrieve(index, v_h, k) -> candidates
# 4) (optional) rerank(query, candidates) -> topN
# 5) (for RAG) stuff / map-reduce / refine on topN
```
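The first three steps can be sketched end to end; `generate_doc`, `embed`, and the brute-force matrix scan below are hypothetical stand-ins for an LLM call, an encoder, and an ANN index:

```python
import numpy as np

def hyde_retrieve(query, generate_doc, embed, doc_vectors, k=5):
    """Basic HyDE pipeline: generate -> embed -> nearest-neighbor search.

    doc_vectors: (N, dim) matrix of L2-normalized corpus embeddings.
    Returns indices of the top-k documents by inner product.
    """
    hypo = generate_doc(query)            # 1) LLM writes a hypothetical answer
    v_h = embed(hypo)                     # 2) encode the hypothetical document
    v_h = v_h / np.linalg.norm(v_h)       #    normalize so IP == cosine
    scores = doc_vectors @ v_h            # 3) brute-force "index" scan
    return np.argsort(-scores)[:k]        #    ids of the top-k candidates
```

In production the matrix multiplication would be replaced by an ANN library query, and a reranking stage (step 4) would follow on the returned candidates.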
Relation to Other Methods (QE, doc2query, PRF)
- QE (Query Expansion) adds terms to the query; HyDE, instead, generates an entire "quasi-document," which aligns better with dense encoders[13].
- doc2query / docTTTTTquery expand documents with synthetic queries before indexing[14][15]; HyDE expands the query on the fly, without requiring re-indexing.
- PRF (Rocchio, Relevance LM) updates the query vector based on the top results; HyDE extracts the "relevance pattern" directly from the LLM and then "grounds" it via retrieval from the corpus[16].
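For comparison, the classic Rocchio update (standard textbook form; $\alpha, \beta, \gamma$ are interpolation weights, $D_r$ and $D_{nr}$ the relevant and non-relevant feedback sets) moves the query vector toward the centroid of relevant documents:

$$
\mathbf{q}' = \alpha\,\mathbf{q} \;+\; \frac{\beta}{|D_r|}\sum_{d\in D_r}\mathbf{v}_d \;-\; \frac{\gamma}{|D_{nr}|}\sum_{d\in D_{nr}}\mathbf{v}_d .
$$

HyDE replaces the feedback sets with a single LLM-generated pseudo-document, so no initial retrieval pass is needed.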
Integration into RAG and Reranking
In RAG, HyDE is applied as the first retrieval stage: hypothetical document → embedding → k candidates. This is followed by reranking using BERT-class cross-encoders[17] or a late interaction model like ColBERT[18]. For merging result lists (e.g., a BM25+vector hybrid), RRF (reciprocal rank fusion) is typically used:

$$
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)},
$$

where $R$ is the set of rankings, $\mathrm{rank}_r(d)$ is the position of document $d$ in ranking $r$, and $k$ is a smoothing constant (typically $k = 60$). RRF consistently improves the aggregate quality of combined rankings[19].
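A minimal RRF sketch following Cormack et al., with the conventional smoothing constant `k = 60`:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with RRF.

    rankings: iterable of lists, each ordered best-first.
    Returns ids sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For a BM25+vector hybrid, the two ranked id lists are simply passed as the two elements of `rankings`.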
Evaluation on Benchmarks (BEIR et al.)
The original paper evaluates HyDE in a zero-shot setting on TREC DL’19/20 (web search) and on a subset of the BEIR collections (SciFact, ArguAna, TREC-COVID, FiQA, DBpedia, TREC-NEWS, Climate-FEVER). A fragment of the results, as of July 2023, is shown below; in the first table each cell reports mAP / nDCG@10 / Recall@1k, in the second nDCG@10 / Recall@100:
| Method | DL19 | DL20 | Source |
|---|---|---|---|
| BM25 | 30.1 / 50.6 / 75.0 | 28.6 / 48.0 / 78.6 | [20] |
| Contriever (unsup.) | 24.0 / 44.5 / 74.6 | 24.0 / 42.1 / 75.4 | [21] |
| HyDE (Contriever+LLM) | 41.8 / 61.3 / 88.0 | 38.2 / 57.9 / 84.4 | [22] |
| DPR (ft) | 36.5 / 62.2 / 76.9 | 41.8 / 65.3 / 81.4 | [23] |
| ANCE (ft) | 37.1 / 64.5 / 75.5 | 40.8 / 64.6 / 77.6 | [24] |
| Method | Scifact | ArguAna | TREC‑COVID | FiQA | DBPedia | TREC‑NEWS | Climate‑FEVER | Source |
|---|---|---|---|---|---|---|---|---|
| BM25 | 67.9 / 92.5 | 39.7 / 93.2 | 59.5 / 49.8 | 23.6 / 54.0 | 31.8 / 46.8 | 39.5 / 44.7 | 16.5 / 42.5 | [25] |
| Contriever | 64.9 / 92.6 | 37.9 / 90.1 | 27.3 / 17.2 | 24.5 / 56.2 | 29.2 / 45.3 | 34.8 / 42.3 | 15.5 / 44.1 | [26] |
| HyDE | 69.1 / 96.4 | 46.6 / 97.9 | 59.3 / 41.4 | 27.3 / 62.1 | 36.8 / 47.2 | 44.0 / 50.9 | 22.3 / 53.0 | [27] |
HyDE also improves MRR@100 on the multilingual Mr.TyDi datasets (sw/ko/ja/bn) relative to mContriever[28].
Practical Recommendations
- When to use HyDE
- Zero-shot/transfer settings (no relevance labels; domain dissimilarity from training corpora)[29].
- When higher Recall@k is needed with acceptable precision—HyDE often "unlocks" relevant areas of the vector space[30].
- Typical Settings
- LLM and prompt: An instruction like "Write a paragraph that answers the question..."; moderate stochasticity (e.g., temperature≈0.7)[31].
- Number of hypothetical documents: 1–5; averaging embeddings improves robustness[32].
- Embedder: (m)Contriever without fine-tuning; fine-tuned encoders can also be used (the HyDE effect persists)[33].
- Embedding normalization: L2-norm; with normalized vectors, the inner product is equivalent to cosine similarity[34].
- Hybrid retrieval: BM25+vector followed by reranking[35].
- Reranker: Cross-Encoder (BERT re-ranker)[36] or ColBERT[37].
- Merging results from different strategies: RRF (k≈60)[38].
- Quality/Cost Monitoring
- Retrieval: nDCG@k, Recall@k, MRR; end-to-end RAG: EM/F1 or groundedness metrics (RAGAS/TruLens)[39][40].
- Cost/latency: Dominated by LLM generation and (if applicable) reranking; optimized by the number of hypothetical documents and response length[41].
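As an illustration of the retrieval metrics listed above, minimal Recall@k and MRR helpers (hypothetical names, not tied to any library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit (0.0 if none is found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

In practice these are averaged over a query set; nDCG additionally weights hits by graded relevance and position.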
Limitations and Open Questions
- Hallucinations in the hypothetical document: The LLM can introduce factual errors; "grounding" via the encoder and corpus reduces the risk but does not eliminate it entirely[42].
- Domain/Language limitations: The benefit of HyDE diminishes in highly specialized domains and for low-resource languages[43].
- Latency and cost: LLM generation adds delay and token costs; critical for online scenarios and long hypothetical documents[44].
- Ethics and biases: Biases in the LLM can propagate through the hypothetical document into retrieval; safety-tuned LLMs and content filtering are preferable[45].
Comparative Table of Methods
| Method | Class | Where text is generated | Encoder/Index | Reranker (2nd stage) | Typical Metrics (Example) | Cost/Latency | Sources |
|---|---|---|---|---|---|---|---|
| HyDE | Query→hypo‑doc | On the query side (LLM → paragraph) | (m)Contriever; ANN | BERT re‑rank / ColBERT / RRF | DL19 nDCG@10≈61.3; DL20≈57.9; ArguAna nDCG@10≈46.6 | + LLM generation; + reranking (opt.) | [46] |
| BM25 | Lexical | — | Inverted index | Optional | see table (above) | Low (lexical) | [47] |
| DPR / ANCE | Dense (ft) | — | Bi‑encoder; ANN | Optional | DL19 nDCG@10≈62–65 | Medium (no LLM) | [48][49] |
| doc2query / docTTTTTquery | Doc. expansion | On the collection side (before indexing) | BM25/sparse+expanded | Optional | Improvements over BM25 on MS MARCO | High offline generation cost; fast online | [50][51] |
| PRF (Rocchio, RLM) | QE via feedback | Query (from top results) | Any | Optional | Increased Recall / risk of drift | + additional retrieval pass | [52] |
See Also
- BM25
- Vector search
- RAG
- Pseudo-relevance feedback
- BEIR
Literature
- Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978‑0521865715.
- Robertson, S.; Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR, 3(4), 333–389. DOI:10.1561/1500000019.
Links
- HyDE Repository: github.com/texttron/hyde.
- Documentation: Haystack — HyDE: docs.haystack.deepset.ai.
- Documentation: LangChain — HyDE Retriever: docs.langchain.com.
Notes
- ↑ Gao, L.; Ma, X.; Lin, J.; Callan, J. (2023). ‘‘Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE)’’. ACL 2023. pp. 1762–1777. DOI:10.18653/v1/2023.acl-long.99. arXiv:2212.10496
- ↑ Gao, L. et al. (2023). ACL 2023, §3.2. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Rocchio, J. (1971). ‘‘Relevance Feedback in Information Retrieval’’. In: Salton, G. (ed.) The SMART Retrieval System. Prentice‑Hall, pp. 313–323. ISBN 978‑0138145255.
- ↑ Lavrenko, V.; Croft, W. B. (2001). ‘‘Relevance‑Based Language Models’’. SIGIR. DOI:10.1145/383952.383972.
- ↑ Izacard, G. et al. (2021/2022). ‘‘Unsupervised Dense Information Retrieval with Contrastive Learning’’. arXiv:2112.09118.
- ↑ Karpukhin, V. et al. (2020). ‘‘Dense Passage Retrieval for Open‑Domain QA’’. EMNLP. DOI:10.18653/v1/2020.emnlp-main.550.
- ↑ Thakur, N. et al. (2021). ‘‘BEIR: A Heterogeneous Benchmark for Zero‑shot Evaluation of Information Retrieval Models’’. NeurIPS Datasets Track. arXiv:2104.08663.
- ↑ Gao, L. et al. (2023). DOI:10.18653/v1/2023.acl-long.99.
- ↑ Milvus Docs. ‘‘Similarity Metrics’’ — With L2-normalized vectors, the inner product is equivalent to cosine similarity. URL: https://milvus.io/docs/v2.2.x/metric.md
- ↑ Gao, L.; Ma, X.; Lin, J.; Callan, J. (2023). ‘‘Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE)’’. ACL 2023, §3–4. arXiv:2212.10496. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Izacard, G. et al. (2021/2022). ‘‘Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever)’’. arXiv:2112.09118.
- ↑ Gao, L. et al. (2023). Appendix (ablation): impact of the number of hypothetical documents and generation parameters. arXiv:2212.10496.
- ↑ Gao, L. et al. (2023). DOI:10.18653/v1/2023.acl-long.99.
- ↑ Nogueira, R. et al. (2019). ‘‘Document Expansion by Query Prediction’’ (doc2query). arXiv:1904.08375.
- ↑ Nogueira, R.; Lin, J. (2019). ‘‘From doc2query to docTTTTTquery’’ (tech report).
- ↑ Rocchio, J. (1971); Lavrenko & Croft (2001), see above.
- ↑ Nogueira, R.; Cho, K. (2019). ‘‘Passage Re‑ranking with BERT’’. arXiv:1901.04085.
- ↑ Khattab, O.; Zaharia, M. (2020). ‘‘ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT’’. SIGIR. DOI:10.1145/3397271.3401075; arXiv:2004.12832.
- ↑ Cormack, G. V.; Clarke, C. L. A.; Büttcher, S. (2009). ‘‘Reciprocal Rank Fusion Outperforms Condorcet and Nearly Optimally Combines Rankings’’. SIGIR. DOI:10.1145/1571941.1572114.
- ↑ Gao, L. et al. (2023). Table 1. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Izacard, G. et al. (2022); summary metrics in Gao et al., 2023, Table 1. arXiv:2112.09118.
- ↑ Gao, L. et al. (2023). Table 1.
- ↑ Karpukhin, V. et al. (2020); summary in Gao et al., 2023.
- ↑ Xiong, L. et al. (2021). ICLR. arXiv:2007.00808.
- ↑ Thakur, N. et al. (2021); summary in Gao et al., 2023, Table 2. arXiv:2104.08663.
- ↑ Izacard, G. et al. (2022); summary in Gao et al., 2023, Table 2.
- ↑ Gao, L. et al. (2023). Table 2.
- ↑ Gao, L. et al. (2023). Table 3. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Gao, L. et al. (2023). §4–5.
- ↑ Gao, L. et al. (2023). §4.2–4.3.
- ↑ Gao, L. et al. (2023). §4.1.
- ↑ Haystack Docs. ‘‘Hypothetical Document Embeddings (HyDE)’’ (engineering reference). docs.haystack.deepset.ai
- ↑ Gao, L. et al. (2023). Table 6.
- ↑ Milvus Docs. ‘‘Similarity Metrics’’.
- ↑ Haystack × Milvus Integration (official docs). haystack.deepset.ai
- ↑ Nogueira, R.; Cho, K. (2019). arXiv:1901.04085.
- ↑ Khattab, O.; Zaharia, M. (2020). DOI:10.1145/3397271.3401075.
- ↑ Cormack, G. V. et al. (2009). DOI:10.1145/1571941.1572114.
- ↑ Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge Univ. Press. ISBN 978‑0521865715.
- ↑ Es, S. et al. (2023). ‘‘RAGAS: Automated Evaluation of Retrieval‑Augmented Generation’’. arXiv:2309.15217.
- ↑ Gao, L. et al. (2023). §5.
- ↑ Gao, L. et al. (2023). §3.2; §4.1. DOI:10.18653/v1/2023.acl-long.99.
- ↑ Gao, L. et al. (2023). Table 3; §4.4.
- ↑ Gao, L. et al. (2023). §4–5.
- ↑ Ouyang, L. et al. (2022). ‘‘Training language models to follow instructions with human feedback (InstructGPT)’’. NeurIPS. arXiv:2203.02155.
- ↑ Gao, L. et al. (2023). Tables 1–2.
- ↑ Robertson, S.; Zaragoza, H. (2009). ‘‘The Probabilistic Relevance Framework: BM25 and Beyond’’. Found. Trends IR. DOI:10.1561/1500000019.
- ↑ Karpukhin, V. et al. (2020). DOI:10.18653/v1/2020.emnlp-main.550.
- ↑ Xiong, L. et al. (2021). arXiv:2007.00808.
- ↑ Nogueira, R. et al. (2019). arXiv:1904.08375.
- ↑ Nogueira, R.; Lin, J. (2019). tech report.
- ↑ Rocchio, J. (1971). SMART; Lavrenko & Croft (2001) SIGIR.