Hybrid retrieval
Hybrid retrieval (hybrid search) is a class of information retrieval methods that combine lexical (sparse) and semantic (dense or late-interaction) signals to improve the recall and precision of search results. Hybrid schemes leverage the complementary strengths of exact term matching (BM25/TF-IDF) and vector similarity (bi-encoders, late-interaction models), together with rank-fusion methods that are robust to differing score scales (e.g., Reciprocal Rank Fusion, CombSUM/CombMNZ) and reranking with cross-encoders.[1][2][3]
Definition and Motivation
Hybrid retrieval is a parallel or cascaded search process across two (or more) independent signal channels, followed by fusion and/or reranking. Typical motivations include: (i) overcoming the "vocabulary mismatch" problem (synonyms, paraphrasing), (ii) robustness to typos and morphology, (iii) retrieving specific codes or identifiers (where sparse models excel), and (iv) transferring to new domains or languages (where dense models provide semantic generalization).[4][5][6]
Components of Hybrid Search
Lexical (sparse)
- Classical models. TF-IDF and BM25/BM25F are standard baseline methods based on inverted indexes; BM25 is grounded in the probabilistic relevance framework (PRF) and is widely used for first-stage ranking.[7]
- Learned sparse models.
- SPLADE / SPLADE++/v3. A neural sparse model that learns term expansion and weighting via an MLM head with sparsity regularization; it demonstrates strong results and good transferability (BEIR).[8][9][10]
- uniCOIL/COIL. Contextualized inverted lists and their simplified version, uniCOIL; they are compatible with classic inverted indexes.[11]
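To make the lexical channel concrete, the BM25 scoring rule from the probabilistic relevance framework can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular library; the toy corpus, whitespace tokenization, and the parameter values k1 = 1.5, b = 0.75 are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents against a query with Okapi BM25.

    docs: list of token lists; query_terms: list of tokens.
    Returns (doc_index, score) pairs sorted by score, best first.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    ranked = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        ranked.append((i, score))
    return sorted(ranked, key=lambda x: x[1], reverse=True)

docs = [["hybrid", "retrieval", "fuses", "sparse", "and", "dense"],
        ["bm25", "is", "a", "sparse", "lexical", "model"],
        ["dense", "retrievers", "use", "vector", "embeddings"]]
ranked = bm25_rank(["sparse", "lexical"], docs)  # doc 1 matches both terms
```

In production this computation runs over an inverted index (only documents containing at least one query term are touched), which is what makes BM25 a cheap first-stage ranker.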
Semantic (dense/late-interaction)
- Bi-encoder (single-vector). The query and document are encoded by vector models, and similarity is calculated using dot-product/MIPS. Examples: DPR,[12] ANCE,[13] Contriever,[14] GTR,[15] E5.[16]
- Late-interaction (multi-vector). These models capture token-level interactions at a 'late' stage: ColBERT/ColBERTv2. The trade-off is higher precision for a larger index and greater latency, mitigated by efficient retrieval engines such as PLAID and WARP.[17][18][19]
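The late-interaction scoring rule itself is compact: each query token embedding is matched against its best document token embedding (ColBERT's "MaxSim" operator), and the per-token maxima are summed. A minimal sketch, with toy 3-dimensional vectors standing in for the contextualized BERT embeddings a real model would produce:

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token, take the
    maximum dot product over all document tokens, then sum."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: 2 query-token vectors vs. 3 doc-token vectors.
q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.1, 0.9]]
score = maxsim_score(q, d)  # 0.9 (token 1) + 0.8 (token 2) = 1.7
```

Because every document keeps one vector per token, the index is much larger than a single-vector (bi-encoder) index, which is the storage/latency trade-off noted above.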
Hybridization Schemes and Rank Fusion
- Parallel search and candidate fusion. Candidate lists (sparse and dense) with their internal scores are retrieved independently, followed by rank fusion.[20]
- RRF (Reciprocal Rank Fusion). A technique robust to incomparable ranking scores that sums reciprocal ranks: RRF(d) = Σ_r 1/(k + r(d)), where r(d) is the rank of document d in each input list and typically k = 60.[21] Supported in production search engines (Elasticsearch/OpenSearch) as a built-in retriever/processor.[22][23]
- CombSUM/CombMNZ et al. Classic 'score summation' functions (with normalization, if needed).[24][25][26]
- Weighted linear combination. score(d) = α·s_dense(d) + (1 − α)·s_sparse(d), with α ∈ [0, 1]. The choice of α can be fixed or learned (per-collection or per-query).[27]
- Score normalization. For CombSUM/CombMNZ, methods like min-max or z-score are often used to align scales;[28] alternatively, RRF relies only on ranks.
- Dynamic/adaptive weighting. Query routing, query features, and LTR models for selecting/weighting channels; recent work shows that a simple learned combination often outperforms RRF and is less sensitive to normalization.[29]
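The two main fusion schemes above can be sketched directly. The snippet below implements RRF with the conventional k = 60 and a convex combination after per-channel min-max normalization; the input rankings and scores are illustrative.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank).
    rankings: list of doc-id lists, each ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_fuse(sparse_scores, dense_scores, alpha=0.5):
    """Convex combination after min-max normalization of each channel,
    so BM25 and cosine scores become comparable before mixing."""
    def minmax(s):
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo) if hi > lo else 0.0
                for d, v in s.items()}
    sp, de = minmax(sparse_scores), minmax(dense_scores)
    docs = set(sp) | set(de)
    fused = {d: alpha * sp.get(d, 0.0) + (1 - alpha) * de.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

fused_rrf = rrf_fuse([["d1", "d2", "d3"], ["d1", "d3", "d2"]])
fused_lin = weighted_fuse({"d1": 12.0, "d2": 7.5},
                          {"d1": 0.61, "d3": 0.80}, alpha=0.6)
```

Note that RRF needs only ranks, which is exactly why it tolerates incomparable score scales, while the weighted combination requires the normalization step (or calibrated scores) to be meaningful.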
Reranking and Multi-Stage Pipelines
Hybrid systems are typically built as retrieval → fusion → rerank pipelines. Reranking is performed using:
- Cross-encoders (BERT/T5). The most accurate but computationally expensive: MonoBERT/MonoT5 for reordering the top-N candidates.[30][31]
- Late-interaction as a reranker. The ColBERT family can also act as a reranker; modern accelerators (PLAID, WARP) reduce latency with little or no quality loss.[32][33]
The quality ↔ latency/cost trade-off is particularly important in RAG applications and under strict SLAs (see p95/p99 tail latencies).[34]
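Composing the stages, a retrieval → fusion → rerank cascade can be sketched as below. The two first-stage retrievers and the cross-encoder are hypothetical stand-in callables (in practice they would be, e.g., BM25, a bi-encoder, and MonoT5); the point is the cascade shape: cheap wide retrieval first, the expensive scorer only on a small fused candidate set.

```python
def hybrid_pipeline(query, sparse_retrieve, dense_retrieve, cross_encode,
                    depth=100, rerank_depth=10, k=60):
    """Retrieve from both channels, fuse ranks with RRF, rerank the top.

    sparse_retrieve / dense_retrieve: query -> ranked list of doc ids.
    cross_encode: (query, doc_id) -> relevance score (the costly stage).
    """
    # Stage 1: parallel first-stage retrieval from both channels.
    lists = [sparse_retrieve(query)[:depth], dense_retrieve(query)[:depth]]
    # Stage 2: rank fusion (RRF), robust to incomparable score scales.
    fused = {}
    for ranking in lists:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:rerank_depth]
    # Stage 3: expensive cross-encoder rerank of the small candidate set.
    return sorted(candidates, key=lambda d: cross_encode(query, d),
                  reverse=True)

# Toy stand-ins for the three components.
out = hybrid_pipeline(
    "hybrid search",
    sparse_retrieve=lambda q: ["d1", "d2", "d3"],
    dense_retrieve=lambda q: ["d3", "d4", "d1"],
    cross_encode=lambda q, d: {"d1": 0.2, "d2": 0.9, "d3": 0.7, "d4": 0.1}[d],
)
```

The `depth` and `rerank_depth` knobs are where the quality ↔ latency trade-off is tuned: widening first-stage depth raises recall, while the rerank depth bounds the number of cross-encoder invocations per query.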
Evaluation on Benchmarks
- BEIR. A unified set of heterogeneous collections/tasks for zero-shot/out-of-domain evaluation of retrievers (e.g., TREC‑COVID, NFCorpus, NQ, HotpotQA, FiQA‑2018, DBPedia-entity, ArguAna, Webis-Touché-2020, FEVER/Climate-FEVER, Scidocs, SciFact, CQADupStack, etc.).[35]
- TREC Deep Learning / MS MARCO. Classic resources for training/evaluating retrievers and rerankers on large-scale data.[36][37][38]
- Quality metrics. nDCG@k, Recall@k, MRR; for performance: latency p50/p95/p99, QPS; for operations: memory/cost (CPU/GPU, index).[39][40]
- Ablation studies. It is recommended to measure the contribution of each channel/weight and the sensitivity to parameters such as k in RRF and α in the linear combination, and to evaluate robustness to paraphrasing and OOD shifts.[41][42]
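The ranking metrics above are computed from a ranked list plus relevance judgments; a minimal sketch (the judgments are a toy example; nDCG uses graded gains, MRR a binary relevant set):

```python
import math

def ndcg_at_k(ranked, rels, k=10):
    """nDCG@k with graded relevance; rels maps doc id -> gain (0 if absent)."""
    dcg = sum(rels.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

ranked = ["d3", "d1", "d5", "d2"]
rels = {"d1": 3, "d2": 1}                # graded judgments (toy)
ndcg = ndcg_at_k(ranked, rels, k=10)     # ~0.64 for this toy example
rr = mrr(ranked, {"d1", "d2"})           # first relevant at rank 2 -> 0.5
```

For channel ablations, the same functions are simply run on each channel's ranking and on the fused ranking, sweeping k or α to chart sensitivity.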
Engineering Aspects and Production Practices
- Indexes and ANN. FAISS (Flat/HNSW/IVF-PQ), HNSW, and ScaNN for MIPS/cosine similarity.[43][44][45]
- IR Stack. Lucene/Anserini/Pyserini for sparse, dense, and hybrid pipelines; 'turnkey' reproducibility on BEIR.[46][47]
- Vector Databases and Search Engines. Qdrant, Weaviate, pgvector/PostgreSQL, Vespa, and Elasticsearch/OpenSearch have native modes for hybrid search (BM25F+vector) and/or RRF/linear combination.[48][49][50][51][52]
- RAG Pattern. Architecture: retrieval → fusion → rerank → LLM context with token limits and source tracing.[53]
- Index updates, deduplication, tokenization. It is important to align tokenization between BM25 and the vectorizer; scores should be calibrated (normalized/scaled) before fusion.[54]
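The score calibration mentioned in the last bullet can be as simple as a per-channel z-score transform, so that unbounded BM25 scores and bounded cosine similarities become comparable before CombSUM-style fusion. A sketch with illustrative score values:

```python
import statistics

def zscore(scores):
    """Standardize one channel's scores: (s - mean) / stdev."""
    vals = list(scores.values())
    mu = statistics.fmean(vals)
    sigma = statistics.pstdev(vals) or 1.0  # guard against a constant channel
    return {d: (v - mu) / sigma for d, v in scores.items()}

bm25 = {"d1": 14.2, "d2": 9.1, "d3": 2.0}      # unbounded lexical scores
cosine = {"d1": 0.62, "d2": 0.80, "d4": 0.55}  # bounded dense similarities
bm25_z, cosine_z = zscore(bm25), zscore(cosine)
# After standardization both channels have mean 0 and unit variance,
# so a plain score sum no longer favors the larger-scale channel.
```

One practical caveat: the mean and deviation are computed over the retrieved candidate list per query, so very short candidate lists make the transform noisy, which is one reason rank-only fusion (RRF) is often preferred in production.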
Limitations and Open Questions
- Transferability and Multilingualism. Dense models (GTR/E5) improve transfer but are sensitive to domain/language; sparse models (SPLADE) are often more robust on OOD.[55][56]
- Integration with LLMs and Hallucinations. Hybrid retrieval reduces omissions and noise in RAG contexts but does not completely eliminate hallucinations, requiring strict rerankers and source filtering.[57]
- Cost and Privacy. Storage of multi-vector indexes, compression, encryption, and on-premise stacks; TCO assessment.
- Trends. HyDE/doc2query/PRF as document/query expansion;[58][59] learning to blend (per-query α), more efficient late-interaction models (PLAID/WARP), long documents, and multi-vector indexes.[60][61]
Comparative Table of Methods
As of 2025-09-10 (example on the BEIR trec-covid collection; nDCG@10 / Recall@100):[62]
| Method | Type (sparse/dense/hybrid) | Concept/Model | Fusion Scheme | Reranker | nDCG@10 / R@100 | Latency (rel.) | Sources |
|---|---|---|---|---|---|---|---|
| BM25 | sparse | Exact term matching (PRF/BM25) | — | — | 0.595 / 0.109 | very low | [63][64] |
| SPLADE++ (ED) | sparse (learned) | Sparse term expansion/weighting | — | — | 0.727 / 0.128 | low–medium | [65][66] |
| Contriever (MS MARCO FT) | dense | Bi-encoder with contrastive learning | — | — | 0.596 / 0.091 | medium | [67][68] |
| BGE-base-en-v1.5 | dense | Strong general-purpose embedder | — | — | 0.781 / 0.141 | medium | [69] |
| Cohere embed-english-v3.0 | dense | Production-grade text embedding model | — | — | 0.818 / 0.159 | medium | [70] |
| BM25 + dense (e.g., BM25+BGE) | hybrid | Parallel retrieval + list fusion | RRF (k≈60) or weighted combination | opt.: MonoT5/ColBERT | (varies by implementation; typically > best single channel) | medium | [71][72][73] |
Note: The last row illustrates the scheme; exact numbers depend on the choice of embedder, normalization, and fusion parameters (see sources and reproducible Pyserini scripts).
Literature
- Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978‑0521865715.
- Robertson, S., Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3(4):333–389. DOI:10.1561/1500000019.
- Lin, J. et al. (2021). Pyserini: A Python Toolkit for Reproducible IR. SIGIR.
- Järvelin, K., Kekäläinen, J. (2002). Cumulated Gain‑Based Evaluation of IR Techniques. Information Retrieval 6:241–256. DOI:10.1023/A:1016043826386.
- Dean, J., Barroso, L.A. (2013). The Tail at Scale. CACM 56(2):74–80. DOI:10.1145/2408776.2408794.
Links
- Pyserini / Anserini: github.com/castorini/pyserini • github.com/castorini/anserini
- FAISS: arXiv:1702.08734
- Weaviate (Hybrid search): docs.weaviate.io/weaviate/search/hybrid
- pgvector: github.com/pgvector/pgvector
- Vespa (Hybrid search tutorial): docs.vespa.ai/en/tutorials/hybrid-search.html
Notes
- ↑ Robertson, S., Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. DOI:10.1561/1500000019.
- ↑ Cormack, G.V., Clarke, C.L.A., Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009, 758–759. PDF.
- ↑ Bruch, S., Gai, S., Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS 42(1):1–35. DOI:10.1145/3596512 • arXiv:2210.11934.
- ↑ Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978‑0521865715 (see chapters on TF‑IDF, evaluation, and the vocabulary mismatch problem).
- ↑ Izacard, G. et al. (2022). Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever). TACL 10:1089–1108. arXiv:2112.09118.
- ↑ Wang, L. et al. (2022/2024). Text Embeddings by Weakly‑Supervised Contrastive Pre‑training (E5). arXiv:2212.03533.
- ↑ Robertson, S., Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. DOI:10.1561/1500000019.
- ↑ Formal, T., Piwowarski, B., Clinchant, S. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. arXiv:2107.05720.
- ↑ Formal, T. et al. (2022). Making Sparse Neural IR Models More Effective. Findings of EMNLP. arXiv:2205.04733.
- ↑ Formal, T. et al. (2024). SPLADE‑v3: New baselines for SPLADE. arXiv:2403.06789.
- ↑ Lin, J., Ma, X. (2021). A Few Brief Notes on DeepImpact, COIL, and uniCOIL. arXiv:2106.14807.
- ↑ Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open‑Domain QA. EMNLP. arXiv:2004.04906.
- ↑ Xiong, L. et al. (2021). Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE). ICLR. arXiv:2007.00808.
- ↑ Izacard, G. et al. (2022). TACL. arXiv:2112.09118.
- ↑ Ni, J. et al. (2021/2022). Large Dual Encoders Are Generalizable Retrievers (GTR). EMNLP. arXiv:2112.07899.
- ↑ Wang, L. et al. (2022/2024). arXiv:2212.03533.
- ↑ Khattab, O., Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR. arXiv:2004.12832.
- ↑ Santhanam, K. et al. (2022). ColBERTv2 & PLAID. NAACL/ArXiv. arXiv:2112.01488; arXiv:2205.09707.
- ↑ Scheerer, J.L. et al. (2025). WARP: An Efficient Engine for Multi‑Vector Retrieval. arXiv:2501.17788.
- ↑ Lin, J. et al. (2021). Pyserini: A Python Toolkit for Reproducible IR with Sparse and Dense Representations. SIGIR. PDF.
- ↑ Cormack, G.V., Clarke, C.L.A., Büttcher, S. (2009). SIGIR. PDF.
- ↑ Elastic Docs. Reciprocal Rank Fusion. (accessed 2025‑09‑10). elastic.co/docs/.../reciprocal-rank-fusion.
- ↑ OpenSearch Docs. Score ranker processor (RRF). (accessed 2025‑09‑10). docs.opensearch.org/.../score-ranker-processor/.
- ↑ Fox, E.A., Shaw, J.A. (1994). Combination of Multiple Searches. TREC‑2, NIST SP 500‑215, 243–252. PDF.
- ↑ Lee, J.H. (1997). Analyses of Multiple Evidence Combination. SIGIR, 267–276. DOI:10.1145/258525.258587.
- ↑ Hsu, D.F., Taksa, I. (2005). Comparing Rank and Score Combination Methods for Data Fusion in IR. (Tech. report). PDF.
- ↑ Bruch, S., Gai, S., Ingber, A. (2023). TOIS. DOI:10.1145/3596512.
- ↑ Hsu, D.F., Taksa, I. (2005). See above.
- ↑ Bruch, S., Gai, S., Ingber, A. (2023). TOIS. DOI:10.1145/3596512.
- ↑ Nogueira, R., Cho, K. (2019). Passage Re‑ranking with BERT. arXiv:1901.04085.
- ↑ Nogueira, R., Jiang, Z., Lin, J. (2020). Document Ranking with a Pretrained Sequence‑to‑Sequence Model (MonoT5). Findings of EMNLP. arXiv:2003.06713.
- ↑ Santhanam, K. et al. (2022). arXiv:2205.09707.
- ↑ Scheerer, J.L. et al. (2025). arXiv:2501.17788.
- ↑ Dean, J., Barroso, L.A. (2013). The Tail at Scale. CACM 56(2):74–80. DOI:10.1145/2408776.2408794.
- ↑ Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero‑shot Evaluation of IR Models. NeurIPS Datasets & Benchmarks. arXiv:2104.08663.
- ↑ Craswell, N. et al. (2020). Overview of the TREC 2019 Deep Learning Track. arXiv:2003.07820.
- ↑ Craswell, N. et al. (2021). Overview of the TREC 2020 Deep Learning Track. arXiv:2102.07662.
- ↑ Bajaj, P. et al. (2016). MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268.
- ↑ Järvelin, K., Kekäläinen, J. (2002). Cumulated Gain‑Based Evaluation of IR Techniques. Information Retrieval 6:241–256. DOI:10.1023/A:1016043826386.
- ↑ Dean, J., Barroso, L.A. (2013). CACM. DOI:10.1145/2408776.2408794.
- ↑ Bruch, S. et al. (2023). DOI:10.1145/3596512.
- ↑ Ni, J. et al. (2021/2022). arXiv:2112.07899.
- ↑ Johnson, J., Douze, M., Jégou, H. (2017). Billion‑scale Similarity Search with GPUs (FAISS). arXiv:1702.08734.
- ↑ Malkov, Y., Yashunin, D. (2020). HNSW. IEEE TPAMI 42(4):824–836. DOI:10.1109/TPAMI.2018.2889473.
- ↑ Guo, R. et al. (2020). ScaNN: Efficient Vector Similarity Search at Scale. arXiv:1908.10396.
- ↑ Yang, P., Fang, H., Lin, J. (2018). Anserini: Reproducible IR Research with Lucene. JDIQ 10(4):1–20. DOI:10.1145/3239571.
- ↑ Lin, J. et al. (2021). SIGIR. PDF.
- ↑ Qdrant Docs. Hybrid queries (RRF, DBSF). (accessed 2025‑09‑10). qdrant.tech/.../hybrid-queries/.
- ↑ Weaviate Docs. Hybrid search. (accessed 2025‑09‑10). docs.weaviate.io/weaviate/search/hybrid.
- ↑ pgvector GitHub. (accessed 2025‑09‑10). github.com/pgvector/pgvector.
- ↑ Vespa Docs. Hybrid Text Search Tutorial. (accessed 2025‑09‑10). docs.vespa.ai/.../hybrid-search.html.
- ↑ Elastic Docs. Reciprocal Rank Fusion. (accessed 2025‑09‑10). elastic.co/docs/.../rrf.
- ↑ Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
- ↑ Hsu, D.F., Taksa, I. (2005). See above.
- ↑ Ni, J. et al. (2021/2022). arXiv:2112.07899.
- ↑ Formal, T. et al. (2021, 2022, 2024). arXiv:2107.05720; 2205.04733; 2403.06789.
- ↑ Lewis, P. et al. (2020). arXiv:2005.11401.
- ↑ Gao, L. et al. (2023). Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE). ACL. arXiv:2212.10496.
- ↑ Nogueira, R. et al. (2019). Document Expansion by Query Prediction. arXiv:1904.08375; docTTTTTquery. PDF.
- ↑ Santhanam, K. et al. (2022). arXiv:2205.09707.
- ↑ Scheerer, J.L. et al. (2025). arXiv:2501.17788.
- ↑ Pyserini BEIR Regressions (accessed 2025‑09‑10): results for trec-covid for BM25/SPLADE/Contriever/BGE/Cohere. castorini.github.io/pyserini/2cr/beir.html.
- ↑ Robertson, S., Zaragoza, H. (2009). DOI:10.1561/1500000019.
- ↑ Pyserini BEIR. See link above.
- ↑ Formal, T. et al. (2021, 2022). arXiv:2107.05720; 2205.04733.
- ↑ Pyserini BEIR.
- ↑ Izacard, G. et al. (2022). arXiv:2112.09118.
- ↑ Pyserini BEIR.
- ↑ Pyserini BEIR.
- ↑ Pyserini BEIR.
- ↑ Cormack et al. (2009). SIGIR. RRF.
- ↑ Bruch et al. (2023). TOIS.
- ↑ Elastic/OpenSearch RRF Docs.