Hybrid retrieval

From Systems Analysis Wiki

Hybrid retrieval (also hybrid search) is a class of information retrieval methods that combines lexical (sparse) and semantic (dense/late-interaction) signals to improve the recall and precision of search results. Hybrid schemes combine the advantages of exact term matching (BM25/TF-IDF) and vector similarity (bi-encoders, late-interaction models) with rank-fusion methods that are robust to different score scales (e.g., Reciprocal Rank Fusion, CombSUM/CombMNZ) and with cross-encoder reranking.[1][2][3]

Definition and Motivation

Hybrid retrieval is a parallel or cascaded search process across two (or more) independent signal channels, followed by fusion and/or reranking. Typical motivations include: (i) overcoming the "vocabulary mismatch" problem (synonyms, paraphrasing), (ii) robustness to typos and morphology, (iii) retrieving specific codes or identifiers (where sparse models excel), and (iv) transferring to new domains or languages (where dense models provide semantic generalization).[4][5][6]

Lexical (sparse)

  • Classical models. TF-IDF and BM25/BM25F are standard baseline methods based on inverted indexes; BM25 is grounded in the probabilistic relevance framework (PRF) and is widely used for first-stage ranking.[7]
  • Learned sparse models.
    • SPLADE / SPLADE++/v3. A neural sparse model that learns term expansion and weighting via an MLM head with sparsity regularization; it demonstrates strong results and good transferability (BEIR).[8][9][10]
    • COIL/uniCOIL. COIL builds contextualized inverted lists; uniCOIL is its simplified variant. Both are compatible with classic inverted indexes.[11]
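To make the sparse baseline concrete, here is a minimal self-contained sketch of BM25 scoring over a toy pre-tokenized corpus (the corpus, tokenization, and default parameters k1 = 1.2, b = 0.75 are illustrative assumptions, not tied to any particular library):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the corpus contributes nothing
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * num / den
        scores.append(s)
    return scores

# Toy pre-tokenized documents (hypothetical example data).
docs = [["hybrid", "retrieval", "fuses", "sparse", "and", "dense"],
        ["bm25", "is", "a", "sparse", "baseline"],
        ["dense", "retrieval", "uses", "embeddings"]]
print(bm25_scores(["sparse", "retrieval"], docs))
```

In production this scoring runs over an inverted index (Lucene/Anserini), not a corpus scan; the formula is the same.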

Semantic (dense/late-interaction)

  • Bi-encoder (single-vector). The query and document are encoded by vector models, and similarity is calculated using dot-product/MIPS. Examples: DPR,[12] ANCE,[13] Contriever,[14] GTR,[15] E5.[16]
  • Late-interaction (multi-vector). These models capture token-level interactions at a late stage: ColBERT/ColBERTv2. The trade-off is higher precision in exchange for a larger index and higher latency, which is mitigated by efficient retrieval engines such as PLAID and WARP.[17][18][19]
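The difference between the two scoring regimes can be sketched in a few lines of NumPy (the toy vectors below are hand-made stand-ins for real contextual embeddings):

```python
import numpy as np

def single_vector_score(q_vec, d_vec):
    """Bi-encoder scoring: one vector per text, similarity = dot product (MIPS)."""
    return float(np.dot(q_vec, d_vec))

def maxsim_score(q_token_vecs, d_token_vecs):
    """ColBERT-style late interaction: for each query token embedding, take the
    maximum similarity over all document token embeddings, then sum these
    per-token maxima (the MaxSim operator)."""
    sim = q_token_vecs @ d_token_vecs.T         # (n_q_tokens, n_d_tokens)
    return float(sim.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])               # two query token vectors
d = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # three doc token vectors
print(maxsim_score(q, d))
```

The multi-vector representation is why the index grows: every document stores one vector per token rather than a single pooled vector.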

Hybridization Schemes and Rank Fusion

  • Parallel search and candidate fusion. Candidate lists (sparse and dense) with their internal scores are retrieved independently, followed by rank fusion.[20]
  • RRF (Reciprocal Rank Fusion). A technique robust to incomparable ranking scores that sums reciprocal ranks:

RRF(d) = Σ_{i=1}^{m} 1 / (k + rank_i(d)), where typically k ≈ 60.[21] RRF is supported in production search engines (Elasticsearch/OpenSearch) as a built-in retriever/processor.[22][23]
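A minimal sketch of this fusion, assuming each channel returns a ranked list of document ids (best first):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each document's fused score is the sum of
    1 / (k + rank) over every input ranking in which it appears."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return doc ids ordered by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["d1", "d2", "d3"]   # hypothetical BM25 ranking
dense_hits = ["d3", "d1", "d4"]    # hypothetical bi-encoder ranking
print(rrf_fuse([sparse_hits, dense_hits]))
```

Because only ranks enter the sum, the raw BM25 and cosine scores never need to be put on a common scale.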

  • CombSUM/CombMNZ et al. Classic 'score summation' functions (with normalization, if needed).[24][25][26]
  • Weighted linear combination.

S(d) = α · S_sparse(d) + (1 − α) · S_dense(d), with α ∈ [0, 1]. The choice of α can be fixed or learned (per-collection or per-query).[27]

  • Score normalization. For CombSUM/CombMNZ, methods like min-max or z-score are often used to align scales;[28] alternatively, RRF relies only on ranks.
  • Dynamic/adaptive weighting. Query routing, query features, and LTR models for selecting/weighting channels; recent work shows that a simple learned combination often outperforms RRF and is less sensitive to normalization.[29]
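As a minimal sketch of the weighted combination with min-max normalization (the score dictionaries and fixed α below are toy assumptions):

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {d: (s - lo) / span for d, s in scores.items()}

def linear_fuse(sparse_scores, dense_scores, alpha=0.5):
    """Weighted linear combination after per-channel min-max normalization;
    a document missing from one channel contributes 0 in that channel."""
    sp, dn = min_max(sparse_scores), min_max(dense_scores)
    docs = set(sp) | set(dn)
    fused = {d: alpha * sp.get(d, 0.0) + (1 - alpha) * dn.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Toy per-channel scores: BM25 scores are unbounded, cosine scores are not.
print(linear_fuse({"d1": 12.0, "d2": 7.5}, {"d1": 0.82, "d3": 0.90}, alpha=0.6))
```

Normalization is what makes α interpretable; without it, the channel with the larger raw score range silently dominates the sum.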

Reranking and Multi-Stage Pipelines

Hybrid systems are typically built as retrieval → fusion → rerank pipelines. Reranking is performed using:

  • Cross-encoders (BERT/T5). The most accurate but computationally expensive: MonoBERT/MonoT5 for reordering the top-N candidates.[30][31]
  • Late-interaction as a reranker. The ColBERT family can also act as a reranker; modern accelerators (PLAID, WARP) reduce latency without quality loss.[32][33]
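The multi-stage shape of such pipelines can be sketched as follows; `score_fn` is a hypothetical placeholder for whatever expensive scorer (e.g., a cross-encoder) the deployment uses:

```python
def rerank_top_n(query, candidates, score_fn, n=100):
    """Rerank only the top-n fused candidates with an expensive pairwise
    scorer; the tail keeps its first-stage order."""
    head, tail = candidates[:n], candidates[n:]
    head = sorted(head, key=lambda doc: score_fn(query, doc), reverse=True)
    return head + tail

# Toy stand-in scorer: word overlap instead of a real cross-encoder.
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

print(rerank_top_n("sparse retrieval",
                   ["dense only", "sparse retrieval doc", "other"],
                   overlap, n=2))
```

Capping the rerank depth at n is the main lever for the quality/latency trade-off discussed below: cross-encoder cost grows linearly with n, so n is tuned against the latency budget.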

The quality ↔ latency/cost trade-off is particularly important in RAG applications and under strict SLAs (see p95/p99 tail latencies).[34]

Evaluation on Benchmarks

  • BEIR. A unified set of heterogeneous collections/tasks for zero-shot/out-of-domain evaluation of retrievers (e.g., TREC‑COVID, NFCorpus, NQ, HotpotQA, FiQA‑2018, DBPedia-entity, ArguAna, Webis-Touché-2020, FEVER/Climate-FEVER, Scidocs, SciFact, CQADupStack, etc.).[35]
  • TREC Deep Learning / MS MARCO. Classic resources for training/evaluating retrievers and rerankers on large-scale data.[36][37][38]
  • Quality metrics. nDCG@k, Recall@k, MRR; for performance: latency p50/p95/p99, QPS; for operations: memory/cost (CPU/GPU, index).[39][40]
  • Ablation studies. It is recommended to measure the contribution of each channel/weight and the sensitivity to parameters like k in RRF and α in linear combination; evaluate robustness to paraphrasing and OOD shifts.[41][42]
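The ranking-quality metrics above are simple to compute; here is a minimal sketch of nDCG@k with linear graded gains (some evaluation tools use the exponential gain 2^rel − 1 instead, so numbers are only comparable within one convention):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Relevance grades of the top-3 results, in retrieved order (toy data).
print(ndcg_at_k([3, 2, 1], 3))   # ideal order
print(ndcg_at_k([1, 2, 3], 3))   # reversed order scores lower
```

Recall@k and MRR follow the same pattern of judging a ranked prefix against the relevance labels.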

Engineering Aspects and Production Practices

  • Indexes and ANN. FAISS (Flat/HNSW/IVF-PQ), HNSW, and ScaNN for MIPS/cosine similarity.[43][44][45]
  • IR Stack. Lucene/Anserini/Pyserini for sparse, dense, and hybrid pipelines; 'turnkey' reproducibility on BEIR.[46][47]
  • Vector Databases and Search Engines. Qdrant, Weaviate, pgvector/PostgreSQL, Vespa, and Elasticsearch/OpenSearch have native modes for hybrid search (BM25F+vector) and/or RRF/linear combination.[48][49][50][51][52]
  • RAG Pattern. Architecture: retrieval → fusion → rerank → LLM context with token limits and source tracing.[53]
  • Index updates, deduplication, tokenization. It is important to align tokenization between BM25 and the vectorizer; scores should be calibrated (normalized/scaled) before fusion.[54]
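The dense-channel interface can be sketched with an exact brute-force inner-product search; this is only a stand-in with the same inputs/outputs that an ANN index (FAISS IVF-PQ, HNSW, ScaNN) exposes at scale:

```python
import numpy as np

def build_index(doc_vecs):
    """Brute-force inner-product 'index': just the stacked document vectors.
    At scale this is replaced by an ANN structure with the same interface."""
    return np.asarray(doc_vecs, dtype=np.float32)

def search(index, q_vec, top_k=5):
    """Return (doc_ids, scores) of the top_k documents by inner product."""
    sims = index @ np.asarray(q_vec, dtype=np.float32)
    order = np.argsort(-sims)[:top_k]
    return [int(i) for i in order], sims[order].tolist()

index = build_index([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # toy embeddings
print(search(index, [1.0, 0.0], top_k=2))
```

Swapping brute force for an ANN index trades exactness for sublinear query time; the fusion and reranking stages downstream are unaffected by the choice.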

Limitations and Open Questions

  • Transferability and Multilingualism. Dense models (GTR/E5) improve transfer but are sensitive to domain/language; sparse models (SPLADE) are often more robust on OOD.[55][56]
  • Integration with LLMs and Hallucinations. Hybrid retrieval reduces omissions and noise in RAG contexts but does not completely eliminate hallucinations, requiring strict rerankers and source filtering.[57]
  • Cost and Privacy. Storage of multi-vector indexes, compression, encryption, and on-premise stacks; TCO assessment.
  • Trends. HyDE/doc2query/PRF as document/query expansion;[58][59] learning to blend (per-query α), more efficient late-interaction models (PLAID/WARP), long documents, and multi-vector indexes.[60][61]

Comparative Table of Methods

As of 2025-09-10 (example on the BEIR trec-covid collection; nDCG@10 / Recall@100):[62]

Comparison of methods on trec-covid
Method | Type | Concept/Model | Fusion scheme | Reranker | nDCG@10 / R@100 | Latency (rel.) | Sources
BM25 | sparse | exact term matching (PRF/BM25) | – | – | 0.595 / 0.109 | very low | [63][64]
SPLADE++ (ED) | sparse (learned) | sparse term expansion/weighting | – | – | 0.727 / 0.128 | low–medium | [65][66]
Contriever (MS MARCO FT) | dense | bi-encoder with contrastive learning | – | – | 0.596 / 0.091 | medium | [67][68]
BGE-base-en-v1.5 | dense | strong general-purpose embedder | – | – | 0.781 / 0.141 | medium | [69]
Cohere embed-english-v3.0 | dense | production-grade text embedding model | – | – | 0.818 / 0.159 | medium | [70]
BM25 + dense (e.g., BM25+BGE) | hybrid | parallel retrieval + list fusion | RRF (k ≈ 60) or weighted combination | optional: MonoT5/ColBERT | varies by implementation; typically above the best single channel | medium | [71][72][73]

Note: The last row illustrates the scheme; exact numbers depend on the choice of embedder, normalization, and fusion parameters (see sources and reproducible Pyserini scripts).

Literature

  • Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978‑0521865715.
  • Robertson, S., Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3(4):333–389. DOI:10.1561/1500000019.
  • Lin, J. et al. (2021). Pyserini: A Python Toolkit for Reproducible IR. SIGIR.
  • Järvelin, K., Kekäläinen, J. (2002). Cumulated Gain‑Based Evaluation of IR Techniques. Information Retrieval 6:241–256. DOI:10.1023/A:1016043826386.
  • Dean, J., Barroso, L.A. (2013). The Tail at Scale. CACM 56(2):74–80. DOI:10.1145/2408776.2408794.

Notes

  1. Robertson, S., Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. DOI:10.1561/1500000019.
  2. Cormack, G.V., Clarke, C.L.A., Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009, 758–759. PDF.
  3. Bruch, S., Gai, S., Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS 42(1):1–35. DOI:10.1145/3596512 • arXiv:2210.11934.
  4. Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978‑0521865715 (see chapters on TF‑IDF, evaluation, and the vocabulary mismatch problem).
  5. Izacard, G. et al. (2022). Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever). TACL 10:1089–1108. arXiv:2112.09118.
  6. Wang, L. et al. (2022/2024). Text Embeddings by Weakly‑Supervised Contrastive Pre‑training (E5). arXiv:2212.03533.
  7. Robertson, S., Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. DOI:10.1561/1500000019.
  8. Formal, T., Piwowarski, B., Clinchant, S. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. arXiv:2107.05720.
  9. Formal, T. et al. (2022). Making Sparse Neural IR Models More Effective. Findings of EMNLP. arXiv:2205.04733.
  10. Formal, T. et al. (2024). SPLADE‑v3: New baselines for SPLADE. arXiv:2403.06789.
  11. Lin, J., Ma, X. (2021). A Few Brief Notes on DeepImpact, COIL, and uniCOIL. arXiv:2106.14807.
  12. Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open‑Domain QA. EMNLP. arXiv:2004.04906.
  13. Xiong, L. et al. (2021). Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE). ICLR. arXiv:2007.00808.
  14. Izacard, G. et al. (2022). TACL. arXiv:2112.09118.
  15. Ni, J. et al. (2021/2022). Large Dual Encoders Are Generalizable Retrievers (GTR). EMNLP. arXiv:2112.07899.
  16. Wang, L. et al. (2022/2024). arXiv:2212.03533.
  17. Khattab, O., Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR. arXiv:2004.12832.
  18. Santhanam, K. et al. (2022). ColBERTv2 & PLAID. NAACL/ArXiv. arXiv:2112.01488; arXiv:2205.09707.
  19. Scheerer, J.L. et al. (2025). WARP: An Efficient Engine for Multi‑Vector Retrieval. arXiv:2501.17788.
  20. Lin, J. et al. (2021). Pyserini: A Python Toolkit for Reproducible IR with Sparse and Dense Representations. SIGIR. PDF.
  21. Cormack, G.V., Clarke, C.L.A., Büttcher, S. (2009). SIGIR. PDF.
  22. Elastic Docs. Reciprocal Rank Fusion. (accessed 2025‑09‑10). elastic.co/docs/.../reciprocal-rank-fusion.
  23. OpenSearch Docs. Score ranker processor (RRF). (accessed 2025‑09‑10). docs.opensearch.org/.../score-ranker-processor/.
  24. Fox, E.A., Shaw, J.A. (1994). Combination of Multiple Searches. TREC‑2, NIST SP 500‑215, 243–252. PDF.
  25. Lee, J.H. (1997). Analyses of Multiple Evidence Combination. SIGIR, 267–276. DOI:10.1145/258525.258587.
  26. Hsu, D.F., Taksa, I. (2005). Comparing Rank and Score Combination Methods for Data Fusion in IR. (Tech. report). PDF.
  27. Bruch, S., Gai, S., Ingber, A. (2023). TOIS. DOI:10.1145/3596512.
  28. Hsu, D.F., Taksa, I. (2005). See above.
  29. Bruch, S., Gai, S., Ingber, A. (2023). TOIS. DOI:10.1145/3596512.
  30. Nogueira, R., Cho, K. (2019). Passage Re‑ranking with BERT. arXiv:1901.04085.
  31. Nogueira, R., Jiang, Z., Lin, J. (2020). Document Ranking with a Pretrained Sequence‑to‑Sequence Model (MonoT5). Findings of EMNLP. arXiv:2003.06713.
  32. Santhanam, K. et al. (2022). arXiv:2205.09707.
  33. Scheerer, J.L. et al. (2025). arXiv:2501.17788.
  34. Dean, J., Barroso, L.A. (2013). The Tail at Scale. CACM 56(2):74–80. DOI:10.1145/2408776.2408794.
  35. Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero‑shot Evaluation of IR Models. NeurIPS Datasets & Benchmarks. arXiv:2104.08663.
  36. Craswell, N. et al. (2020). Overview of the TREC 2019 Deep Learning Track. arXiv:2003.07820.
  37. Craswell, N. et al. (2021). Overview of the TREC 2020 Deep Learning Track. arXiv:2102.07662.
  38. Bajaj, P. et al. (2016). MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268.
  39. Järvelin, K., Kekäläinen, J. (2002). Cumulated Gain‑Based Evaluation of IR Techniques. Information Retrieval 6:241–256. DOI:10.1023/A:1016043826386.
  40. Dean, J., Barroso, L.A. (2013). CACM. DOI:10.1145/2408776.2408794.
  41. Bruch, S. et al. (2023). DOI:10.1145/3596512.
  42. Ni, J. et al. (2021/2022). arXiv:2112.07899.
  43. Johnson, J., Douze, M., Jégou, H. (2017). Billion‑scale Similarity Search with GPUs (FAISS). arXiv:1702.08734.
  44. Malkov, Y., Yashunin, D. (2020). HNSW. IEEE TPAMI 42(4):824–836. DOI:10.1109/TPAMI.2018.2889473.
  45. Guo, R. et al. (2020). ScaNN: Efficient Vector Similarity Search at Scale. arXiv:1908.10396.
  46. Yang, P., Fang, H., Lin, J. (2018). Anserini: Reproducible IR Research with Lucene. JDIQ 10(4):1–20. DOI:10.1145/3239571.
  47. Lin, J. et al. (2021). SIGIR. PDF.
  48. Qdrant Docs. Hybrid queries (RRF, DBSF). (accessed 2025‑09‑10). qdrant.tech/.../hybrid-queries/.
  49. Weaviate Docs. Hybrid search. (accessed 2025‑09‑10). docs.weaviate.io/weaviate/search/hybrid.
  50. pgvector GitHub. (accessed 2025‑09‑10). github.com/pgvector/pgvector.
  51. Vespa Docs. Hybrid Text Search Tutorial. (accessed 2025‑09‑10). docs.vespa.ai/.../hybrid-search.html.
  52. Elastic Docs. Reciprocal Rank Fusion. (accessed 2025‑09‑10). elastic.co/docs/.../rrf.
  53. Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
  54. Hsu, D.F., Taksa, I. (2005). See above.
  55. Ni, J. et al. (2021/2022). arXiv:2112.07899.
  56. Formal, T. et al. (2021, 2022, 2024). arXiv:2107.05720; 2205.04733; 2403.06789.
  57. Lewis, P. et al. (2020). arXiv:2005.11401.
  58. Gao, L. et al. (2023). Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE). ACL. arXiv:2212.10496.
  59. Nogueira, R. et al. (2019). Document Expansion by Query Prediction. arXiv:1904.08375; docTTTTTquery. PDF.
  60. Santhanam, K. et al. (2022). arXiv:2205.09707.
  61. Scheerer, J.L. et al. (2025). arXiv:2501.17788.
  62. Pyserini BEIR Regressions (accessed 2025‑09‑10): results for trec-covid for BM25/SPLADE/Contriever/BGE/Cohere. castorini.github.io/pyserini/2cr/beir.html.
  63. Robertson, S., Zaragoza, H. (2009). DOI:10.1561/1500000019.
  64. Pyserini BEIR. See link above.
  65. Formal, T. et al. (2021, 2022). arXiv:2107.05720; 2205.04733.
  66. Pyserini BEIR.
  67. Izacard, G. et al. (2022). arXiv:2112.09118.
  68. Pyserini BEIR.
  69. Pyserini BEIR.
  70. Pyserini BEIR.
  71. Cormack et al. (2009). SIGIR. RRF.
  72. Bruch et al. (2023). TOIS.
  73. Elastic/OpenSearch RRF Docs.