RAG patterns
RAG Patterns are a set of architectural and methodological approaches for building Retrieval-Augmented Generation (RAG) systems. These patterns are designed to address fundamental problems of large language models (LLMs), such as hallucinations, outdated knowledge, and a lack of domain specificity, by integrating LLMs with external, dynamically accessible data sources[1]. The evolution of RAG has progressed from simple linear pipelines to complex modular and agentic systems[2].
Core RAG Patterns
As the technology has evolved, numerous RAG patterns have emerged, each addressing specific challenges and involving trade-offs between quality, speed, and cost.
- Classic RAG — the basic approach where a user query is vectorized to find relevant fragments (chunks) in a vector database; the retrieved chunks are then fed into an LLM along with the question to generate an answer[1].
- Multi-Query RAG — the LLM generates several paraphrased or refined versions of the original query; a search is performed for all variants, and the results are merged, which increases recall[3].
- HyDE (Hypothetical Document Embeddings) — addresses the "semantic gap" between a short query and long documents. The LLM first generates a "hypothetical" answer document, and its embedding is then used for search, often improving retrieval quality[4].
- Hybrid Retrieval — a combination of semantic (vector) search and lexical (BM25) search. Hybrid schemes have become standard for production systems: vector search covers semantic matches, while BM25 finds exact terms, IDs, or acronyms; results are combined through fusion[5][6][7].
- Re-ranking — a two-stage process: a fast retriever returns a set of candidates (e.g., top 100), then a cross-encoder (or another reranker) recalculates relevance and selects the best ones (e.g., top 5) for the LLM[8][9].
- Query Routing — in systems with multiple heterogeneous data sources (different indexes, databases, APIs), the query is directed to the best source by a router (an LLM-selector or a classifier); includes fallback strategies[10].
- Agentic/Web RAG — the LLM acts as an agent: it decomposes complex questions, plans iterations, and uses tools (vector search, web search) with feedback. A typical implementation is the ReAct paradigm[11]; for web-oriented collection and mandatory citation, see WebGPT[12].
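The fusion step of Hybrid Retrieval is commonly implemented with Reciprocal Rank Fusion (RRF). A minimal sketch in plain Python — the document IDs and ranked lists below are invented for illustration:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    ranked_lists: ranked result lists (best first), items are doc IDs.
    k=60 is the constant from the original RRF paper; it damps the
    influence of top ranks so no single list dominates.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 found the exact ID; vector search found semantic matches.
bm25_hits = ["doc_ID42", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_ID42"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

Documents that appear near the top of both lists (here `doc_a`) rise above documents that rank highly in only one, which is exactly why hybrid schemes recover both semantic matches and exact terms.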
Related and Emerging Paradigms
- GraphRAG — uses a knowledge graph as both a source and a context selection mechanism; search traverses the structure of relationships between entities and text, improving interpretability and performance on multi-hop questions[13][14].
- MM-RAG (Multimodal RAG) — works with text and visual sources (scans, diagrams, tables). Example: VisRAG demonstrates VLM-oriented retrieval and generation on multimodal documents[15].
- Context Packing — methods for integrating retrieved chunks into the prompt: Stuff, Map-Reduce, Refine, Tree-of-Chunks (RAPTOR)[16].
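The Map-Reduce packing strategy can be sketched as follows; `call_llm`, the prompts, and the chunk contents are stand-ins for a real model call, not any library's API:

```python
def map_reduce_answer(question, chunks, call_llm, batch_size=3):
    """Map: answer the question over each batch of chunks (so every call
    fits in the context window). Reduce: combine the partial answers."""
    partials = []
    for i in range(0, len(chunks), batch_size):
        batch = "\n\n".join(chunks[i:i + batch_size])
        partials.append(call_llm(f"Using only this context, answer '{question}':\n{batch}"))
    combined = "\n".join(partials)
    return call_llm(f"Combine these partial answers to '{question}':\n{combined}")

# Stub model that just counts calls, to make the call pattern visible.
calls = []
def stub_llm(prompt):
    calls.append(prompt)
    return f"partial-{len(calls)}"

# 7 chunks with batch_size=3 -> 3 map calls plus 1 reduce call.
answer = map_reduce_answer("What is RAG?", [f"chunk {i}" for i in range(7)], stub_llm)
```

Stuff is the degenerate case (one call with everything); Refine replaces the reduce step with sequential revision of a running answer.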
Comparative Table of Patterns
| Pattern | When to Use | Impact on Quality | Cost / Latency | Risks and Limitations |
|---|---|---|---|---|
| Classic RAG | PoCs and simple Q&A over a homogeneous knowledge base | Baseline level; highly dependent on embeddings[1] | Low | Sensitivity to wording; risk of irrelevant context |
| Hybrid Retrieval | In most production scenarios; many codes, acronyms, or IDs | Increases recall; covers exact terms[5][6][7] | Low/Medium | Tuning fusion weights; requires two indexes |
| Re-ranking | Critical when high precision is important | Significant boost in precision for top-k[8][9] | Medium/High | Additional latency/cost |
| Multi-Query | Short or multi-faceted queries | Increases recall[3] | Medium | Redundant or noisy paraphrases |
| HyDE | Short or ambiguous queries with a large "semantic gap" | Improves zero-shot retrieval quality[4] | Medium | Depends on the quality of the "hypothetical" text |
| Query Routing | Multiple sources (doc base, SQL, API, web) | Improves relevance by selecting the correct source[10] | Medium | Routing error = search failure |
| Agentic/Web RAG | Complex, exploratory, multi-step queries | Solves tasks beyond a linear pipeline[11][12] | High | Complexity, risk of loops; requires guardrails |
Practical Implementation and Architecture
Implementation Stages
- Proof of Concept (PoC): Start with Classic RAG on a limited but representative dataset to validate embedding quality and basic retrieval[1].
- Minimum Viable Product (MVP): Implement Hybrid Retrieval and Re-ranking as they offer the best effort-to-impact ratio[5][8].
- Production: Add query transformations (HyDE, Multi-Query) and Query Routing if needed; set up observability (logging for retrieval, reranking, and responses) and A/B testing[3][10].
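A Classic RAG PoC from the first stage can be sketched end-to-end. The bag-of-words `embed` function below is a toy stand-in for a real embedding model, and the chunks are invented:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real PoC would call a trained
    # embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    # Rank chunks by similarity to the query and keep the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "RAG combines retrieval with generation.",
    "BM25 is a lexical ranking function.",
    "Vector databases store embeddings.",
]
context = retrieve("how does retrieval augmented generation work", chunks)
# The retrieved chunks are packed into the prompt alongside the question.
prompt = "Answer using only this context:\n" + "\n".join(context)
```

The pattern's weaknesses follow directly from this shape: everything depends on the embedding placing query and chunk nearby, which is why sensitivity to wording appears in the table above.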
Key Components
- Chunking: One of the most critical factors for quality. Naive fixed-size chunking often breaks semantic units. Structure-aware (based on markup) or recursive splitters (paragraph → sentence → word) are recommended[17][18].
- Embeddings and Metadata: Store metadata with each chunk, such as document_id, page/section, title, and dates; this is essential for filtering and accurate source citation.
- Hybrid Retrieval and Re-ranking: Combine BM25 and vector search via score fusion (e.g., Reciprocal Rank Fusion), then apply a cross-encoder to re-rank a small pool of candidates[5][6][8].
- Context Packing: Choose Map-Reduce, Refine, or Tree-of-Chunks for long corpora[16][18].
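The recursive splitting idea (paragraph → sentence → word) can be sketched as follows. This mirrors the approach of recursive splitters, not any specific library's implementation; the separators and length limit are illustrative:

```python
def recursive_split(text, max_len=200, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces that
    are still too long using progressively finer separators."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return [c for c in chunks if c.strip()]

doc = ("First paragraph about RAG. It has two sentences.\n\n"
       "Second paragraph, short.")
parts = recursive_split(doc, max_len=60)
```

Because coarse separators are tried first, semantic units (paragraphs, then sentences) are kept intact whenever they fit, which is the advantage over naive fixed-size chunking.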
Common Mistakes (Anti-Patterns)
- Vector-only search without BM25 → fails on codes, IDs, or acronyms[5][7].
- Chunks too large or too small → loss of context or "diluted" embeddings[17].
- No re-ranking in production → the LLM receives noisy context[8].
- No observability and source tracing → impossible to debug the causes of errors (see RAG evaluation).
Quality Evaluation and Metrics
Evaluation is conducted at the retrieval level (offline) and the end-to-end generation level.
Retriever Metrics
- Hit Rate, Recall@k, MRR — coverage and position of relevant documents.
- Context Precision & Recall — measure whether the retrieved context is relevant (free of "noise") and whether it covers all the necessary information (implemented in RAGAS)[19].
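The rank-based retriever metrics are simple to compute directly; a sketch with invented document IDs and relevance judgments:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(runs):
    """Mean Reciprocal Rank over (retrieved, relevant) pairs: the average
    of 1/rank of the first relevant hit per query (0 if none retrieved)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Two toy queries: first relevant hit at rank 1 and rank 2 -> MRR = (1 + 0.5) / 2.
runs = [
    (["d1", "d2", "d3"], {"d1"}),
    (["d4", "d5", "d6"], {"d5"}),
]
score = mrr(runs)
cov = recall_at_k(["d1", "d2"], {"d1", "d9"}, k=2)
```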
Generator Metrics (End-to-End)
- Faithfulness / Groundedness — alignment of the answer with the provided context.
- Answer Relevancy — alignment with the original question.
Open-source frameworks are used to automate these metrics: RAGAS, TruLens (the RAG triad: context relevance, groundedness, answer relevance), and DeepEval[20][21].
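As a purely illustrative proxy (the frameworks above use LLM judges to verify claim-by-claim support, not token overlap), the intuition behind groundedness can be shown with a crude overlap score; the example texts are invented:

```python
import re

def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_groundedness(answer, context):
    """Crude proxy: share of answer tokens that also occur in the context.
    Real evaluators (RAGAS, TruLens) instead ask an LLM judge whether each
    claim in the answer is supported by the retrieved context."""
    ans = token_set(answer)
    if not ans:
        return 0.0
    return len(ans & token_set(context)) / len(ans)

ctx = "RAG systems retrieve documents and generate grounded answers."
good = overlap_groundedness("RAG systems generate grounded answers", ctx)
bad = overlap_groundedness("The moon is made of cheese", ctx)
```

A fully grounded answer scores near 1.0, while an answer with no basis in the context scores near 0.0; LLM judges generalize this to paraphrases that share no tokens.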
See Also
- Retrieval-Augmented Generation (RAG)
- Vector database
- Embedding
- AI agent
- GraphRAG
- MM-RAG
- LLM Evaluation and Benchmarks
Notes
1. Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
2. Fan, W., Ding, Y., et al. (2024). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. KDD. DOI:10.1145/3637528.3671470; arXiv:2405.06211.
3. LangChain Docs. MultiQueryRetriever. https://python.langchain.com/docs/how_to/MultiQueryRetriever/
4. Gao, L., Ma, X., Lin, J., Callan, J. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496; ACL Anthology: 2023.acl-long.99.
5. Weaviate Docs. Hybrid search (BM25 + vector). https://docs.weaviate.io/weaviate/concepts/search/hybrid-search
6. Qdrant Docs. Hybrid Queries. https://qdrant.tech/documentation/concepts/hybrid-queries/
7. Milvus Docs. Full-Text Search and Hybrid Search. https://milvus.io/docs/full-text-search.md; https://milvus.io/docs/hybrid_search_with_milvus.md
8. Nogueira, R., Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
9. Cohere Docs. Rerank best practices. https://docs.cohere.com/docs/reranking-best-practices
10. LlamaIndex Docs. Routing (query routers/selectors). https://docs.llamaindex.ai/en/stable/module_guides/querying/router/
11. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.
12. Nakano, R., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332.
13. Microsoft Research Blog (2024). GraphRAG: Unlocking LLM discovery on narrative private data. https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
14. Microsoft Research. Project GraphRAG. https://www.microsoft.com/en-us/research/project/graphrag/
15. Yu, S., et al. (2024). VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. arXiv:2410.10594; OpenReview: zG459X3Xge.
16. Sarthi, P., et al. (2024). RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. arXiv:2401.18059.
17. LangChain Docs. RecursiveCharacterTextSplitter. https://python.langchain.com/docs/how_to/recursive_text_splitter/
18. LlamaIndex Docs. HierarchicalNodeParser and Tree Summarization. https://docs.llamaindex.ai/en/stable/api/llama_index.core.node_parser.HierarchicalNodeParser.html; https://docs.llamaindex.ai/en/stable/examples/low_level/response_synthesis/
19. Es, S., et al. (2024). RAGAs: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 (System Demonstrations). https://aclanthology.org/2024.eacl-demo.16/
20. TruLens Docs. The RAG Triad. https://www.trulens.org/getting_started/core_concepts/rag_triad/
21. DeepEval (GitHub). The LLM Evaluation Framework. https://github.com/confident-ai/deepeval