RAG patterns
RAG Patterns are a set of architectural and methodological approaches for building Retrieval-Augmented Generation (RAG) systems. These patterns are designed to address fundamental problems of large language models (LLMs), such as hallucinations, outdated knowledge, and a lack of domain specificity, by integrating LLMs with external, dynamically accessible data sources[1]. The evolution of RAG has progressed from simple linear pipelines to complex modular and agentic systems[2].
Core RAG Patterns
As the technology has evolved, numerous RAG patterns have emerged, each addressing specific challenges and involving trade-offs between quality, speed, and cost.
- Classic RAG — the basic approach where a user query is vectorized to find relevant fragments (chunks) in a vector database; the retrieved chunks are then fed into an LLM along with the question to generate an answer[1].
- Multi-Query RAG — the LLM generates several paraphrased or refined versions of the original query; a search is performed for all variants, and the results are merged, which increases recall[3].
- HyDE (Hypothetical Document Embeddings) — addresses the "semantic gap" between a short query and long documents. The LLM first generates a "hypothetical" answer document, and its embedding is then used for search, often improving retrieval quality[4].
- Hybrid Retrieval — a combination of semantic (vector) search and lexical (BM25) search. Hybrid schemes have become standard for production systems: vector search covers semantic matches, while BM25 finds exact terms, IDs, or acronyms; results are combined through fusion[5][6][7].
- Re-ranking — a two-stage process: a fast retriever returns a set of candidates (e.g., top 100), then a cross-encoder (or another reranker) recalculates relevance and selects the best ones (e.g., top 5) for the LLM[8][9].
- Query Routing — in systems with multiple heterogeneous data sources (different indexes, databases, APIs), the query is directed to the best source by a router (an LLM-selector or a classifier); includes fallback strategies[10].
- Agentic/Web RAG — the LLM acts as an agent: it decomposes complex questions, plans iterations, and uses tools (vector search, web search) with feedback. A typical implementation is the ReAct paradigm[11]; for web-oriented collection and mandatory citation, see WebGPT[12].
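The fusion step of Hybrid Retrieval is commonly implemented with Reciprocal Rank Fusion (RRF). A minimal sketch in plain Python — the document IDs and ranked lists below are invented for illustration:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    ranked_lists: ranked result lists (best first), items are doc IDs.
    k=60 is the constant from the original RRF paper; it damps the
    influence of top ranks so no single list dominates.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 found the exact ID; vector search found semantic matches.
bm25_hits = ["doc_ID42", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_ID42"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

Documents that appear near the top of both lists (here `doc_a`) rise above documents that rank highly in only one, which is exactly why hybrid schemes recover both semantic matches and exact terms.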
Related and Emerging Paradigms
- GraphRAG — uses a knowledge graph as both a source and a context selection mechanism; search traverses the structure of relationships between entities and text, improving interpretability and performance on multi-hop questions[13][14].
- MM-RAG (Multimodal RAG) — works with text and visual sources (scans, diagrams, tables). Example: VisRAG demonstrates VLM-oriented retrieval and generation on multimodal documents[15].
- Context Packing — methods for integrating retrieved chunks into the prompt: Stuff, Map-Reduce, Refine, Tree-of-Chunks (RAPTOR)[16].
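The Map-Reduce packing strategy can be sketched as follows; `call_llm`, the prompts, and the chunk contents are stand-ins for a real model call, not any library's API:

```python
def map_reduce_answer(question, chunks, call_llm, batch_size=3):
    """Map: answer the question over each batch of chunks (so every call
    fits in the context window). Reduce: combine the partial answers."""
    partials = []
    for i in range(0, len(chunks), batch_size):
        batch = "\n\n".join(chunks[i:i + batch_size])
        partials.append(call_llm(f"Using only this context, answer '{question}':\n{batch}"))
    combined = "\n".join(partials)
    return call_llm(f"Combine these partial answers to '{question}':\n{combined}")

# Stub model that just counts calls, to make the call pattern visible.
calls = []
def stub_llm(prompt):
    calls.append(prompt)
    return f"partial-{len(calls)}"

# 7 chunks with batch_size=3 -> 3 map calls plus 1 reduce call.
answer = map_reduce_answer("What is RAG?", [f"chunk {i}" for i in range(7)], stub_llm)
```

Stuff is the degenerate case (one call with everything); Refine replaces the reduce step with sequential revision of a running answer.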
Comparative Table of Patterns
| Pattern | When to Use | Impact on Quality | Cost / Latency | Risks and Limitations |
|---|---|---|---|---|
| Classic RAG | PoCs and simple Q&A over a homogeneous knowledge base | Baseline level; highly dependent on embeddings[1] | Low | Sensitivity to wording; risk of irrelevant context |
| Hybrid Retrieval | In most production scenarios; many codes, acronyms, or IDs | Increases recall; covers exact terms[5][6][7] | Low/Medium | Tuning fusion weights; requires two indexes |
| Re-ranking | Critical when high precision is important | Significant boost in precision for top-k[8][9] | Medium/High | Additional latency/cost |
| Multi-Query | Short or multi-faceted queries | Increases recall[3] | Medium | Redundant or noisy paraphrases |
| HyDE | Short or ambiguous queries with a large "semantic gap" | Improves zero-shot retrieval quality[4] | Medium | Depends on the quality of the "hypothetical" text |
| Query Routing | Multiple sources (doc base, SQL, API, web) | Improves relevance by selecting the correct source[10] | Medium | Routing error = search failure |
| Agentic/Web RAG | Complex, exploratory, multi-step queries | Solves tasks beyond a linear pipeline[11][12] | High | Complexity, risk of loops; requires guardrails |
Practical Implementation and Architecture
Implementation Stages
- Proof of Concept (PoC): Start with Classic RAG on a limited but representative dataset to validate embedding quality and basic retrieval[1].
- Minimum Viable Product (MVP): Implement Hybrid Retrieval and Re-ranking as they offer the best effort-to-impact ratio[5][8].
- Production: Add query transformations (HyDE, Multi-Query) and Query Routing if needed; set up observability (logging for retrieval, reranking, and responses) and A/B testing[3][10].
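A Classic RAG PoC from the first stage can be sketched end-to-end. The bag-of-words `embed` function below is a toy stand-in for a real embedding model, and the chunks are invented:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real PoC would call a trained
    # embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    # Rank chunks by similarity to the query and keep the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "RAG combines retrieval with generation.",
    "BM25 is a lexical ranking function.",
    "Vector databases store embeddings.",
]
context = retrieve("how does retrieval augmented generation work", chunks)
# The retrieved chunks are packed into the prompt alongside the question.
prompt = "Answer using only this context:\n" + "\n".join(context)
```

The pattern's weaknesses follow directly from this shape: everything depends on the embedding placing query and chunk nearby, which is why sensitivity to wording appears in the table above.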
Key Components
- Chunking: One of the most critical factors for quality. Naive fixed-size chunking often breaks semantic units. Structure-aware (based on markup) or recursive splitters (paragraph → sentence → word) are recommended[17][18].
- Embeddings and Metadata: Store metadata with each chunk, such as document_id, page/section, title, and dates; this is essential for filtering and accurate source citation.
- Hybrid Retrieval and Re-ranking: Combine BM25 and vector search via score fusion (e.g., Reciprocal Rank Fusion), then apply a cross-encoder to re-rank a small pool of candidates[5][6][8].
- Context Packing: Choose Map-Reduce, Refine, or Tree-of-Chunks for long corpora[16][18].
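The recursive splitting idea (paragraph → sentence → word) can be sketched as follows. This mirrors the approach of recursive splitters, not any specific library's implementation; the separators and length limit are illustrative:

```python
def recursive_split(text, max_len=200, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces that
    are still too long using progressively finer separators."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return [c for c in chunks if c.strip()]

doc = ("First paragraph about RAG. It has two sentences.\n\n"
       "Second paragraph, short.")
parts = recursive_split(doc, max_len=60)
```

Because coarse separators are tried first, semantic units (paragraphs, then sentences) are kept intact whenever they fit, which is the advantage over naive fixed-size chunking.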
Common Mistakes (Anti-Patterns)
- Vector-only search without BM25 → fails on codes, IDs, or acronyms[5][7].
- Chunks too large or too small → loss of context or "diluted" embeddings[17].
- No re-ranking in production → the LLM receives noisy context[8].
- No observability and source tracing → impossible to debug the causes of errors (see RAG evaluation).
Quality Evaluation and Metrics
Evaluation is conducted at the retrieval level (offline) and the end-to-end generation level.
Retriever Metrics
- Hit Rate, Recall@k, MRR — coverage and position of relevant documents.
- Context Precision & Recall — measure whether the retrieved context is relevant (free of "noise") and whether it covers all the necessary information (implemented in RAGAS)[19].
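The rank-based retriever metrics are simple to compute directly; a sketch with invented document IDs and relevance judgments:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(runs):
    """Mean Reciprocal Rank over (retrieved, relevant) pairs: the average
    of 1/rank of the first relevant hit per query (0 if none retrieved)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Two toy queries: first relevant hit at rank 1 and rank 2 -> MRR = (1 + 0.5) / 2.
runs = [
    (["d1", "d2", "d3"], {"d1"}),
    (["d4", "d5", "d6"], {"d5"}),
]
score = mrr(runs)
cov = recall_at_k(["d1", "d2"], {"d1", "d9"}, k=2)
```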
Generator Metrics (End-to-End)
- Faithfulness / Groundedness — alignment of the answer with the provided context.
- Answer Relevancy — alignment with the original question.
Open-source frameworks are used to automate these metrics: RAGAS, TruLens (the RAG triad: context relevance, groundedness, answer relevance), and DeepEval[20][21].
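As a purely illustrative proxy (the frameworks above use LLM judges to verify claim-by-claim support, not token overlap), the intuition behind groundedness can be shown with a crude overlap score; the example texts are invented:

```python
import re

def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_groundedness(answer, context):
    """Crude proxy: share of answer tokens that also occur in the context.
    Real evaluators (RAGAS, TruLens) instead ask an LLM judge whether each
    claim in the answer is supported by the retrieved context."""
    ans = token_set(answer)
    if not ans:
        return 0.0
    return len(ans & token_set(context)) / len(ans)

ctx = "RAG systems retrieve documents and generate grounded answers."
good = overlap_groundedness("RAG systems generate grounded answers", ctx)
bad = overlap_groundedness("The moon is made of cheese", ctx)
```

A fully grounded answer scores near 1.0, while an answer with no basis in the context scores near 0.0; LLM judges generalize this to paraphrases that share no tokens.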
See Also
- Retrieval-Augmented Generation (RAG)
- Vector database
- Embedding
- AI agent
- GraphRAG
- MM-RAG
- LLM Evaluation and Benchmarks
Notes
1. Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
2. Fan, W., Ding, Y., et al. (2024). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. KDD. DOI:10.1145/3637528.3671470; arXiv:2405.06211.
3. LangChain Docs. MultiQueryRetriever. https://python.langchain.com/docs/how_to/MultiQueryRetriever/
4. Gao, L., Ma, X., Lin, J., Callan, J. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496; ACL Anthology: 2023.acl-long.99.
5. Weaviate Docs. Hybrid search (BM25 + vector). https://docs.weaviate.io/weaviate/concepts/search/hybrid-search
6. Qdrant Docs. Hybrid Queries. https://qdrant.tech/documentation/concepts/hybrid-queries/
7. Milvus Docs. Full-Text Search and Hybrid Search. https://milvus.io/docs/full-text-search.md; https://milvus.io/docs/hybrid_search_with_milvus.md
8. Nogueira, R., Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
9. Cohere Docs. Rerank best practices. https://docs.cohere.com/docs/reranking-best-practices
10. LlamaIndex Docs. Routing (query routers/selectors). https://docs.llamaindex.ai/en/stable/module_guides/querying/router/
11. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.
12. Nakano, R., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332.
13. Microsoft Research Blog (2024). GraphRAG: Unlocking LLM discovery on narrative private data. https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
14. Microsoft Research. Project GraphRAG. https://www.microsoft.com/en-us/research/project/graphrag/
15. Yu, S., et al. (2024). VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. arXiv:2410.10594; OpenReview: zG459X3Xge.
16. Sarthi, P., et al. (2024). RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. arXiv:2401.18059.
17. LangChain Docs. RecursiveCharacterTextSplitter. https://python.langchain.com/docs/how_to/recursive_text_splitter/
18. LlamaIndex Docs. HierarchicalNodeParser and Tree Summarization. https://docs.llamaindex.ai/en/stable/api/llama_index.core.node_parser.HierarchicalNodeParser.html; https://docs.llamaindex.ai/en/stable/examples/low_level/response_synthesis/
19. Es, S., et al. (2024). RAGAs: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 (System Demonstrations). https://aclanthology.org/2024.eacl-demo.16/
20. TruLens Docs. The RAG Triad. https://www.trulens.org/getting_started/core_concepts/rag_triad/
21. DeepEval (GitHub). The LLM Evaluation Framework. https://github.com/confident-ai/deepeval