RAG patterns

From Systems Analysis Wiki

RAG Patterns are a set of architectural and methodological approaches for building Retrieval-Augmented Generation (RAG) systems. These patterns are designed to address fundamental problems of large language models (LLMs), such as hallucinations, outdated knowledge, and a lack of domain specificity, by integrating LLMs with external, dynamically accessible data sources[1]. The evolution of RAG has progressed from simple linear pipelines to complex modular and agentic systems[2].

Core RAG Patterns

As the technology has evolved, numerous RAG patterns have emerged, each addressing specific challenges and involving trade-offs between quality, speed, and cost.

  • Classic RAG — the basic approach where a user query is vectorized to find relevant fragments (chunks) in a vector database; the retrieved chunks are then fed into an LLM along with the question to generate an answer[1].
  • Multi-Query RAG — the LLM generates several paraphrased or refined versions of the original query; a search is performed for all variants, and the results are merged, which increases recall[3].
  • HyDE (Hypothetical Document Embeddings) — addresses the "semantic gap" between a short query and long documents: the LLM first generates a "hypothetical" answer document, and its embedding is then used for the search, often improving retrieval quality[4].
  • Hybrid Retrieval — a combination of semantic (vector) search and lexical (BM25) search. Hybrid schemes have become standard for production systems: vector search covers semantic matches, while BM25 finds exact terms, IDs, or acronyms; results are combined through fusion[5][6][7].
  • Re-ranking — a two-stage process: a fast retriever returns a set of candidates (e.g., top 100), then a cross-encoder (or another reranker) recalculates relevance and selects the best ones (e.g., top 5) for the LLM[8][9].
  • Query Routing — in systems with multiple heterogeneous data sources (different indexes, databases, APIs), the query is directed to the best source by a router (an LLM-selector or a classifier); includes fallback strategies[10].
  • Agentic/Web RAG — the LLM acts as an agent: it decomposes complex questions, plans iterations, and uses tools (vector search, web search) with feedback. A typical implementation is the ReAct paradigm[11]; for web-oriented collection and mandatory citation, see WebGPT[12].
  • GraphRAG — uses a knowledge graph as both a source and a context selection mechanism; search traverses the structure of relationships between entities and text, improving interpretability and performance on multi-hop questions[13][14].
  • MM-RAG (Multimodal RAG) — works with text and visual sources (scans, diagrams, tables). Example: VisRAG demonstrates VLM-oriented retrieval and generation on multimodal documents[15].
  • Context Packing — methods for integrating retrieved chunks into the prompt: Stuff, Map-Reduce, Refine, Tree-of-Chunks (RAPTOR)[16].
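The Classic RAG flow from the list above can be sketched end to end. This is an illustrative toy: a bag-of-words similarity stands in for a real embedding model and vector database, and the `embed`, `cosine`, `retrieve`, and `build_prompt` helpers are invented for the sketch, not a library API:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the top-k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # The retrieved chunks are packed into the prompt alongside the question.
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

chunks = [
    "RAG combines retrieval with generation.",
    "BM25 is a lexical ranking function.",
    "Cats are popular pets.",
]
query = "What does RAG combine?"
prompt = build_prompt(query, retrieve(query, chunks))
```

In a real deployment the retriever, the prompt template, and the final LLM call are each replaced by production components; only the overall shape of the pipeline carries over.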

Comparative Table of Patterns

Comparison of Key RAG Patterns
Pattern | When to Use | Impact on Quality | Cost / Latency | Risks and Limitations
Classic RAG | PoCs and simple Q&A over a homogeneous knowledge base | Baseline level; highly dependent on embeddings[1] | Low | Sensitivity to wording; risk of irrelevant context
Hybrid Retrieval | Most production scenarios; many codes, acronyms, or IDs | Increases recall; covers exact terms[5][6][7] | Low/Medium | Tuning fusion weights; requires two indexes
Re-ranking | Scenarios where high precision is critical | Significant boost in precision for top-k[8][9] | Medium/High | Additional latency/cost
Multi-Query | Short or multi-faceted queries | Increases recall[3] | Medium | Redundant or noisy paraphrases
HyDE | Short or ambiguous queries with a large "semantic gap" | Improves zero-shot retrieval quality[4] | Medium | Depends on the quality of the "hypothetical" text
Query Routing | Multiple sources (doc base, SQL, API, web) | Improves relevance by selecting the correct source[10] | Medium | A routing error means a failed search
Agentic/Web RAG | Complex, exploratory, multi-step queries | Solves tasks beyond a linear pipeline[11][12] | High | Complexity and risk of loops; requires guardrails
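The fusion step behind Hybrid Retrieval is commonly implemented with Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever, not their incompatible raw scores. A minimal sketch; the `doc_*` identifiers are invented for illustration:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).
    # k=60 is the constant commonly used in the RRF literature.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_ids", "doc_api", "doc_intro"]    # lexical: exact IDs/acronyms
vector_hits = ["doc_intro", "doc_ids", "doc_faq"]  # semantic matches
fused = rrf_fuse([bm25_hits, vector_hits])
```

A document that appears near the top of both lists outranks one that appears in only one, which is exactly the behavior hybrid schemes rely on.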

Practical Implementation and Architecture

Implementation Stages

  1. Proof of Concept (PoC): Start with Classic RAG on a limited but representative dataset to validate embedding quality and basic retrieval[1].
  2. Minimum Viable Product (MVP): Implement Hybrid Retrieval and Re-ranking as they offer the best effort-to-impact ratio[5][8].
  3. Production: Add query transformations (HyDE, Multi-Query) and Query Routing if needed; set up observability (logging for retrieval, reranking, and responses) and A/B testing[3][10].
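The two-stage retrieve-then-rerank setup recommended for the MVP stage can be sketched as follows. Here `overlap_score` is a deliberately naive stand-in for a cross-encoder model call; in practice stage 2 would invoke a real reranker over the candidate pool:

```python
def rerank(query: str, candidates: list[str], top_n: int, score_fn) -> list[str]:
    # Stage 2: score each (query, candidate) pair jointly and keep the best top_n.
    # A cross-encoder does this with a model; score_fn is a stand-in here.
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

def overlap_score(query: str, text: str) -> float:
    # Toy relevance signal: shared-token count. A real cross-encoder is far stronger.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

# Stage 1 would return a wide candidate pool (e.g. top 100); stage 2 narrows it.
candidates = [
    "reranking boosts precision",
    "unrelated text",
    "precision and recall tradeoffs",
]
top = rerank("how does reranking affect precision", candidates, top_n=1,
             score_fn=overlap_score)
```

The key design point is the asymmetry: a cheap retriever casts a wide net, and the expensive pairwise scorer only ever sees the small candidate pool.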

Key Components

  • Chunking: One of the most critical factors for quality. Naive fixed-size chunking often breaks semantic units. Structure-aware (based on markup) or recursive splitters (paragraph → sentence → word) are recommended[17][18].
  • Embeddings and Metadata: Store metadata with each chunk, such as document_id, page/section, title, and dates; this is essential for filtering and accurate source citation.
  • Hybrid Retrieval and Re-ranking: Use BM25+vector with fusion (or RRF), followed by a cross-encoder to re-rank a small pool of candidates[5][6][8].
  • Context Packing: Choose Map-Reduce, Refine, or Tree-of-Chunks for long corpora[16][18].
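A minimal sketch of structure-aware chunking with per-chunk metadata, combining two points from the list above. The paragraph-then-sentence fallback and the metadata fields are simplified assumptions for illustration, not a specific library's API:

```python
def chunk_document(text: str, doc_id: str, max_chars: int = 200) -> list[dict]:
    # Structure-aware first pass: split on paragraph boundaries, then fall
    # back to sentence-level splits when a paragraph exceeds max_chars
    # (the recursive paragraph -> sentence idea, without the word level).
    chunks = []
    for i, para in enumerate(p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        pieces = [para] if len(para) <= max_chars else [
            s.strip() + "." for s in para.split(".") if s.strip()
        ]
        for piece in pieces:
            # Metadata travels with every chunk, enabling filtering and citation.
            chunks.append({"text": piece, "document_id": doc_id, "section": i})
    return chunks

doc = "RAG overview.\n\nHybrid retrieval combines BM25 and vectors."
parts = chunk_document(doc, doc_id="kb-001")
```

Production splitters also handle overlap between chunks and markup-aware boundaries (headings, tables), which this sketch omits.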

Common Mistakes (Anti-Patterns)

  • Vector-only search without BM25 → fails on codes, IDs, or acronyms[5][7].
  • Chunks too large or too small → loss of context or "diluted" embeddings[17].
  • No re-ranking in production → the LLM receives noisy context[8].
  • No observability and source tracing → impossible to debug the causes of errors (see RAG evaluation).

Quality Evaluation and Metrics

Evaluation is conducted at the retrieval level (offline) and the end-to-end generation level.

Retriever Metrics

  • Hit Rate, Recall@k, MRR — coverage and position of relevant documents.
  • Context Precision & Recall — measures how much of the retrieved context is relevant (free of "noise") and covers all necessary information (implemented in RAGAS)[19].
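Recall@k and MRR are straightforward to compute offline from a ranked result list against a labeled set of relevant documents; a minimal sketch with invented document IDs:

```python
def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # Fraction of the relevant documents that appear within the top-k results.
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    # Reciprocal rank of the first relevant hit; 0 if none is retrieved.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7"]   # ranked retriever output
relevant = {"d1", "d2"}          # ground-truth labels
# recall@3 = 0.5 (one of two relevant docs found);
# MRR = 0.5 (first relevant hit at rank 2)
```

In practice these are averaged over a query set; MRR in particular averages the reciprocal rank across all evaluated queries.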

Generator Metrics (End-to-End)

  • Faithfulness / Groundedness — alignment of the answer with the provided context.
  • Answer Relevancy — alignment with the original question.

Open-source frameworks are used to automate these metrics: RAGAS, TruLens (the RAG triad: context relevance, groundedness, answer relevance), and DeepEval[20][21].


Notes

  1. Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
  2. Fan, W., Ding, Y., et al. (2024). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. KDD. DOI:10.1145/3637528.3671470; arXiv:2405.06211.
  3. LangChain Docs. MultiQueryRetriever. https://python.langchain.com/docs/how_to/MultiQueryRetriever/
  4. Gao, L., Ma, X., Lin, J., Callan, J. (2023). Precise Zero‑Shot Dense Retrieval without Relevance Labels. ACL 2023. arXiv:2212.10496; ACL Anthology: 2023.acl‑long.99.
  5. Weaviate Docs. Hybrid search (BM25+Vector). https://docs.weaviate.io/weaviate/concepts/search/hybrid-search
  6. Qdrant Docs. Hybrid Queries. https://qdrant.tech/documentation/concepts/hybrid-queries/
  7. Milvus Docs. Full‑Text Search and Hybrid Search. https://milvus.io/docs/full-text-search.md; https://milvus.io/docs/hybrid_search_with_milvus.md
  8. Nogueira, R., Cho, K. (2019). Passage Re‑ranking with BERT. arXiv:1901.04085.
  9. Cohere Docs. Rerank — best practices. https://docs.cohere.com/docs/reranking-best-practices
  10. LlamaIndex Docs. Routing (query routers/selectors). https://docs.llamaindex.ai/en/stable/module_guides/querying/router/
  11. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.
  12. Nakano, R., et al. (2021). WebGPT: Browser‑assisted question‑answering with human feedback. arXiv:2112.09332.
  13. Microsoft Research Blog. GraphRAG: Unlocking LLM discovery on narrative private data. 2024. https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
  14. Microsoft Research. Project GraphRAG. https://www.microsoft.com/en-us/research/project/graphrag/
  15. Yu, S., et al. (2024). VisRAG: Vision‑based Retrieval‑augmented Generation on Multi‑modality Documents. arXiv:2410.10594; OpenReview: zG459X3Xge.
  16. Sarthi, P., et al. (2024). RAPTOR: Recursive Abstractive Processing for Tree‑Organized Retrieval. arXiv:2401.18059.
  17. LangChain Docs. RecursiveCharacterTextSplitter. https://python.langchain.com/docs/how_to/recursive_text_splitter/
  18. LlamaIndex Docs. HierarchicalNodeParser and Tree Summarization. https://docs.llamaindex.ai/en/stable/api/llama_index.core.node_parser.HierarchicalNodeParser.html; https://docs.llamaindex.ai/en/stable/examples/low_level/response_synthesis/
  19. Es, S., et al. (2024). RAGAs: Automated Evaluation of Retrieval Augmented Generation. EACL (Demo). https://aclanthology.org/2024.eacl-demo.16/
  20. TruLens Docs. RAG Triad. https://www.trulens.org/getting_started/core_concepts/rag_triad/
  21. DeepEval (GitHub). The LLM Evaluation Framework. https://github.com/confident-ai/deepeval