Retrieval-augmented generation (RAG)

From Systems Analysis Wiki

Retrieval-Augmented Generation (RAG) is an artificial intelligence technique in which a large language model (LLM) is given access to external sources of information to improve the accuracy and reliability of its responses. In other words, before generating an answer, the model retrieves relevant data (e.g., from a document base, website, or database) and uses the retrieved information to formulate its response[1][2]. This approach augments the model's knowledge with up-to-date sources and helps overcome inherent limitations of LLMs, namely their finite "memory" and outdated training data[3]. A RAG system can cite specific documents (for instance, in the form of footnotes) in its generated response, which increases transparency and allows the user to verify the facts[1]. As a result, the risk of hallucinations (cases where the model confidently provides false information) is reduced[1][3]. RAG expands an LLM's knowledge base to a virtually unlimited size and allows models to use the most recent data without retraining[4].

Origins and Development

The idea of combining information retrieval with automatic answer generation emerged long before the advent of modern LLMs. As early as the 1970s, attempts were made to create question-answering systems that searched for answers to given questions in text databases[1]. In the 1990s, the web service Ask Jeeves popularized natural language answer retrieval, and in 2011, the IBM Watson system demonstrated the capabilities of AI by winning the game show Jeopardy! against human contestants[1].

The modern stage of development is associated with the introduction of neural network language models: Retrieval-Augmented Generation as a distinct approach was proposed in 2020 by a group of researchers from Facebook AI Research, University College London, and others, led by Patrick Lewis[1]. In their paper, accepted at NeurIPS 2020, they described the RAG model: a generative seq2seq model (e.g., BART) with differentiable access to an external "non-parametric" knowledge store[5]. The authors used the entire English-language Wikipedia as their external knowledge base, representing it as a vector index (~21 million text fragments) searched using the neural algorithm Dense Passage Retrieval[5]. For an incoming query, the RAG model retrieves the most suitable fragments from the index and adds them to the context for generating a response. This mechanism allowed them to achieve new state-of-the-art results on open-domain knowledge tasks, such as the Natural Questions and WebQuestions benchmarks[2]. It was noted that the RAG model's answers were more specific and factually accurate than those of previous generative approaches, thanks to the synthesis of information from multiple sources at once[2]. Facebook soon open-sourced RAG: the model and an accompanying dataset were released through the Hugging Face Transformers library, allowing developers to easily apply RAG in their projects[2]. Since 2020, the RAG method has rapidly gained popularity: according to Lewis, despite the unflattering acronym, the approach has become widespread, spawning hundreds of research papers and forming the basis of numerous commercial services[1].

How RAG Works

Figure: the basic architecture of Retrieval-Augmented Generation. The retrieval module (left) fetches relevant documents from a knowledge base, after which the generative model (right) forms an answer based on the user's query, taking the retrieved information into account[6]. This approach allows an LLM to rely on up-to-date external data when generating a response. The diagram shows how a user query is converted into a vector and used to find similar text fragments; these are then added to the model's context, "augmenting" its knowledge and improving the accuracy of the response.

A RAG system typically consists of two main components: a retrieval module (retriever) and a generation module (generator)[6]. During the preparation phase, a vector index of the knowledge base is built: all documents (texts) are divided into fragments and converted by an embedding model into numerical vectors, which are stored in a specialized database for subsequent retrieval[6]. When a user query is received, the same embedding model encodes the query into a vector; a nearest neighbor search is then performed in the vector space, selecting the top K most similar fragments from the knowledge index (e.g., K = 5)[6]. These fragments are considered the external context, containing probable facts related to the query's topic.
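The index-then-retrieve loop described above can be sketched with standard-library Python. The bag-of-words "embedding" and the hard-coded fragments below are toy stand-ins for a real embedding model and document store, chosen only to make the pipeline self-contained:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term counts. A production system would
    # use a neural embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preparation phase: split the knowledge base into fragments and index each.
fragments = [
    "RAG retrieves documents before generating an answer.",
    "BART is a sequence-to-sequence transformer model.",
    "Dense Passage Retrieval encodes passages as vectors.",
]
index = [(frag, embed(frag)) for frag in fragments]

def retrieve(query, k=2):
    # Query phase: encode the query with the same model, then take the
    # top-K nearest fragments by cosine similarity.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [frag for frag, _ in ranked[:k]]

print(retrieve("how are passages encoded as vectors")[0])
```

Real deployments replace the linear scan with an approximate nearest-neighbor index (e.g., in a vector database), but the contract is the same: text in, top-K fragments out.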

In the next stage, the assembled context is used by the generative model. The original question, along with the retrieved text fragments, is fed into an LLM (e.g., a seq2seq-type transformer or an instruction-tuned model) to generate the final answer[2]. The language model thus relies not only on its learned (parametric) knowledge but also on the external data provided to it. In the original RAG implementation, the generator was a pre-trained BART model, and the external "memory" was a collection of Wikipedia articles indexed using the DPR method[5].
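The assembly of the generator's input can be illustrated as follows; the prompt template is a common pattern, not a fixed standard, and the function name is chosen for illustration:

```python
def build_prompt(question, fragments):
    # Assemble the generator's input: retrieved fragments first, then the
    # user's question. The exact wording of the template is illustrative.
    context = "\n".join(f"[{i + 1}] {frag}" for i, frag in enumerate(fragments))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What does DPR do?",
    ["Dense Passage Retrieval encodes passages as vectors."],
)
# The resulting string would be passed to the generator (e.g., a seq2seq
# transformer or an instruction-tuned model) to produce the final answer.
print(prompt)
```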

Fusion Approach to Knowledge Combination

An important feature of RAG is how the model combines information from multiple retrieved documents. Instead of simply concatenating all the text, RAG uses an approach known as late fusion: the generative model processes each of the K retrieved fragments in parallel, forms a hypothetical answer for each together with a confidence score, and then aggregates these candidates into a final output[2]. This method allows RAG to synthesize an answer even when no single source contains a direct and complete answer to the question. For example, if the necessary information is distributed across different articles, the model can combine "clues" from several documents into a unified response[2]. (It has been noted that increasing the number of documents used typically improves the completeness of the answer at the cost of a slight loss in text coherence[7].)
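The aggregation step can be reduced to a short sketch. The function name, the candidate answers, and all the numeric scores below are invented for illustration; the point is only that each document's hypothesis is weighted by that document's retrieval score before the candidates are compared:

```python
from collections import defaultdict

def late_fusion(candidates):
    # candidates: (answer, retrieval_score, generation_score) triples,
    # one per retrieved document. All scores here are illustrative.
    totals = defaultdict(float)
    for answer, p_doc, p_answer in candidates:
        # Weight each document's hypothesis by the document's relevance.
        totals[answer] += p_doc * p_answer
    return max(totals, key=totals.get)

# Two documents independently support "Paris"; one suggests "Lyon".
best = late_fusion([
    ("Paris", 0.5, 0.8),
    ("Paris", 0.3, 0.6),
    ("Lyon",  0.2, 0.9),
])
print(best)  # Paris: 0.5*0.8 + 0.3*0.6 = 0.58 vs Lyon: 0.2*0.9 = 0.18
```

This is why an answer supported weakly by several sources can still beat an answer supported strongly by a single source.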

Implementation Variants

Two modifications of the RAG architecture were proposed in the original 2020 paper[6]. In RAG-Sequence mode, the generative model receives a fixed set of retrieved documents and uses them to generate the entire response. In RAG-Token mode, by contrast, dynamic updates are allowed: at each step of generating the next token, the model can perform a new search and load an additional text fragment if needed to refine the answer. Both approaches demonstrate a similarly high level of quality; RAG-Sequence is simpler and faster, while RAG-Token theoretically allows for more diverse information to be considered in long responses[6].
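The two modes differ in where the marginalization over retrieved documents happens. In the notation of Lewis et al. (2020), with retriever $p_\eta$, generator $p_\theta$, and the top-$k$ retrieved documents $z$:

```latex
% RAG-Sequence: one mixture over documents for the whole output sequence.
p_{\text{RAG-Seq}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: each token marginalizes over documents independently,
% so different tokens can draw on different retrieved passages.
p_{\text{RAG-Tok}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```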

Advantages of RAG

  • Timeliness and Factual Accuracy. Connecting to external data allows an LLM to provide more accurate and well-grounded answers based on real information, rather than just the model's parameters. This significantly reduces the risk of outdated or simply fabricated information in the model's response[3][1]. Unlike models with a fixed "knowledge cutoff," RAG can answer questions about events or facts that appeared after the model's training was completed—thanks to its access to fresh data sources[4].
  • Transparency and User Trust. RAG systems can provide citations to the sources of information (e.g., articles, reports, or databases) that formed the basis of the answer[1]. In essence, the model formats its responses like a research paper with footnotes, allowing the user to verify the authenticity of each fact. The presence of cited primary sources increases user trust and facilitates the verification of the information received.
  • Domain-Specific Specialization. Retrieval augmentation makes it relatively easy to adapt the model's operation to a narrow knowledge domain without changing the language model itself. To do this, one simply needs to provide the LLM with a specialized knowledge base on the desired topic—be it medical articles, legal documents, or a company's technical manuals. The model, while remaining general in its parameters, begins to act as an expert in that area because it draws facts from the curated dataset[4][8]. For example, a RAG-based legal assistant can restrict its search scope to a single jurisdictional corpus (the laws of a specific country), ensuring that its answers comply with that particular legislation[8].
  • Flexibility and Knowledge Updatability. In traditional models, adding new knowledge or correcting incorrect facts required retraining (fine-tuning) on an expanded dataset, which is costly in terms of time and resources. RAG solves this problem: to update the model's knowledge, it is sufficient to update the external database or connect additional sources, and the model will immediately begin using the new information[2]. This makes it easy to keep the system up-to-date—in fact, data can be replaced "hot" in real-time without interrupting the model's operation[1].
  • Efficiency and Resource Savings. The RAG approach is often more practical than training massive models that attempt to store all information within their parameters. By integrating retrieval, it's possible to achieve comparable results with a moderately sized model without trying to memorize every fact within the neural network itself[6]. Furthermore, implementing a RAG pipeline is relatively straightforward: ready-made tools (frameworks, libraries) are available, and developers have shown that a basic RAG prototype can be built in just a few lines of code[1]. Thus, RAG reduces the overall cost of AI implementation: instead of training a new model for each task, one only needs to configure the retrieval mechanism and provide appropriate data.
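The "few lines of code" claim above is not much of an exaggeration. The sketch below is a complete toy RAG loop using only the standard library; the policy texts are invented, the term-overlap "retriever" stands in for a vector search, and the template "generator" stands in for an LLM call:

```python
import re
from collections import Counter

# Toy knowledge base: topic -> policy text (contents are invented).
notes = {
    "vacation": "Remote employees accrue 20 vacation days per year.",
    "expenses": "Expense reports are due within 30 days of purchase.",
}

def tokens(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def answer(question):
    q = tokens(question)
    # Retrieval: pick the note sharing the most terms with the question.
    topic, note = max(notes.items(),
                      key=lambda kv: sum((q & tokens(kv[1])).values()))
    # "Generation": quote the retrieved note back with attribution.
    # A real pipeline would pass the note to an LLM here instead.
    return f"According to the {topic} policy: {note}"

print(answer("How many vacation days do remote employees get?"))
```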

Challenges and Limitations of RAG

Despite its clear advantages, Retrieval-Augmented Generation inherits limitations from both its retrieval components and the language models themselves[9]. The following are key problems inherent in RAG systems:

  • Dependence on Retrieval Quality. The generated response will only be as correct as the retrieved data is relevant and reliable. If the retrieval module returns documents that are irrelevant to the question or contain errors, the generative model cannot "correct" these facts—it will generate an answer based on them[8]. Thus, the quality and timeliness of the external knowledge base directly determine the accuracy of RAG. The index must be regularly updated and ranking algorithms tuned to ensure the retrieved documents remain relevant.
  • High Complexity and Resource Intensity. A RAG system requires not only the LLM itself but also the infrastructure for retrieval: storing and updating a large database, indexing, and query execution time. All of this increases computational costs and can reduce response speed compared to using just a language model alone[8]. In the worst case, delays at the retrieval stage or the processing of a very large amount of data can slow the system down. In practice, a balance must be struck between response quality and performance by optimizing the pipeline (e.g., limiting the size of the knowledge base or the depth of the search to keep response times within acceptable limits).
  • Data and Maintenance Requirements. For effective operation, RAG requires high-quality, structured, and accessible external data. The retrieval model may struggle to find useful information if the external knowledge base is poorly organized or contains noise[8]. Moreover, the necessary data is not always open or inexpensive: companies often have to create and maintain their own knowledge bases. This creates additional costs and requires effort to keep the data current (e.g., adding new documents, cleaning up outdated information). A weak point of RAG is its dependence on keeping the knowledge base up-to-date.
  • Persistence of Some LLM Errors. Although RAG significantly reduces the number of confabulations, it is not always possible to completely eliminate incorrect answers[9]. The generative model can still make a logical error or incorrectly summarize information, especially if the provided context is incomplete or contradictory[9]. In effect, RAG shifts the locus of errors: instead of outright fabrications ("hallucinations"), knowledge integration errors become more common; for example, the model might overlook an important fragment or incorrectly link different sources together. Therefore, in critical applications (medicine, law), human oversight is still required to verify and correct the system's responses.

Applications of RAG

The Retrieval-Augmented Generation method has found application in numerous scenarios related to knowledge extraction and use. The following are the main areas where RAG demonstrates the greatest benefit:

  • Question-Answering Systems and Chatbots. RAG enables the creation of virtual assistants and chatbots that answer user questions with high accuracy and can provide links to sources. In customer support, such bots access a company's internal knowledge base (FAQ, help articles) and provide instant answers to customer queries, reducing the workload on human staff[8]. Unlike classic FAQ systems, RAG bots formulate answers in natural language while still "grounding" them with up-to-date data specific to the user's problem.
  • Medicine and Healthcare. A generative model augmented with a specialized medical database (scientific articles, clinical protocols, reference manuals) can serve as an intelligent assistant for a doctor or patient. For example, the system could answer a question about a rare diagnosis by finding recent research on the topic in medical literature[8]. A key advantage of RAG in medicine is the ability to cite primary sources (e.g., clinical trial results), which is essential for gaining the trust of medical professionals. Such systems are used for decision support, symptom checking, training medical students, and more, providing access to the latest medical knowledge.
  • Law and Finance. In legal practice and financial analysis, the accuracy and verifiability of information are especially critical. RAG systems can help professionals quickly find necessary data: for example, a lawyer can use the model to find and cite a legal precedent or a specific law relevant to a current case, while a financial analyst can quickly obtain excerpts from recent economic reports or market news[8]. Each response from the model can include links to specific documents (regulations, reports, articles), which aligns with industry standards and facilitates subsequent manual work by the specialist.
  • Scientific Research and Content Creation. Journalists, researchers, and writers can use RAG to accelerate the search for facts and sources when preparing materials. For example, the model can "gather" information from several reliable publications in response to a query, thereby significantly reducing the time spent on fact-checking and selecting quotes[8]. RAG-based research assistants automatically extract references to relevant works, data from open databases (e.g., statistics from international reports), and even draft translations, allowing authors to focus on the analytical part of their work. Such tools are used in media, academia, and for preparing literature reviews, among other things.
  • Corporate Knowledge and Document Search. In many organizations, a significant amount of valuable information is stored in text documents: regulations, manuals, reports, correspondence, and log files. RAG provides a way to perform interactive search on such unstructured data using natural language. An employee can ask a question ("What does the vacation policy say for remote employees?")—and the model will find the relevant section in an internal document, quote it, and formulate a summary response[1]. This improves efficiency: new employees find answers to their questions faster, support departments get a tool for quickly searching the incident database, and management gains a way to analyze accumulated text data. Major IT companies are already integrating the RAG approach into their corporate solutions: technologies from Microsoft, Google, IBM, AWS, and others are integrating LLMs with search over organizational data[1].

Future Prospects and Further Research

The Retrieval-Augmented Generation method is actively evolving, and its capabilities are expected to expand further in the coming years. One direction is multimodal RAG, where external information can include not only text but also images, audio/video, or even sensor data. Experiments show the promise of combining language models with search over visual databases, which would allow, for example, answering questions about the content of images or videos by relying on their descriptions and associated texts[2]. Another important direction is the simultaneous use of multiple knowledge sources: future RAG systems will be able to combine data from different databases (e.g., Wikipedia, specialized encyclopedias, a user's personal notes) and synthesize answers that take all this diverse information into account[2].

Researchers also face the task of improving the reliability and security of RAG. It is necessary to minimize the risk of spreading biases and errors that may be present in external data, as well as to ensure the consistency of responses. The team that developed the original RAG already took steps in this direction—for example, by initially limiting the knowledge base to only Wikipedia articles as a relatively verified and neutral source[2]. In the future, special filters and document selection methods are planned to ensure that the model receives high-quality context. Additionally, research is focused on improving the retrieval mechanism itself: new ranking and semantic indexing algorithms are being developed that can more accurately understand queries and find relevant information even for complex or ambiguous phrasings.

Finally, there is growing interest in a deeper integration of RAG with the training process of language models. Approaches are already emerging where retrieval mechanisms are used not only at inference time but also during the pre-training or fine-tuning of LLMs. This could further enhance the factual accuracy of models and reduce their dependence on knowledge statically stored in their weights. According to surveys published in 2024, the community sees great potential in the development of the RAG ecosystem: from infrastructure optimization (faster retrieval, reduced memory costs) to the creation of standard benchmarks for evaluating the quality of RAG systems[3]. All of this aims to make generative models more accurate, versatile, and safe when working with constantly updating external knowledge, which is a key step towards a new generation of reliable artificial intelligence.

Literature

  • Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
  • Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906.
  • Guu, K. et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. arXiv:2002.08909.
  • Qu, Y. et al. (2020). RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2010.08191.
  • Izacard, G.; Grave, E. (2021). Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv:2007.01282.
  • Borgeaud, S. et al. (2022). Improving Language Models by Retrieving from Trillions of Tokens. arXiv:2112.04426.
  • Wei, J. et al. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  • Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
  • Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
  • Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
  • Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
  • Madaan, A. et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
  • Yang, Z. et al. (2023). Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning. arXiv:2302.04858.
  • Barnett, S. et al. (2024). Seven Failure Points When Engineering a Retrieval Augmented Generation System. arXiv:2401.05856.
  • Wang, Y. et al. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.
  • Han, H. et al. (2025). Retrieval-Augmented Generation with Graphs (GraphRAG). arXiv:2501.00309.

Notes

  1. "What Is Retrieval-Augmented Generation aka RAG". NVIDIA Blogs.
  2. "Facebook open-sources RAG, an AI model that retrieves documents to answer questions". VentureBeat.
  3. Gao, Yunfan et al. "Retrieval-Augmented Generation for Large Language Models: A Survey". arXiv.
  4. "Applied AI Software Engineering: RAG". Pragmatic Engineer.
  5. Lewis, Patrick et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv.
  6. "How RAG Makes LLMs Smarter". Exxact Blog.
  7. Lewis, Patrick et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv.
  8. "What Is RAG? Use Cases, Limitations, and Challenges". Bright Data Blog.
  9. Barnett, Scott et al. "Seven Failure Points When Engineering a Retrieval Augmented Generation System". arXiv.