MM-RAG (Multimodal RAG)

From Systems Analysis Wiki

MM-RAG (Multimodal Retrieval-Augmented Generation) is an extension of the classic RAG paradigm in which LLMs use not only text but also visual data (images, diagrams, tables, charts) to generate answers. Multimodal retrieval allows for finding and linking evidence across different representations, reducing the risk of hallucinations by grounding the generation in external sources with precise references to page fragments and regions (bounding boxes)[1][2].

MM-RAG is particularly useful for documents where a significant portion of the meaning is conveyed in a non-textual form (page layout, diagrams, table structures). In such cases, classic text-based RAG often loses important contextual elements[3][4].

Context and the Problem Being Solved

Classic RAG operates on text passages and is unaware of visual structures (element layout, figure captions, chart axes). MM-RAG addresses these gaps by extracting structured elements (text, tables, images with coordinates), indexing them into a vector space, and combining evidence from different modalities[5][6].
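The structured elements this stage produces can be modeled with a small record type; a minimal sketch (the field names and layout are illustrative, not taken from any particular library):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DocElement:
    """One extracted region of a document page (text, table, or image)."""
    doc_id: str
    page: int
    modality: str                             # "text" | "table" | "image"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in page coordinates
    text: Optional[str] = None                # OCR'd or native text, if any
    image_ref: Optional[str] = None           # path/URI to the cropped region image

# Example: a table region found on page 3 of a report
el = DocElement(doc_id="report-2024", page=3, modality="table",
                bbox=(72.0, 140.5, 523.0, 410.0), text="Q1 revenue ...")
```

Keeping the coordinates alongside the content is what later enables region-level citations in the answer.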

MM-RAG Architecture

The MM-RAG pipeline extends the classic RAG pipeline with stages for visual data processing and modality alignment: ingestion → indexing → multimodal retrieval → fusion and reranking → generation with traceability.

  1. Ingestion and Preprocessing. The pipeline takes PDFs, scans, or images as input. It performs OCR and layout analysis to identify regions such as paragraphs, headings, tables, images, and their coordinates. Common tools include models from the LayoutLM family and libraries like LayoutParser; validation and training often rely on datasets like PubLayNet and DocLayNet[7][8][9].
  2. Region Segmentation. Visual objects (diagrams, tables, illustrations, captions) are extracted. For increased robustness, OCR-free models (e.g., Donut) or combined OCR+VLM pipelines are used[10].
  3. Indexing (Vector Index). Text chunks and visual elements (images or their descriptions) are converted into vector representations and stored in a vector database. For the shared text↔image space, models like CLIP or SigLIP are used; for production environments, multimodal/multi-vector indexes (one object, multiple vectors) are convenient[6][11][12].
  4. Multimodal Retrieval and Reranking. A combination of text and visual search is performed. The candidates (paragraphs, tables, images/regions) are combined and reranked by a more heavyweight model (a cross-encoder or LLM reranker) to improve precision[13].
  5. Context Assembly and Generation. The selected fragments are fed into an LLM/VLM. If the model is multimodal (e.g., GPT-4V/4o), images can be passed directly; if it's a text-only LLM, images are converted into detailed descriptions beforehand[14][15].
  6. Traceability and Citation. The generated answer is accompanied by clickable citations linked not only to the document/page but also to the specific region (coordinates). This enhances grounding and user trust[2].
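Steps 3–4 reduce to nearest-neighbour search over one index that holds both text and visual entries. A minimal NumPy sketch, assuming embeddings already produced by a CLIP-style encoder (the index layout, ids, and vector values here are illustrative):

```python
import numpy as np

# Toy mixed-modality index: each entry carries an id, a modality tag,
# and an embedding from a shared text<->image space (illustrative values).
index = [
    {"id": "p1-para", "modality": "text",  "vec": np.array([0.9, 0.1, 0.0])},
    {"id": "p1-fig2", "modality": "image", "vec": np.array([0.8, 0.2, 0.1])},
    {"id": "p4-tab1", "modality": "table", "vec": np.array([0.0, 0.9, 0.4])},
]

def search(query_vec, index, k=2):
    """Rank all entries, textual and visual alike, by cosine similarity."""
    vecs = np.stack([e["vec"] for e in index])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = vecs @ q
    order = np.argsort(-scores)[:k]
    return [(index[i]["id"], float(scores[i])) for i in order]

hits = search(np.array([1.0, 0.0, 0.0]), index)
```

In a real system the brute-force scan is replaced by an ANN index in a vector database, and the top-k candidates then go to the reranking stage.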

Quality Evaluation and Metrics

The effectiveness of MM‑RAG is evaluated at the extraction, retrieval, and generation levels.

  • Visual Data Extraction Quality. OCR error rates (WER/CER) and layout-analysis quality (mAP/Precision/Recall) on datasets like DocLayNet/PubLayNet[8][7].
  • Retrieval Quality. Standard IR metrics: Recall@K, Precision@K, MRR; for multimodality, these are measured separately for each modality and for the combined results.
  • Answer Quality (End‑to‑End). Automated metrics for faithfulness/groundedness and human evaluation. In practice, frameworks like RAGAS, TruLens, and DeepEval are used[16].
  • Benchmarks.
    • DocVQA: questions about document images[3].
    • TextVQA: questions that require reading text within images[4].
    • InfographicVQA: questions about infographics[17].
    • ChartQA: questions about charts that require logical reasoning[18].
    • MMDocRAG: a benchmark for multimodal RAG for DocQA (multi-page documents, cross-modal evidence chains)[19].
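The retrieval metrics listed above are simple to compute from ranked result lists; a minimal Python sketch:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank over (retrieved, relevant) query pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

r = recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)        # 1 of 2 relevant in top-2 -> 0.5
m = mrr([(["x", "a", "b"], {"a"}), (["c"], {"c"})])      # (1/2 + 1/1) / 2 -> 0.75
```

For multimodal evaluation the same functions are run per modality (text-only, image-only) and on the fused result list.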

Comparative Table of Components

Comparison of key components and approaches in MM‑RAG
  • OCR. Options: Tesseract / PaddleOCR / cloud APIs. Pros: local options offer privacy and control; cloud APIs provide high out-of-the-box accuracy. Cons/risks: errors on complex layouts; APIs involve costs and compliance requirements. When to choose: local OCR for private data; a cloud service for maximum accuracy (if permissible).
  • Layout analysis. Options: rule-based / ML models (LayoutLM, LayoutParser). Pros: rules are simple for uniform templates; ML is robust to variety. Cons/risks: rules break on new layouts; ML requires resources and data. When to choose: rules for uniform forms; ML for a diverse corpus.
  • Vectorization (images). Options: CLIP / SigLIP / OCR-free descriptions (Donut, Pix2Struct). Pros: shared latent space for text↔image (CLIP/SigLIP); OCR-free removes the dependency on OCR. Cons/risks: CLIP does not read text within images; descriptions can distort meaning. When to choose: CLIP/SigLIP for basic multimodal search; OCR-free for low-quality scans.
  • Result fusion. Options: sort by score / modality-based quotas / LLM reranker. Pros: a reranker significantly improves context-selection accuracy. Cons/risks: increased latency and cost. When to choose: a reranker for high-precision scenarios; simpler methods for PoCs.
  • Storage/index. Options: single vector / multi-vector (text+image) / hybrid (BM25+vector). Pros: multi-vector covers different representations of one object; hybrid search helps with keywords and codes. Cons/risks: increased schema and update complexity. When to choose: production systems with mixed data and strict SLAs.
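For the hybrid (BM25+vector) option, a common way to merge two rankings without comparing their incompatible raw scores is Reciprocal Rank Fusion; a minimal sketch (k=60 is the conventional default constant):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists from different retrievers
    (e.g., a BM25 ranking and a vector-search ranking)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["a", "b", "c"],   # e.g., BM25 ranking
             ["c", "a"]])       # e.g., vector ranking
```

Because RRF uses only ranks, it needs no score normalization, which is why it is a popular default in hybrid search engines.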

Practical Notes

  • Hybrid search (BM25 + vector) is the de facto standard for improving recall and precision on specific terms/codes[20].
  • Reranking with a cross-encoder or LLM saves tokens by discarding irrelevant candidates before generation[13].
  • Modern VLM retrievers (e.g., ColPali) show advantages on visually rich documents by directly indexing page images[21].
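The reranking step mentioned above is, structurally, a second-stage sort by a heavier scorer. In this sketch a toy word-overlap function stands in for a real cross-encoder (the score function and sample passages are illustrative):

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order first-stage candidates with a heavier scorer
    (in production: a cross-encoder or an LLM judging relevance)."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Toy stand-in scorer: word overlap instead of a real cross-encoder.
def overlap(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))

best = rerank("axes of the revenue chart",
              ["The chart axes show quarterly revenue.",
               "Unrelated paragraph about shipping."],
              overlap, top_n=1)
```

Only the surviving top_n candidates are placed into the LLM context, which is how reranking saves generation tokens.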

Literature

  • Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
  • Gao, L. et al. (2023). Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496.
  • Mei, L., Mo, S., Yang, Z., Chen, C. (2025). A Survey of Multimodal Retrieval‑Augmented Generation. arXiv:2504.08748.
  • Abootorabi, M.M. et al. (2025). Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval‑Augmented Generation. Findings of ACL 2025. ACL Anthology.
  • Yu, S. et al. (2024). VisRAG: Vision‑based Retrieval‑augmented Generation on Multi‑modality Documents. arXiv:2410.10594.
  • Cho, J. et al. (2024). M3DocRAG: Multi‑modal Retrieval is What You Need for Multi‑document QA. arXiv:2411.04952.
  • Tanaka, R. et al. (2025). VDocRAG: Retrieval‑Augmented Generation over Visually‑Rich Documents. CVPR 2025. arXiv:2504.09795.
  • Dong, K. et al. (2025). MMDocRAG: Benchmarking Retrieval‑Augmented Multimodal Generation for Document Question Answering. arXiv:2505.16470.
  • Wasserman, N. et al. (2025). REAL‑MM‑RAG: A Real‑World Multi‑Modal Retrieval Benchmark. arXiv:2502.12342.
  • Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449.
  • Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020.
  • Tschannen, M. et al. (2025). SigLIP 2: Multilingual Vision‑Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv:2502.14786.
  • Xu, Y. et al. (2020). LayoutLM: Pre‑training of Text and Layout for Document Image Understanding. KDD 2020. arXiv:1912.13318.
  • Huang, Y. et al. (2022). LayoutLMv3: Pre‑training for Document AI with Unified Text and Image Masking. arXiv:2204.08387.
  • Zhong, X., Tang, J., Jimeno‑Yepes, A.J. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. arXiv:1908.07836.
  • Pfitzmann, B. et al. (2022). DocLayNet: A Large Human‑Annotated Dataset for Document‑Layout Analysis. arXiv:2206.01062.
  • Kim, G. et al. (2021). OCR‑free Document Understanding Transformer (Donut). arXiv:2111.15664.
  • Shen, Z. et al. (2021). LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348.
  • Singh, A. et al. (2019). Towards VQA Models That Can Read (TextVQA). CVPR 2019. arXiv:1904.08920.
  • Mathew, M. et al. (2021). DocVQA: A Dataset for VQA on Document Images. WACV 2021. arXiv:2007.00398.
  • Mathew, M. et al. (2022). InfographicVQA: Understanding Infographics via Question Answering. WACV 2022. arXiv:2104.12756.
  • Masry, A. et al. (2022). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL 2022. arXiv:2203.10244.
  • Liu, F. et al. (2022). DePlot: One‑shot Visual Language Reasoning by Plot‑to‑Table Translation. arXiv:2212.10505.
  • Wang, P. et al. (2024). Qwen2‑VL: Enhancing Vision‑Language Model’s Capabilities in OCR and Chart QA. arXiv:2409.12191.
  • Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval‑Augmented Generation. arXiv:2309.15217.

Notes

  1. Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
  2. Yu, S. et al. (2024). VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. arXiv:2410.10594.
  3. Mathew, M. et al. (2021). DocVQA: A Dataset for VQA on Document Images. WACV. arXiv:2007.00398.
  4. Singh, A. et al. (2019). Towards VQA Models That Can Read (TextVQA). CVPR. arXiv:1904.08920.
  5. Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD. arXiv:1912.13318.
  6. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML. arXiv:2103.00020.
  7. Zhong, X., Tang, J., Yepes, A. J. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. ICDAR. arXiv:1908.07836.
  8. Pfitzmann, B. et al. (2022). DocLayNet: A Large Human‑Annotated Dataset for Document‑Layout Analysis. KDD. arXiv:2206.01062.
  9. Shen, Z. et al. (2021). LayoutParser: A Unified Toolkit for DL‑based Document Image Analysis. arXiv:2103.15348.
  10. Kim, G. et al. (2021). Donut: OCR‑free Document Understanding Transformer. arXiv:2111.15664.
  11. Zhai, X. et al. (2023). Sigmoid Loss for Language‑Image Pre‑Training (SigLIP). ICCV. arXiv:2303.15343.
  12. Milvus Docs. Multi‑Vector Hybrid Search. milvus.io/docs/multi-vector-search.md.
  13. Cohere Docs. Rerank API. docs.cohere.com/reference/rerank.
  14. OpenAI. GPT‑4V(ision) System Card. (2023).
  15. OpenAI. Hello GPT‑4o. (2024). openai.com/index/hello-gpt-4o/.
  16. Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
  17. Mathew, M. et al. (2021). InfographicVQA: Understanding Infographics via Question Answering. ICDAR. arXiv:2104.12756.
  18. Masry, A. et al. (2022). ChartQA: A Benchmark for Question Answering about Charts. ACL (Findings). arXiv:2203.10244.
  19. Dong, K. et al. (2025). Benchmarking Retrieval‑Augmented Multimodal Generation for Document QA (MMDocRAG). arXiv:2505.16470.
  20. Weaviate Docs. Hybrid search. docs.weaviate.io/.../hybrid-search.
  21. Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision‑Language Models. arXiv:2407.01449.