MM-RAG (Multimodal RAG)

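MM-RAG (Multimodal Retrieval-Augmented Generation) extends the classic RAG paradigm: the LLM draws not only on text but also on visual data (images, charts, tables, figures) when generating an answer. Multimodal retrieval finds and links evidence across these forms. Because answers rest on external sources and can be localized to specific page fragments and regions (bounding boxes), the risk of hallucination is reduced^[1]^[2].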

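MM-RAG is especially useful for documents in which much of the information is carried by non-textual form, such as page layout, charts, and table structure. In these cases classic text-only RAG often loses important context^[3]^[4].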

Background and problems addressed

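Classic RAG operates on text passages and cannot see visual structure such as element layout, figure captions, or chart axes. MM-RAG fills these gaps: it extracts structured elements (text, tables, images with coordinates), indexes them in a vector space, and combines evidence from different modalities^[5]^[6].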
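As a rough illustration of indexing elements from different modalities in one vector space, the sketch below stores text, image, and table items side by side and ranks them by cosine similarity. The three-dimensional vectors are toy values standing in for embeddings from a shared text-image encoder; `IndexedItem`, `search`, and the item IDs are hypothetical names introduced only for this example.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class IndexedItem:
    item_id: str
    modality: str        # "text", "table", or "image"
    vector: list[float]  # toy embedding; a real system would use an encoder


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: list[IndexedItem], query_vec: list[float], k: int = 3):
    """Return the k most similar items as (item_id, modality, score)."""
    scored = [(it.item_id, it.modality, cosine(query_vec, it.vector)) for it in index]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]


# Hypothetical items: one paragraph, one figure, one table.
index = [
    IndexedItem("p3-para2", "text", [0.9, 0.1, 0.0]),
    IndexedItem("p3-fig1", "image", [0.8, 0.2, 0.1]),
    IndexedItem("p7-table2", "table", [0.0, 0.1, 0.9]),
]
results = search(index, [1.0, 0.0, 0.0], k=2)
```

In this toy setup a single query vector retrieves both a paragraph and a figure, which is the behaviour a shared embedding space is meant to enable.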

MM-RAG architecture

The MM-RAG workflow adds visual data processing and modality alignment steps to classic RAG: Ingestion → Indexing → Multimodal Retrieval → Fusion & Reranking → Generation with Tracing.

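Ingestion and preprocessing: The input is PDFs, scans, or images. The system runs OCR and page layout analysis to identify and delimit regions (paragraphs, headings, tables, images) together with their coordinates. Common tools include the LayoutLM family of models and the LayoutParser toolkit; training and validation typically rely on the PubLayNet and DocLayNet datasets^[7]^[8]^[9].
Region segmentation: Visual objects (charts, tables, illustrations, captions) are extracted. For robustness, OCR-free models (such as Donut) or a combined OCR+VLM pipeline can be used^[10].
Indexing (vector index): Text chunks and visual elements (images or their descriptions) are converted to vector representations and stored in a vector database. CLIP or SigLIP is commonly used to build a shared text↔image space; in production, multimodal/multi-vector indexes (several vectors per object) are often more convenient^[6]^[11]^[12].
Multimodal retrieval and reranking: Text and visual search are combined; the candidates (paragraphs, tables, images/regions) are merged and then reranked with a heavier model (such as a cross-encoder or LLM reranker) to improve precision^[13].
Context packing and generation: The selected fragments are passed to an LLM/VLM. If the model is multimodal (such as GPT-4V/4o), images can be passed in directly; with a text-only LLM, images must first be converted to detailed descriptions^[14]^[15].
Tracing and citation: The generated answer carries clickable citations that link not only to the document and page but also to the exact region (coordinates). This improves grounding and user trust^[2].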
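The context-packing and tracing steps above can be sketched with a small region record that keeps the page number and bounding box next to the extracted content. This is a minimal sketch under stated assumptions: the `Region` fields, the `[n]` citation format, and the file name `report.pdf` are illustrative choices, not a format prescribed by any particular system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    doc_id: str
    page: int
    kind: str                                # "paragraph", "table", "figure", ...
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    content: str                             # OCR text, or a description for an image


def pack_context(regions: list[Region]) -> str:
    """Number each region so the model can cite it as [n]."""
    lines = [
        f"[{i}] ({r.kind}, p.{r.page}) {r.content}"
        for i, r in enumerate(regions, start=1)
    ]
    return "\n".join(lines)


def resolve_citation(regions: list[Region], n: int) -> dict:
    """Map a citation marker [n] back to document, page, and bounding box."""
    r = regions[n - 1]
    return {"doc_id": r.doc_id, "page": r.page, "bbox": r.bbox}


# Hypothetical retrieved regions.
regions = [
    Region("report.pdf", 3, "paragraph", (72.0, 100.0, 540.0, 180.0), "Revenue grew in Q2."),
    Region("report.pdf", 4, "figure", (90.0, 220.0, 520.0, 480.0), "Bar chart of quarterly revenue."),
]
prompt_context = pack_context(regions)
link = resolve_citation(regions, 2)
```

Keeping the bounding box on the record from ingestion onward means a citation such as `[2]` in the answer can be turned into a highlight on the page without a second lookup.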

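Quality evaluation and metrics

MM-RAG is evaluated at three levels: extraction, retrieval, and generation.

Visual data extraction quality: OCR error rates (WER/CER), and layout analysis quality (mAP/precision/recall) measured on datasets such as DocLayNet and PubLayNet^[8]^[7].
Retrieval quality: Standard information retrieval metrics (Recall@K, Precision@K, MRR); for multimodal retrieval, each modality should be evaluated separately as well as in combination.
End-to-end answer quality: Automated faithfulness/groundedness metrics plus human evaluation. Frameworks such as RAGAS, TruLens, and DeepEval are commonly used in practice^[16].
Benchmarks:
- DocVQA: question answering over document images^[3].
- TextVQA: questions that require reading text in an image^[4].
- InfographicVQA: question answering over infographics^[17].
- ChartQA: question answering over charts, requiring visual and logical reasoning^[18].
- MMDocRAG: a multimodal RAG benchmark for document question answering (DocQA), with multi-page documents and cross-modal evidence chains^[19].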
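For reference, the extraction and retrieval metrics named in this section can be computed with a few lines of plain Python. The sketch below uses the textbook definitions of Recall@K, Precision@K, MRR, and character error rate (edit distance divided by reference length); the function names and sample IDs are illustrative, and evaluation frameworks may differ in edge-case handling.

```python
from __future__ import annotations


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over a list of (retrieved, relevant) queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i]
        for j, hc in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (rc != hc)))
        prev = curr
    return prev[-1] / max(len(reference), 1)


# Hypothetical results for two queries over mixed-modality candidates.
runs = [
    (["fig1", "para2", "table3", "para9"], {"para2", "table3"}),
    (["para5", "fig2"], {"para5"}),
]
mrr = mean_reciprocal_rank(runs)
```

To evaluate modalities separately, the same functions can be applied to the retrieved list filtered down to one modality at a time.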

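Component comparison table

Comparison of key MM-RAG components and methods
Component	Implementation options	Advantages	Drawbacks / risks	Selection guidance
OCR	Tesseract / PaddleOCR / cloud service APIs	Local deployment keeps data private and under your control; cloud services offer high accuracy out of the box.	Complex layouts can cause recognition errors; API-based options bring cost and compliance requirements.	Use local OCR for confidential data; use a cloud service for maximum accuracy where compliance allows.
Layout analysis	Rules / machine learning models (LayoutLM, LayoutParser)	Rules are simple and effective for standardized templates; machine learning is robust across varied layouts.	Rules tend to break on new layouts; machine learning needs compute and data.	Rules for standardized forms; machine learning for heterogeneous document collections.
Image vectorization	CLIP / SigLIP / OCR-free description generation (Donut/Pix2Struct)	CLIP/SigLIP provide a shared text↔image latent space; OCR-free methods remove the dependency on OCR.	CLIP-style encoders are unreliable at reading text inside images; generated descriptions can distort the original meaning.	CLIP/SigLIP for baseline multimodal search; OCR-free methods for poor-quality scans.
Result fusion	Sort by score / per-modality quotas / LLM reranker	A reranker can markedly improve the precision of context selection.	Adds latency and cost.	Reranker for high-precision scenarios; simple methods for a proof of concept (PoC).
Storage / indexing	Single-vector / multi-vector (text + image) / hybrid index (BM25 + vector)	Multi-vector indexes cover different representations of the same object; hybrid search handles keywords and codes well.	More complex schema and update process.	Mixed data and production systems with strict service-level agreements (SLAs).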
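The per-modality quota option in the result fusion row can be sketched as follows: keep at most a fixed number of top-scoring candidates from each modality, then merge what remains by score. This is one simple reading of the idea; the candidate dictionaries, field names, and quota values are assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict


def fuse_with_quotas(candidates: list[dict], quotas: dict[str, int]) -> list[dict]:
    """Keep at most quotas[modality] top candidates per modality, then merge by score."""
    by_modality: dict[str, list[dict]] = defaultdict(list)
    for c in candidates:
        by_modality[c["modality"]].append(c)

    kept: list[dict] = []
    for modality, items in by_modality.items():
        items.sort(key=lambda c: c["score"], reverse=True)
        # Modalities without a quota entry contribute nothing.
        kept.extend(items[: quotas.get(modality, 0)])
    return sorted(kept, key=lambda c: c["score"], reverse=True)


# Hypothetical candidates from separate text, image, and table searches.
candidates = [
    {"id": "t1", "modality": "text", "score": 0.92},
    {"id": "t2", "modality": "text", "score": 0.90},
    {"id": "t3", "modality": "text", "score": 0.88},
    {"id": "i1", "modality": "image", "score": 0.60},
    {"id": "tb1", "modality": "table", "score": 0.55},
]
fused = fuse_with_quotas(candidates, {"text": 2, "image": 1, "table": 1})
fused_ids = [c["id"] for c in fused]
```

Quotas guarantee that lower-scoring modalities still reach the context. The final merge by raw score assumes scores are roughly comparable across modalities, which may not hold when different encoders are used; a reranker over the kept candidates is one way around that.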

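Practical notes

Hybrid search (BM25 + vector): has become a de facto standard for improving recall and precision on specific terms and codes^[20].
Reranking: reranking with a cross-encoder or LLM filters out junk candidates before generation, which saves tokens^[13].
Modern VLM retrievers (for example, ColPali) index page images directly and show an advantage on visually rich documents^[21].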
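One common way to combine BM25 and vector results is to min-max normalize each score list and take a weighted sum. The sketch below assumes the two score dictionaries have already been produced by a keyword engine and a vector index; the `alpha` weight, function names, and document IDs are illustrative and do not reproduce the API of any specific database.

```python
from __future__ import annotations


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to 1.0 for every item."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(
    bm25: dict[str, float], dense: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Weighted fusion: alpha = 1.0 is pure vector search, alpha = 0.0 pure keyword search."""
    b, d = min_max(bm25), min_max(dense)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(b) | set(d)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical raw scores: BM25 is unbounded, cosine similarity is in [-1, 1].
bm25 = {"doc_a": 12.0, "doc_b": 3.0, "doc_c": 0.0}
dense = {"doc_b": 0.82, "doc_c": 0.80, "doc_d": 0.40}
ranked = hybrid_scores(bm25, dense, alpha=0.5)
ranked_ids = [doc_id for doc_id, _ in ranked]
```

Here `doc_b` ranks first because it scores reasonably on both signals, while `doc_a` (an exact keyword match with no vector hit) still stays near the top. Rank-based schemes such as reciprocal rank fusion are an alternative when raw scores are hard to normalize.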

See also

Retrieval-Augmented Generation
Vector database
Embedding
GraphRAG

References

Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
Gao, L. et al. (2023). Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496.
Mei, L., Mo, S., Yang, Z., Chen, C. (2025). A Survey of Multimodal Retrieval‑Augmented Generation. arXiv:2504.08748.
Abootorabi, M.M. et al. (2025). Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval‑Augmented Generation. Findings of ACL 2025.
Yu, S. et al. (2024). VisRAG: Vision‑based Retrieval‑augmented Generation on Multi‑modality Documents. arXiv:2410.10594.
Cho, J. et al. (2024). M3DocRAG: Multi‑modal Retrieval is What You Need for Multi‑page Multi‑document Understanding. arXiv:2411.04952.
Tanaka, R. et al. (2025). VDocRAG: Retrieval‑Augmented Generation over Visually‑Rich Documents. CVPR 2025. arXiv:2504.09795.
Dong, K. et al. (2025). MMDocRAG: Benchmarking Retrieval‑Augmented Multimodal Generation for Document Question Answering. arXiv:2505.16470.
Wasserman, N. et al. (2025). REAL‑MM‑RAG: A Real‑World Multi‑Modal Retrieval Benchmark. arXiv:2502.12342.
Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449.
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020.
Tschannen, M. et al. (2025). SigLIP 2: Multilingual Vision‑Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv:2502.14786.
Xu, Y. et al. (2020). LayoutLM: Pre‑training of Text and Layout for Document Image Understanding. KDD 2020. arXiv:1912.13318.
Huang, Y. et al. (2022). LayoutLMv3: Pre‑training for Document AI with Unified Text and Image Masking. arXiv:2204.08387.
Zhong, X., Tang, J., Jimeno‑Yepes, A.J. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. arXiv:1908.07836.
Pfitzmann, B. et al. (2022). DocLayNet: A Large Human‑Annotated Dataset for Document‑Layout Analysis. arXiv:2206.01062.
Kim, G. et al. (2021). OCR‑free Document Understanding Transformer (Donut). arXiv:2111.15664.
Shen, Z. et al. (2021). LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348.
Singh, A. et al. (2019). Towards VQA Models That Can Read (TextVQA). CVPR 2019. arXiv:1904.08920.
Mathew, M. et al. (2021). DocVQA: A Dataset for VQA on Document Images. WACV 2021. arXiv:2007.00398.
Mathew, M. et al. (2022). InfographicVQA. WACV 2022. arXiv:2104.12756.
Masry, A. et al. (2022). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL 2022. arXiv:2203.10244.
Liu, F. et al. (2022). DePlot: One‑shot Visual Language Reasoning by Plot‑to‑Table Translation. arXiv:2212.10505.
Wang, P. et al. (2024). Qwen2‑VL: Enhancing Vision‑Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191.
Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval‑Augmented Generation. arXiv:2309.15217.

Notes

[1] Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
[2] Yu, S. et al. (2024). VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. arXiv:2410.10594.
[3] Mathew, M. et al. (2021). DocVQA: A Dataset for VQA on Document Images. WACV. arXiv:2007.00398.
[4] Singh, A. et al. (2019). Towards VQA Models That Can Read (TextVQA). CVPR. arXiv:1904.08920.
[5] Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD. arXiv:1912.13318.
[6] Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML. arXiv:2103.00020.
[7] Zhong, X., Tang, J., Yepes, A. J. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. ICDAR. arXiv:1908.07836.
[8] Pfitzmann, B. et al. (2022). DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. KDD. arXiv:2206.01062.
[9] Shen, Z. et al. (2021). LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348.
[10] Kim, G. et al. (2021). OCR-free Document Understanding Transformer (Donut). arXiv:2111.15664.
[11] Zhai, X. et al. (2023). Sigmoid Loss for Language-Image Pre-Training (SigLIP). ICCV. arXiv:2303.15343.
[12] Milvus Docs. Multi-Vector Hybrid Search. milvus.io/docs/multi-vector-search.md.
[13] Cohere Docs. Rerank API. docs.cohere.com/reference/rerank.
[14] OpenAI. GPT-4V(ision) System Card. (2023).
[15] OpenAI. Hello GPT-4o. (2024). openai.com/index/hello-gpt-4o/.
[16] Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
[17] Mathew, M. et al. (2022). InfographicVQA. WACV. arXiv:2104.12756.
[18] Masry, A. et al. (2022). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. ACL (Findings). arXiv:2203.10244.
[19] Dong, K. et al. (2025). Benchmarking Retrieval-Augmented Multimodal Generation for Document QA (MMDocRAG). arXiv:2505.16470.
[20] Weaviate Docs. Hybrid search. docs.weaviate.io/.../hybrid-search.
[21] Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449.

