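MM-RAG (Multimodal RAG)

MM-RAG (Multimodal Retrieval-Augmented Generation) extends the classic RAG paradigm so that the LLM draws on visual data (images, figures, tables, charts) as well as text when generating an answer. Multimodal retrieval makes it possible to find and link evidence across different forms of representation, and it reduces the risk of hallucination by grounding answers in external sources that are tied precisely to page fragments and regions (bounding boxes)^[1]^[2].

MM-RAG is especially useful for documents in which much of the meaning is carried by non-textual form, such as page layout, figures, and table structure. In such cases classic text-based RAG often loses important context^[3]^[4].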

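Background and the problem addressed

Classic RAG operates on text passages and does not see visual structure such as element placement, figure captions, or chart axes. MM-RAG closes these gaps: it extracts structured elements (text, tables, and images with coordinates), indexes them in a vector space, and combines evidence from different modalities^[5]^[6].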
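To make "structured elements with coordinates" concrete, the following is a minimal sketch of how one extracted element might be represented. The class name, field names, and citation string format are all hypothetical, chosen for illustration; they are not taken from any particular library or from the cited papers.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

# Hypothetical schema: every name here is illustrative only.
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class Element:
    doc_id: str
    page: int
    kind: Literal["text", "table", "image"]
    bbox: BBox
    content: str  # raw text, a serialized table, or an image description
    image_path: Optional[str] = None  # set only for image elements

    def citation(self) -> str:
        """Return a region-level reference that a UI could turn into a link."""
        x0, y0, x1, y1 = self.bbox
        return f"{self.doc_id}#page={self.page}&bbox={x0:g},{y0:g},{x1:g},{y1:g}"


elements = [
    Element("report.pdf", 3, "text", (72, 90, 540, 160), "Revenue grew in Q2."),
    Element("report.pdf", 3, "image", (72, 180, 540, 420),
            "Bar chart of quarterly revenue", image_path="p3_fig1.png"),
]
for el in elements:
    print(el.kind, el.citation())
```

Keeping the page number and bounding box on every element is what later allows an answer to point back to a specific region rather than only to a document.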

MM-RAG architecture

The MM-RAG pipeline adds visual data processing and modality alignment stages to classic RAG: ingestion → indexing → multimodal retrieval → merging and reranking → generation with traceability.

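Ingestion and preprocessing: The input is PDFs, scans, or images. OCR and page layout analysis are run to identify regions such as paragraphs, headings, tables, and images, together with their coordinates. Typical tools include the LayoutLM family of models and the LayoutParser library; validation and training often rely on the PubLayNet and DocLayNet datasets^[7]^[8]^[9].
Region segmentation: Visual objects (figures, tables, illustrations, captions) are extracted. For greater robustness, OCR-free models (e.g. Donut) or combined OCR+VLM pipelines are used^[10].
Indexing (vector index): Text chunks and visual elements (images or their descriptions) are converted to vector representations and stored in a vector database. CLIP or SigLIP is used for a joint text↔image space, and in production a multimodal/multi-vector index (several vectors per object) is convenient^[6]^[11]^[12].
Multimodal retrieval and reranking: Text search and visual search are combined, the candidates (paragraphs, tables, images/regions) are merged, and a heavier model (cross-encoder or LLM reranker) reranks them to improve precision^[13].
Context packaging and generation: The selected fragments are passed to the LLM/VLM. If the model is multimodal (e.g. GPT-4V/4o), images can be supplied directly; for a text-only LLM, images are first converted to detailed descriptions^[14]^[15].
Traceability and citations: The answer carries clickable citations tied not only to the document and page but also to the region (coordinates). This strengthens grounding and increases user trust^[2].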
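The retrieval and packaging stages above can be sketched with a toy in-memory index. This is a minimal illustration under stated assumptions: the vectors are made-up numbers standing in for embeddings from a shared text-image encoder, the index layout and function names are hypothetical, a region with several vectors is scored by its best-matching one, and the reranking step is omitted for brevity.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy multi-vector index: a region may carry several vectors (for example
# one for its text and one for its image). All values are invented.
INDEX: Dict[str, dict] = {
    "r1": {"page": 3, "bbox": (72, 90, 540, 160), "kind": "text",
           "content": "Revenue grew in Q2.", "vectors": [[0.9, 0.1, 0.0]]},
    "r2": {"page": 3, "bbox": (72, 180, 540, 420), "kind": "image",
           "content": "Bar chart of quarterly revenue",
           "vectors": [[0.7, 0.6, 0.1], [0.2, 0.9, 0.1]]},
    "r3": {"page": 7, "bbox": (72, 90, 540, 300), "kind": "table",
           "content": "Headcount by region", "vectors": [[0.0, 0.1, 0.9]]},
}


def retrieve(query_vec: Vector, k: int = 2) -> List[Tuple[str, float]]:
    """Score each region by its best-matching vector and return the top k."""
    scored = [
        (rid, max(cosine(query_vec, v) for v in region["vectors"]))
        for rid, region in INDEX.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def build_context(hits: List[Tuple[str, float]]) -> str:
    """Number each fragment so the answer can cite its page and region."""
    lines = []
    for n, (rid, _score) in enumerate(hits, start=1):
        r = INDEX[rid]
        lines.append(
            f"[{n}] ({r['kind']}, p.{r['page']}, bbox={r['bbox']}) {r['content']}"
        )
    return "\n".join(lines)


hits = retrieve([0.8, 0.5, 0.0])
print(build_context(hits))
```

The numbered context string is what would be placed in the prompt; because each line keeps its page and bounding box, a citation marker such as [1] in the generated answer can be resolved back to a region.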

Quality evaluation and metrics

The effectiveness of MM-RAG is evaluated at the extraction, retrieval, and generation levels.

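Visual data extraction quality: OCR accuracy (WER/CER) and layout analysis quality (mAP/Precision/Recall) on the DocLayNet and PubLayNet datasets^[8]^[7].
Retrieval quality: Standard information retrieval metrics: Recall@K, Precision@K, MRR. For multimodal retrieval, evaluation is done per modality and again after merging.
Answer quality (end-to-end): Automatic metrics (faithfulness/groundedness) and human evaluation. In practice, frameworks such as RAGAS, TruLens, and DeepEval are used^[16].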
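The retrieval metrics named above have simple definitions, sketched here in plain Python. The sketch assumes each ranked list contains no duplicate IDs and that relevance is binary; the region IDs in the example are invented.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant items."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant item, or 0 if none is found."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Invented example: two queries over mixed text/table/image regions.
q1 = (["img_12", "txt_4", "tab_7", "txt_9"], {"txt_4", "tab_7", "img_3"})
q2 = (["txt_1", "txt_2"], {"txt_1"})

print(recall_at_k(*q1, k=3), precision_at_k(*q1, k=3))
print(mean_reciprocal_rank([q1, q2]))
```

For per-modality evaluation, the same functions can be applied after filtering both the ranked list and the relevant set to one modality.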
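Benchmarks
- DocVQA: questions about document images^[3].
- TextVQA: questions that require reading text in an image^[4].
- InfographicVQA: questions about infographics^[17].
- ChartQA: questions about charts that require logical reasoning^[18].
- MMDocRAG: a multimodal RAG benchmark for DocQA (multi-page documents, cross-modal evidence chains)^[19].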

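Component comparison table

Comparison of the main components and approaches in MM-RAG
Component	Implementation options	Strengths	Weaknesses / risks	Selection criteria
OCR	Tesseract / PaddleOCR / cloud API	Local OCR keeps privacy and control; cloud APIs offer high accuracy out of the box.	Errors on complex layouts; APIs carry cost and compliance requirements.	Local OCR for private data; cloud when maximum accuracy is needed and permitted.
Layout analysis	Rule-based / ML models (LayoutLM, LayoutParser)	Rules are simple for templates of a single type; ML copes with variety.	Rules break on new layouts; ML needs compute and data.	Rules for uniform forms; ML for a diverse corpus.
Vectorization (images)	CLIP / SigLIP / OCR-free description (Donut/Pix2Struct)	Shared text↔image latent space (CLIP/SigLIP); OCR-free removes the dependence on OCR.	CLIP is unreliable at reading text inside images; generated descriptions can distort meaning.	CLIP/SigLIP for basic multimodal search; OCR-free for low-quality scans.
Result merging	Sort by score / per-modality quotas / LLM reranker	A reranker substantially improves the precision of context selection.	Higher latency and cost.	Scenarios that demand high precision; simple methods suffice for a PoC.
Storage / index	Single vector / multi-vector (text+image) / hybrid (BM25+vector)	Multi-vector covers different representations of one object; hybrid rescues keywords and codes.	More complex schema and updates.	Production systems with mixed data and strict SLAs.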
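Of the merging options in the table, per-modality quotas are the easiest to illustrate. The sketch below is one possible reading of that idea, not a reference implementation: candidates are taken in descending score order, but each modality may contribute at most a fixed number of items. The names `Candidate` and `merge_with_quotas` are hypothetical, and the sketch assumes scores from different modalities have already been made comparable.

```python
from typing import Dict, List, NamedTuple, Sequence


class Candidate(NamedTuple):
    region_id: str
    modality: str  # "text", "table", or "image"
    score: float


def merge_with_quotas(
    candidates: Sequence[Candidate], quotas: Dict[str, int], limit: int
) -> List[Candidate]:
    """Keep the best candidates overall, with at most quotas[m] per modality."""
    taken: Dict[str, int] = {}
    merged: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(merged) >= limit:
            break
        used = taken.get(cand.modality, 0)
        if used >= quotas.get(cand.modality, 0):
            continue  # this modality has filled its quota
        merged.append(cand)
        taken[cand.modality] = used + 1
    return merged


pool = [
    Candidate("t1", "text", 0.91),
    Candidate("t2", "text", 0.88),
    Candidate("t3", "text", 0.85),
    Candidate("i1", "image", 0.80),
    Candidate("tb1", "table", 0.60),
]
merged = merge_with_quotas(pool, {"text": 2, "image": 1, "table": 1}, limit=3)
print([c.region_id for c in merged])
```

A plain sort by score would have returned three text passages here; the quota keeps a slot for the image, which matters when the answer depends on a figure. A reranker could then be applied to the merged list.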

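Practical notes

Hybrid search (BM25 + vector) has become a de facto standard for improving recall and precision on specific technical terms and codes^[20].
Reranking with a cross-encoder or LLM filters out noisy candidates before generation, which saves tokens^[13].
Modern VLM retrievers (e.g. ColPali) index page images directly and show an advantage on visually rich documents^[21].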
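The text does not say how BM25 and vector results are combined in hybrid search. One common choice is reciprocal rank fusion (RRF), which needs only the two ranked lists and not their raw scores. The sketch below assumes RRF with the conventional constant k = 60; the document IDs are invented, and real systems may use a different fusion rule.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to an item's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d3", "d1", "d2"]    # keyword search results (invented)
vector_ranking = ["d1", "d4", "d3"]  # embedding search results (invented)

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Items that both retrievers rank highly rise to the top, while an exact keyword match that the embedding search missed still survives in the fused list.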

References

Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
Gao, L. et al. (2023). Precise Zero‑Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496.
Mei, L., Mo, S., Yang, Z., Chen, C. (2025). A Survey of Multimodal Retrieval‑Augmented Generation. arXiv:2504.08748.
Abootorabi, M.M. et al. (2025). Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval‑Augmented Generation. Findings of ACL 2025. ACL Anthology.
Yu, S. et al. (2024). VisRAG: Vision‑based Retrieval‑augmented Generation on Multi‑modality Documents. arXiv:2410.10594.
Cho, J. et al. (2024). M3DocRAG: Multi‑modal Retrieval is What You Need for Multi‑document QA. arXiv:2411.04952.
Tanaka, R. et al. (2025). VDocRAG: Retrieval‑Augmented Generation over Visually‑Rich Documents. CVPR 2025. arXiv:2504.09795 • CVF Open Access.
Dong, K. et al. (2025). MMDocRAG: Benchmarking Retrieval‑Augmented Multimodal Generation for Document Question Answering. arXiv:2505.16470.
Wasserman, N. et al. (2025). REAL‑MM‑RAG: A Real‑World Multi‑Modal Retrieval Benchmark. arXiv:2502.12342.
Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449.
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020.
Tschannen, M. et al. (2025). SigLIP 2: Multilingual Vision‑Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv:2502.14786.
Xu, Y. et al. (2020). LayoutLM: Pre‑training of Text and Layout for Document Image Understanding. KDD 2020. DOI • arXiv:1912.13318.
Huang, Y. et al. (2022). LayoutLMv3: Pre‑training for Document AI with Unified Text and Image Masking. arXiv:2204.08387.
Zhong, X., Tang, J., Jimeno‑Yepes, A.J. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. arXiv:1908.07836.
Pfitzmann, B. et al. (2022). DocLayNet: A Large Human‑Annotated Dataset for Document‑Layout Analysis. arXiv:2206.01062.
Kim, G. et al. (2021). OCR‑free Document Understanding Transformer (Donut). arXiv:2111.15664.
Shen, Z. et al. (2021). LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348.
Singh, A. et al. (2019). Towards VQA Models That Can Read (TextVQA). CVPR 2019. arXiv:1904.08920.
Mathew, M. et al. (2021/2022). DocVQA / InfographicVQA: Datasets for VQA on Document Images and Infographics. WACV 2021 / WACV 2022. CVF • arXiv:2104.12756.
Masry, A. et al. (2022). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL 2022. ACL • arXiv:2203.10244.
Liu, F. et al. (2022). DePlot: One‑shot Visual Language Reasoning by Plot‑to‑Table Translation. arXiv:2212.10505.
Wang, P. et al. (2024). Qwen2‑VL: Enhancing Vision‑Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191.
Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval‑Augmented Generation. arXiv:2309.15217.

Notes

[1] Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
[2] Yu, S. et al. (2024). VisRAG: Vision-based Retrieval-Augmented Generation on Multi-modality Documents. arXiv:2410.10594.
[3] Mathew, M. et al. (2021). DocVQA: A Dataset for VQA on Document Images. WACV. arXiv:2007.00398.
[4] Singh, A. et al. (2019). Towards VQA Models That Can Read (TextVQA). CVPR. arXiv:1904.08920.
[5] Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD. arXiv:1912.13318.
[6] Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML. arXiv:2103.00020.
[7] Zhong, X., Tang, J., Yepes, A. J. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. ICDAR. arXiv:1908.07836.
[8] Pfitzmann, B. et al. (2022). DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. KDD. arXiv:2206.01062.
[9] Shen, Z. et al. (2021). LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348.
[10] Kim, G. et al. (2021). OCR-free Document Understanding Transformer (Donut). arXiv:2111.15664.
[11] Zhai, X. et al. (2023). Sigmoid Loss for Language-Image Pre-Training (SigLIP). ICCV. arXiv:2303.15343.
[12] Milvus Docs. Multi-Vector Hybrid Search. milvus.io/docs/multi-vector-search.md.
[13] Cohere Docs. Rerank API. docs.cohere.com/reference/rerank.
[14] OpenAI (2023). GPT-4V(ision) System Card.
[15] OpenAI (2024). Hello GPT-4o. openai.com/index/hello-gpt-4o/.
[16] Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
[17] Mathew, M. et al. (2022). InfographicVQA. WACV. arXiv:2104.12756.
[18] Masry, A. et al. (2022). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL. arXiv:2203.10244.
[19] Dong, K. et al. (2025). MMDocRAG: Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering. arXiv:2505.16470.
[20] Weaviate Docs. Hybrid search. docs.weaviate.io/.../hybrid-search.
[21] Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449.

