Multimodal reasoning

From Systems Analysis Wiki

Multimodal reasoning is the capability of an artificial intelligence system, particularly a Large Language Model (LLM), to jointly process, interpret, and reason over information from multiple data types (modalities), such as text, images, audio, and video, in order to solve complex tasks[1]. This mirrors the multifaceted nature of human perception and is regarded as a key step toward a more versatile and adaptive artificial general intelligence (AGI)[2].

Models with this capability are called Multimodal Large Language Models (MLLMs or LMRMs — Large Multimodal Reasoning Models). They extend the capabilities of traditional LLMs, which are trained only on text, by enabling them to understand the content of images, analyze videos, control robots, and conduct dialogue based on visual data.

Evolution of Approaches

Approaches to multimodal reasoning have undergone a rapid evolution from modular systems to unified, language-centric architectures.

  • Early systems: Relied on modular pipelines in which separate components processed vision and text, with their representations combined only at a final stage. This approach required meticulous engineering for each specific task.
  • Modern systems: Have shifted to unified, language-centric models. In these systems, a large language model acts as the central hub, or reasoning "engine," processing information from all modalities in a unified format. This was made possible by methods that "taught" the language model to understand visual and other data by representing them as special tokens[1].

A major milestone in this transition was multimodal chain-of-thought (MCoT) prompting, in which the model is guided to produce intermediate reasoning steps that draw on different modalities before committing to a final answer.
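In practice, MCoT is often implemented at the prompt level by pairing an image reference with an instruction to reason in stages. A minimal sketch in Python, using a hypothetical message schema (the field names `role`, `type`, and `source` are illustrative, not any specific vendor's API):

```python
# Hypothetical multimodal message schema; real APIs differ in field names.
def mcot_messages(image_ref, question):
    """Build a prompt that asks for an image-grounded rationale before the answer."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "source": image_ref},
            {"type": "text",
             "text": question + "\n"
                     "Step 1: Describe the parts of the image relevant to the question.\n"
                     "Step 2: Reason step by step over that description.\n"
                     "Step 3: State the final answer."},
        ],
    }]

msgs = mcot_messages("chart.png", "Which product grew fastest?")
```

The two-stage pattern (first generate a rationale grounded in the visual input, then derive the answer from it) is the core idea studied in the MCoT literature cited below.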

Architectures of Multimodal LLMs

There are two main architectural strategies for integrating different modalities with a language model[3]:

1. Unified Token-Level Architecture

In this approach, all modalities are converted into a common representation compatible with the LLM. For example, an image is divided into patches, passed through a visual encoder (e.g., a Vision Transformer, or ViT), and converted into a sequence of vector embeddings, so-called visual tokens. These visual tokens are then concatenated with the text tokens and fed into the large language model, which processes them as a single stream.

  • Advantages: This design requires virtually no changes to the LLM's architecture and is easily scalable.
  • Examples: PaLM-E by Google; GPT-4 by OpenAI is widely believed to follow this design, although its architecture has not been disclosed.
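The token-level fusion described above can be sketched in a few lines of NumPy. Everything here is a stand-in (random weights in place of a trained ViT and embedding table, arbitrary dimensions); the point is only the data flow: patchify, project into the LLM's embedding space, concatenate.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                  # shared embedding width expected by the LLM

def patchify(image, patch=16):
    """Split an HxWxC image into flat (patch*patch*C)-dim vectors."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            patches.append(image[i:i+patch, j:j+patch].reshape(-1))
    return np.stack(patches)            # (num_patches, patch*patch*C)

# Stand-ins for a trained visual encoder and the LLM's token embedding table
# (both hypothetical; real systems learn these weights).
W_vis = rng.normal(size=(16 * 16 * 3, D)) * 0.02   # patch -> visual token
embed_table = rng.normal(size=(1000, D)) * 0.02    # vocab of 1000 text tokens

image = rng.random((64, 64, 3))
text_ids = np.array([5, 42, 7])

visual_tokens = patchify(image) @ W_vis            # (16, D): 4x4 grid of patches
text_tokens = embed_table[text_ids]                # (3, D)

# Unified token-level fusion: one sequence, fed to the LLM as-is.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)                              # (19, 64)
```

Because the LLM only ever sees a sequence of D-dimensional vectors, nothing in its architecture needs to change; this is why the design scales easily.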

2. Cross-Attention Architecture

Here, the language model and the visual encoder remain separate subsystems but are connected by special cross-attention layers. These layers allow the text and visual representations to influence each other during the generation process. The model effectively "glances" at the visual features at each step of generating the text response.

  • Advantages: It allows for the efficient use of powerful, pre-trained, and frozen models (e.g., a large LLM and a powerful ViT) by training only the connecting layers.
  • Example: Flamingo by DeepMind.
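The mechanism can be sketched as a single cross-attention step in NumPy, with random matrices standing in for the learned projections. In a Flamingo-style setup these connecting weights would be the only trained parameters, while the LLM and the visual encoder stay frozen.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vis_h, Wq, Wk, Wv):
    """Text hidden states (queries) attend over visual features (keys/values)."""
    Q, K, V = text_h @ Wq, vis_h @ Wk, vis_h @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V          # one output per text position

# Only these connecting projections would be trained; sizes are arbitrary here.
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
text_h = rng.normal(size=(5, D))        # 5 text positions from the frozen LLM
vis_h = rng.normal(size=(49, D))        # e.g. a 7x7 grid of visual features

out = cross_attention(text_h, vis_h, Wq, Wk, Wv)
print(out.shape)                        # (5, 32)
```

Each text position receives a weighted mixture of the visual features, which is what lets the model "glance" at the image at every generation step.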

In modern research, unified decoder-only architectures have become dominant, as they are easier to scale and better leverage the capabilities of existing LLMs[3].

Key Models and Research

The development of MLLMs accelerated significantly between 2022 and 2024.

  • Flamingo (DeepMind, 2022): One of the first large-scale visual-language models (VLMs) capable of solving diverse multimodal tasks in a few-shot learning setting without additional fine-tuning. Flamingo demonstrated that a single model could rapidly adapt to new tasks given just a few examples in the prompt[4].
  • Kosmos-1 (Microsoft Research, 2023): The first MLLM trained from scratch on web-scale data. It is capable of perceiving text and images as "common modalities" and has shown strong results on image-based text tasks (OCR), multimodal dialogue, and even non-verbal reasoning tasks (Raven's Progressive Matrices)[2].
  • GPT-4 (OpenAI, 2023): A flagship model positioned as a "large multimodal model" capable of accepting text and images as input. Although its architecture is not disclosed, it is known to be able to analyze the content of images, describe graphs, and explain visual memes. Access to its multimodal capabilities was provided on a limited basis, for example, in collaboration with the Be My Eyes app to assist blind and visually impaired individuals[5].
  • PaLM-E (Google, 2023): A so-called "embodied" multimodal model designed to integrate visual perception with a robot's physical actions. PaLM-E can generate step-by-step plans for controlling robots by taking a combination of camera images and sensor readings as input. This demonstrated a "positive transfer" effect: training on general "vision+language" tasks improved the effectiveness of robotics skills[6].
  • Llama 3.2 (Meta, 2024): An openly released series of models that also includes multimodal versions. Their release makes MLLM technologies accessible to the broader research community for further experimentation[3].

Challenges and Limitations

Despite impressive achievements, MLLMs face several serious challenges:

  • Hallucinations: Like their text-only predecessors, MLLMs can generate plausible-sounding but factually incorrect statements. Visual information does not eliminate this problem and can sometimes exacerbate it, leading to incorrect interpretations of images[7].
  • Generalization and Depth of Reasoning: Models often struggle to reliably transfer conclusions to new data types (omni-modal generalization), and their reasoning can be superficial. They might be able to describe a picture but fail at tasks requiring multi-step planning that considers both text and image[1].
  • Technical Hurdles: Training MLLMs requires enormous computational resources and large, carefully curated multimodal datasets. Evaluating the quality of these models is also challenging, as it requires specialized benchmarks that account for both understanding and reasoning.

Future Prospects

Trends indicate that multimodal models will become increasingly natively multimodal (Native Large Multimodal Models), meaning they are designed from the ground up to work with all modalities. The ultimate goal is a universal intelligence capable of perceiving and understanding the world as richly as a human. To that end, researchers are working on reducing the reliance on labeled data, training models for more abstract, causal reasoning, and ensuring safe control over such powerful systems. Auxiliary approaches such as HuggingGPT, in which an LLM acts as a coordinator that delegates subtasks to expert models, are also paving the way for more reliable multimodal AI[8].
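The coordinator pattern can be illustrated with a toy sketch. The `plan` function and the registered "experts" below are placeholders for an LLM planner and hosted specialist models, not HuggingGPT's actual planner or APIs.

```python
# Toy "expert" models: in a real system these would be calls to hosted
# specialist models (captioning, OCR, speech recognition, ...).
EXPERTS = {
    "caption": lambda inp: f"a photo of {inp}",
    "translate": lambda inp: inp.upper(),    # placeholder for a translator
}

def plan(task):
    """Stand-in for the LLM planner: map a task description to expert calls."""
    if "describe" in task:
        return [("caption", task.split()[-1])]
    return [("translate", task)]

def coordinate(task):
    """Run the planned expert calls in order and collect their results."""
    return [EXPERTS[name](arg) for name, arg in plan(task)]

print(coordinate("describe cat"))   # ['a photo of cat']
```

The design choice is the division of labor: the language model only decides *which* expert handles *which* subtask, so each modality is processed by a component specialized for it.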

Literature

  • Li, Y. et al. (2025). Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models. arXiv:2505.04921.
  • Lee, J. et al. (2024). Multimodal Reasoning with Multimodal Knowledge Graph. ACL 2024.
  • Huang, S. et al. (2023). Language Is Not All You Need: Aligning Perception with Language Models. arXiv:2302.14045.
  • Shen, Y. et al. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in Hugging Face. arXiv:2303.17580.
  • Zhang, Z. et al. (2023). Multimodal Chain-of-Thought Reasoning in Language Models. arXiv:2302.00923.
  • Driess, D. et al. (2023). PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378.
  • OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
  • Chen, X. et al. (2023). PaLI-X: On Scaling Up a Multilingual Vision and Language Model. arXiv:2305.18565.
  • Alayrac, J-B. et al. (2022). Flamingo: A Visual Language Model for Few-Shot Learning. arXiv:2204.14198.
  • Chen, X. et al. (2022). PaLI: A Jointly-Scaled Multilingual Language-Image Model. arXiv:2209.06794.

Notes

  1. Yang, Z., et al. "Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models". arXiv:2505.04921 [cs.AI], May 8, 2025.
  2. Huang, S., et al. "Language Is Not All You Need: Aligning Perception with Language Models". arXiv:2302.14045 [cs.CL], Feb 28, 2023.
  3. Raschka, Sebastian. "Understanding Multimodal LLMs". Ahead of AI Magazine.
  4. Alayrac, Jean-Baptiste, et al. "Tackling multiple tasks with a single visual language model". DeepMind Blog.
  5. "GPT-4". OpenAI.
  6. Driess, Danny, et al. "PaLM-E: An embodied multimodal language model". Google Research Blog.
  7. Lee, D., et al. "Multimodal Reasoning with Multimodal Knowledge Graph". ACL Anthology, 2024.
  8. Shen, Y., et al. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face". OpenReview.