Chain-of-Thought Prompting
Chain-of-Thought Prompting (CoT) is a prompt engineering technique aimed at improving the ability of large language models (LLMs) to solve complex tasks that require multi-step reasoning. Instead of generating an answer directly, a CoT prompt encourages the model to first explicitly articulate a sequence of intermediate reasoning steps that lead to the final conclusion.
This approach, which loosely mirrors how people work through a problem step by step, significantly improves model accuracy on arithmetic, logical, and symbolic reasoning tasks.
Key Idea
The fundamental principle of CoT is to have the model "think aloud" in natural language before providing the final answer. Generating these intermediate steps allows the model to:
- Decompose complex problems: The model breaks down a complex problem into smaller, more manageable sub-problems, focusing on each one sequentially.
- Minimize errors: The step-by-step process reduces the likelihood of logical errors that often occur when attempting to provide an answer in a single step.
- Improve transparency and interpretability: Users and developers can follow the model's logic, which facilitates debugging, verification, and building trust in the results.
Historical Context
The CoT technique was introduced on January 28, 2022, in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Jason Wei, Denny Zhou, and colleagues at Google Research.[1] They found that providing the model with a few examples of problems with step-by-step solutions (few-shot CoT) dramatically improves its performance on complex tasks.
This finding suggested that multi-step reasoning is an emergent ability of large models. As noted in the original paper, CoT improves performance only in models that have reached a certain scale (around 100 billion parameters or more). In smaller models the benefit is virtually absent: they may generate illogical reasoning and perform worse with CoT than without it.
Varieties of CoT Prompting
Few-Shot CoT: Learning from Examples
This is the original and most reliable CoT method.
- Principle: The model is provided with a few examples (typically 2 to 8), each consisting of a question, a chain of reasoning, and a final answer.
- Advantages: High accuracy, as the model learns a specific style and format of reasoning.
- Disadvantages: Requires the manual creation of high-quality and diverse examples.
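A minimal sketch of how a Few-Shot CoT prompt might be assembled is shown here. The `Demo` class, the `build_few_shot_cot_prompt` function, the "Q:/A:" layout, and the sample problems are all illustrative assumptions, not a format prescribed by the original paper.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One hand-written demonstration: question, reasoning chain, final answer."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(demos: list[Demo], question: str) -> str:
    """Concatenate worked demonstrations, then append the new question."""
    blocks = [
        f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}."
        for d in demos
    ]
    # The trailing "A:" invites the model to continue with its own reasoning chain.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Invented demonstration, used only to show the shape of the prompt.
demos = [
    Demo(
        question="A shelf holds 4 boxes with 6 pens each. 5 pens are removed. "
                 "How many pens remain?",
        reasoning="4 boxes of 6 pens is 24 pens. 24 - 5 = 19.",
        answer="19",
    ),
]
prompt = build_few_shot_cot_prompt(
    demos, "A baker has 3 trays of 12 rolls and sells 10. How many rolls are left?"
)
print(prompt)
```

In practice several such demonstrations would be supplied, and their reasoning style is what the model tends to imitate.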
Zero-Shot CoT: "Let's think step by step"
This method was proposed later, on May 24, 2022, in the paper "Large Language Models are Zero-Shot Reasoners" by Takeshi Kojima et al.[2] and is a significantly simpler variation.
- Principle: A simple trigger phrase is added to the original prompt, such as "Let's think step by step".
- Advantages: Simplicity, flexibility, and no need for examples.
- Disadvantages: May be less accurate than Few-Shot CoT for highly specific tasks.
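The trigger-phrase idea can be sketched as follows. The second pass, which asks the model to state only the final answer, is a common companion step and an addition to what is described above. The `fake_llm` function is a hypothetical stand-in for a real model call, and the wording of `ANSWER_CUE` is an assumption suited to numeric answers.

```python
from typing import Callable

TRIGGER = "Let's think step by step."
# Assumed extraction cue for numeric answers; other answer formats need other cues.
ANSWER_CUE = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Return (reasoning, answer) from a reasoning pass and an extraction pass."""
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = llm(reasoning_prompt)
    # Feed the generated reasoning back and ask only for the final answer.
    extraction_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_CUE}"
    answer = llm(extraction_prompt).strip()
    return reasoning, answer


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call, so the sketch runs offline."""
    if prompt.endswith(ANSWER_CUE):
        return " 26"
    return "3 trays of 12 rolls is 36 rolls. 36 - 10 = 26."


reasoning, answer = zero_shot_cot(
    "A baker has 3 trays of 12 rolls and sells 10. How many rolls are left?",
    fake_llm,
)
print(reasoning)
print(answer)
```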
Automatic CoT (Auto-CoT)
This approach, proposed in the paper by Zhang et al. (2022)[3], automates the creation of demonstrations for Few-Shot CoT.
- Principle:
  1. Questions from a new dataset are clustered.
  2. A representative question is selected from each cluster.
  3. A chain of reasoning is generated for these questions using Zero-Shot CoT.
  4. The resulting demonstrations are used to construct the prompt.
- Goal: To reduce manual effort and scale the application of CoT, achieving performance comparable to manually created examples.
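The clustering, selection, generation, and prompt-construction steps listed above can be sketched roughly as follows. This is a simplified illustration under several assumptions: question embeddings are taken as precomputed vectors (the embedding model is out of scope), the tiny `kmeans` routine stands in for a library implementation, and `llm` is a hypothetical callable. The published method may apply additional selection heuristics not shown here.

```python
import math
from typing import Callable, Sequence

ZERO_SHOT_TRIGGER = "Let's think step by step."


def kmeans(points: Sequence[tuple], k: int, iters: int = 20):
    """Tiny k-means. Centroids start at the first k points so runs are deterministic."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def select_representatives(vectors: Sequence[tuple], k: int) -> list[int]:
    """For each non-empty cluster, return the index of the point nearest its centroid."""
    labels, centroids = kmeans(vectors, k)
    reps = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            reps.append(min(members, key=lambda i: math.dist(vectors[i], centroids[c])))
    return reps


def build_auto_cot_prompt(questions, vectors, k, llm: Callable[[str], str], new_question):
    """Generate a Zero-Shot CoT chain per representative, then append the new question."""
    demos = []
    for i in select_representatives(vectors, k):
        stem = f"Q: {questions[i]}\nA: {ZERO_SHOT_TRIGGER}"
        demos.append(f"{stem} {llm(stem)}")
    demos.append(f"Q: {new_question}\nA: {ZERO_SHOT_TRIGGER}")
    return "\n\n".join(demos)


# Toy data: two visibly separate groups of made-up 2-D "embeddings".
questions = [
    "What is 2 + 3?", "How many legs do 3 cats have?",
    "What is 7 + 8?", "How many wheels do 4 cars have?",
    "What is 10 + 20 + 30?", "How many fingers do 6 hands have?",
]
vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (0.0, 5.0), (10.0, 15.0)]
reps = select_representatives(vectors, 2)
prompt = build_auto_cot_prompt(
    questions, vectors, 2,
    llm=lambda stem: "(model-generated reasoning would appear here)",
    new_question="What is 9 + 9?",
)
print(reps)
print(prompt)
```

Picking one question per cluster is intended to keep the demonstrations diverse, so that a single flawed reasoning chain has less influence on the final prompt.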
Multimodal CoT
Multimodal CoT applies the technique to tasks involving data from multiple modalities (e.g., text and images).
- Principle: The model generates reasoning that connects textual and visual information.
- Application: Analyzing diagrams, solving visual puzzles.
Mechanisms and Effectiveness
- Improved Reasoning: CoT guides the model through a structured problem-solving process, which minimizes logical errors and allows it to more effectively utilize its knowledge base.
- Empirical Evidence: The effectiveness of CoT is particularly noticeable on complex benchmarks. For example, on the GSM8K arithmetic benchmark, the basic Few-Shot CoT method increased the accuracy of the PaLM-540B model from 17.9% to 58.1%. Applying more advanced techniques built upon CoT (such as Self-Consistency) can achieve accuracies of 74–78%.
- Role of Reasoning Format: Studies have shown that even examples with incorrect intermediate steps can improve the final result, as long as the overall structure of the reasoning is maintained. This suggests that CoT primarily teaches the model the format of step-by-step thinking.
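Self-Consistency, mentioned above as a technique built on CoT, samples several reasoning chains and takes a majority vote over their final answers. The following is a minimal sketch of that voting step. The `sample` callable is a hypothetical stand-in for a model queried with sampling enabled, and treating the last number in a chain as its answer is an assumed convention.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Treat the last number in a reasoning chain as its final answer (assumption)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(prompt: str, sample: Callable[[str], str], n: int = 5) -> Optional[str]:
    """Sample n reasoning chains and return the most frequent final answer."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


# Canned chains imitate diverse samples; two of the five contain reasoning slips.
canned = iter([
    "36 - 10 = 26. The answer is 26.",
    "3 * 12 = 36, minus 10 is 26.",
    "12 - 10 = 2, times 3 is 6.",
    "36 rolls, 10 sold, so 26.",
    "3 + 12 = 15, minus 10 is 5.",
])
result = self_consistency("(a CoT prompt would go here)", lambda _prompt: next(canned))
print(result)
```

The vote recovers the majority answer even though individual chains disagree, which is the intuition behind the reliability gains.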
Relationship with Other Techniques
CoT is a fundamental component of more advanced methods:
- Self-Consistency: Samples several different CoT chains for a single question and selects the most frequent final answer by majority vote. This significantly improves reliability, yielding absolute accuracy gains on benchmarks such as GSM8K[4] (+17.9%), SVAMP[6] (+11.0%), and AQuA[7] (+12.2%).[5]
- Tree of Thoughts (ToT): Generalizes CoT by exploring not just one, but an entire tree of possible reasoning paths. Unlike the linear chain of CoT, ToT allows the model to explore multiple branches, evaluate intermediate "thoughts," and backtrack when a path is found to be unpromising. This enables solving even more complex problems where simple linear reasoning is insufficient (for example, increasing the success rate on the "Game of 24" puzzle[8] from 4% to 74%)[9].
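The branching search behind ToT can be illustrated with a toy breadth-first sketch on the Game of 24. Here both the proposal of "thoughts" and their evaluation are plain Python stand-ins for what would be LLM calls, and the distance-to-target heuristic in `evaluate` is an assumption chosen for simplicity. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def propose(state):
    """Generate child thoughts: combine any two numbers with one arithmetic operation."""
    children = []
    for i, j in combinations(range(len(state)), 2):
        (a, ea), (b, eb) = state[i], state[j]
        rest = tuple(x for k, x in enumerate(state) if k not in (i, j))
        options = [
            (a + b, f"({ea} + {eb})"), (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"), (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            options.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            options.append((b / a, f"({eb} / {ea})"))
        children.extend(rest + (opt,) for opt in options)
    return children


def evaluate(state, target=24):
    """Stand-in for an LLM value judgment: a smaller score means more promising."""
    return min(abs(value - target) for value, _ in state)


def tree_of_thoughts(numbers, target=24, beam_width=5):
    """Breadth-first search that keeps only the beam_width best states per level."""
    frontier = [tuple((Fraction(n), str(n)) for n in numbers)]
    for _ in range(len(numbers) - 1):
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=lambda s: evaluate(s, target))
        frontier = candidates[:beam_width]  # prune unpromising branches
    for state in frontier:
        value, expr = state[0]
        if value == target:
            return expr
    return None


# A very wide beam disables pruning, so a solution is found whenever one exists.
solution = tree_of_thoughts([4, 9, 10, 13], beam_width=10_000)
print(solution)
```

With a narrow beam, this crude heuristic may prune the branch that leads to a solution, which shows why the quality of the evaluator matters in this kind of search. Backtracking, as described above, would correspond to a depth-first variant rather than the breadth-first one sketched here.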
See Also
- Large language models
- Prompt engineering
- Emergence
Further Reading
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
- Zhou, D. et al. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625.
- Zhang, Z. et al. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493.
- Lyu, Q. et al. (2023). Faithful Chain-of-Thought Reasoning. arXiv:2301.13379.
- Ling, Z. et al. (2023). Deductive Verification of Chain-of-Thought Reasoning. arXiv:2306.03872.
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
- Lightman, H. et al. (2023). Let’s Verify Step by Step. arXiv:2305.20050.
- Yang, B. et al. (2025). Hallucination Detection in Large Language Models with Metamorphic Relations. arXiv:2502.15844.
Notes
- [1] Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv, January 10, 2023. https://doi.org/10.48550/arXiv.2201.11903.
- [2] Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. "Large Language Models are Zero-Shot Reasoners." arXiv, January 29, 2023. https://doi.org/10.48550/arXiv.2205.11916.
- [3] Zhang, Zhuosheng, Aston Zhang, Mu Li, and Alex Smola. "Automatic Chain of Thought Prompting in Large Language Models." arXiv, October 7, 2022. https://doi.org/10.48550/arXiv.2210.03493.
- [4] "openai/gsm8k · Datasets at Hugging Face", July 17, 2023. https://huggingface.co/datasets/openai/gsm8k.
- [5] Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv, March 7, 2023. https://doi.org/10.48550/arXiv.2203.11171.
- [6] Patel, Arkil. "arkilpatel/SVAMP". Python, 2021. https://github.com/arkilpatel/SVAMP.
- [7] "autonlab/aqua". Jupyter Notebook. 2022. Reprint, Auton Lab, Carnegie Mellon University, 2022. https://github.com/autonlab/aqua.
- [8] "24 (puzzle)". In Wikipedia.
- [9] Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv, December 3, 2023. https://doi.org/10.48550/arXiv.2305.10601.