Self-consistency prompting

From Systems Analysis Wiki

Self-Consistency (SC) is a decoding strategy in prompt engineering designed to improve the accuracy and reliability of large language models (LLMs) on problems that require multi-step reasoning, such as arithmetic and logic puzzles[1]. The method was proposed by researchers at Google Research in 2022 as an enhancement to the Chain-of-Thought (CoT) technique.

The core idea is to avoid relying on a single, "greedy" output. Instead, the method generates multiple different lines of reasoning for the same question and then selects the final answer that appears most frequently among these variants. This approach rests on an intuitive principle: if a model arrives at the same result through several different reasoning paths, that result is likely to be correct[1].

Context and Background

The Self-Consistency method is a direct extension of the Chain-of-Thought (CoT) technique. The CoT technique, proposed by Wei et al. (2022), significantly improved the ability of LLMs to solve complex problems by prompting the model to explicitly write out the steps of its solution[2]. However, the basic implementation of CoT uses "greedy decoding," where the most probable next token is chosen at each step. This creates a limitation: if the model makes an error early on, it cannot deviate from this incorrect trajectory to correct it. Self-Consistency was proposed to address this specific problem[1].

How It Works

The Self-Consistency algorithm replaces the deterministic greedy approach with a "sample-and-aggregate" procedure, which consists of the following steps[1]:

  1. Generate multiple reasoning paths: Instead of a single answer, the model generates a solution for the same prompt multiple times (e.g., up to 40 times) using the chain-of-thought method. To obtain diverse reasoning paths, stochastic decoding methods like temperature sampling (with a temperature parameter > 0) are used.
  2. Aggregate and select the answer: From all the generated reasoning chains, only the final answers (e.g., a numerical value) are extracted. Then, the most frequently occurring answer among them is chosen. This answer is presented as the final output.

This approach mimics the principle of "self-ensembling," where multiple outputs from the same model are used to increase reliability and smooth out random errors[3].

Effectiveness and Results

In the original study, Self-Consistency demonstrated a significant increase in accuracy across several popular benchmarks, especially in tasks requiring arithmetic and logical reasoning.

  • On the GSM8K math word-problem benchmark, the accuracy of the PaLM-540B model increased from 56.6% (with CoT) to 74.4% (with Self-Consistency), a gain of 17.8 percentage points.
  • On other arithmetic tasks, such as SVAMP and AQuA, the gains were +11.0 and +12.2 percentage points respectively.
  • On tasks requiring logic and common-sense reasoning, such as StrategyQA, the improvement was +6.4 percentage points[1].

The application of Self-Consistency set new state-of-the-art performance records on many benchmarks when using large models like GPT-3 175B and PaLM 540B[1].

Advantages and Limitations

Advantages

  • Increased Accuracy: Significantly improves results on tasks that require complex multi-step reasoning.
  • Reliability: The method is more robust against errors that might occur in a single reasoning chain.
  • Simplicity of Implementation: It does not require additional training or changes to the model's architecture. The method can be implemented as a simple "wrapper" around an existing model.

Limitations

  • High Computational Cost: The main drawback is the need to generate a solution many times (e.g., 10, 20, or 40 times) for a single prompt, which increases inference cost and latency roughly in proportion to the number of samples.
  • Limited Applicability: The standard method is most effective for tasks with a clearly defined answer format (e.g., a number, "yes/no," a multiple-choice option), where majority voting is straightforward. It is less applicable to open-ended generation tasks (like writing essays or summarization), where answers are unique in their form.
  • Risk of Systematic Error: If the model systematically generates incorrect reasoning paths that happen to converge on the same wrong answer, Self-Consistency will not only fail to correct the error but will also reinforce confidence in it.

Further Development: Universal Self-Consistency

The limitation of the basic method on open-ended tasks was addressed in subsequent research. In late 2023, a group of researchers from Google DeepMind proposed the Universal Self-Consistency (USC) approach[4].

In USC, instead of simple majority voting on final answers, the LLM itself is used as a "judge" for aggregation. The model generates several complete solution variants and is then given a new prompt asking it to select the "most consistent" or "highest quality" option among them. This approach allows the principles of self-consistency to be applied to tasks with open-ended and creative answer formats[5].
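The USC aggregation step can be sketched as follows. This is a simplified illustration under stated assumptions: `sample_fn` and `select_fn` are hypothetical callables wrapping LLM calls (the second acting as the "judge"), and the selection-prompt wording is a paraphrase, not the exact prompt from the paper.

```python
import re


def universal_self_consistency(sample_fn, select_fn, prompt: str,
                               n_samples: int = 8) -> str:
    """Sketch of Universal Self-Consistency (USC) aggregation.

    `sample_fn(prompt)` returns one full sampled response (temperature > 0);
    `select_fn(selection_prompt)` is a second LLM call that replies with the
    number of the most consistent response.
    """
    responses = [sample_fn(prompt) for _ in range(n_samples)]
    numbered = "\n\n".join(
        f"Response {i}:\n{r}" for i, r in enumerate(responses)
    )
    selection_prompt = (
        f"I have generated the following responses to a question:\n\n"
        f"{numbered}\n\n"
        "Evaluate these responses and select the most consistent one. "
        "Reply with the response number only."
    )
    # The model itself acts as the judge; parse the index it returns.
    choice = select_fn(selection_prompt)
    index = int(re.search(r"\d+", choice).group())
    return responses[index]
```

Because the judge compares whole responses rather than extracted answers, this scheme works even when outputs are free-form text (summaries, code, essays) and exact-match voting is impossible.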

Further Reading

  • Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
  • Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  • Aggarwal, P. et al. (2023). Let’s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs. arXiv:2305.11860.
  • Chen, X. et al. (2023). Universal Self-Consistency with Large Language Models. arXiv:2311.17311.
  • Knappe, T. et al. (2024). Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting. arXiv:2410.07839.
  • Liang, X. et al. (2024). Internal Consistency and Self-Feedback in Large Language Models: A Survey. arXiv:2407.14507.
  • Li, T. et al. (2024). Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency. arXiv:2407.21443.
  • Byerly, A.; Khashabi, D. (2024). How Effective Is Self-Consistency for Long-Context Problems?. arXiv:2411.01101.
  • Novikova, J. et al. (2025). Consistency in Language Models: Current Landscape, Challenges, and Future Directions. arXiv:2505.00268.
  • Admoni, S. et al. (2025). Towards Large Language Models with Self-Consistent Natural Language Explanations. arXiv:2506.07523.

Notes

  1. Wang, X., Wei, J., Schuurmans, D., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models". arXiv:2203.11171.
  2. Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". NeurIPS 2022.
  3. "Self-Consistency Improves Chain of Thought Reasoning in Language Models – Summary". Portkey.
  4. Chen, X., et al. (2023). "Universal Self-Consistency with Large Language Models". arXiv:2311.17311.
  5. "Universal Self-Consistency with Large Language Models". Google DeepMind Publications.