Self-consistency prompting

From Systems Analysis Wiki

Self-Consistency (SC) is a decoding strategy in prompt engineering designed to improve the accuracy and reliability of large language models (LLMs) on problems that require multi-step reasoning, such as arithmetic and logic puzzles[1]. The method was proposed by researchers at Google Research in 2022 as an enhancement to the Chain-of-Thought (CoT) technique.

The core idea is to avoid relying on a single, "greedy" output. Instead, the method generates multiple different lines of reasoning for the same question and then selects the final answer that appears most frequently among these variants. This approach rests on an intuitive principle: if a model arrives at the same result through several different reasoning paths, that result is likely to be correct[1].

Context and Background

The Self-Consistency method is a direct extension of the Chain-of-Thought (CoT) technique. The CoT technique, proposed by Wei et al. (2022), significantly improved the ability of LLMs to solve complex problems by prompting the model to explicitly write out the steps of its solution[2]. However, the basic implementation of CoT uses "greedy decoding," where the most probable next token is chosen at each step. This creates a limitation: if the model makes an error early on, it cannot deviate from this incorrect trajectory to correct it. Self-Consistency was proposed to address this specific problem[1].

How It Works

The Self-Consistency algorithm replaces the deterministic greedy approach with a "sample-and-aggregate" procedure, which consists of the following steps[1]:

  1. Generate multiple reasoning paths: Instead of a single answer, the model generates a solution for the same prompt multiple times (e.g., up to 40 times) using the chain-of-thought method. To obtain diverse reasoning paths, stochastic decoding methods like temperature sampling (with a temperature parameter > 0) are used.
  2. Aggregate and select the answer: From all the generated reasoning chains, only the final answers (e.g., a numerical value) are extracted. Then, the most frequently occurring answer among them is chosen. This answer is presented as the final output.

This approach mimics the principle of "self-ensembling," where multiple outputs from the same model are used to increase reliability and smooth out random errors[3].

Effectiveness and Results

In the original study, Self-Consistency demonstrated a significant increase in accuracy across several popular benchmarks, especially in tasks requiring arithmetic and logical reasoning.

  • On the GSM8K math word-problem benchmark, the accuracy of the PaLM-540B model increased from 56.6% (with CoT) to 74.4% (with Self-Consistency), a gain of 17.8 percentage points.
  • On other arithmetic tasks, such as SVAMP and AQuA, the gains were +11.0 and +12.2 percentage points respectively.
  • On tasks requiring logic and common-sense reasoning, such as StrategyQA, the improvement was +6.4 percentage points[1].

The application of Self-Consistency set new state-of-the-art performance records on many benchmarks when using large models like GPT-3 175B and PaLM 540B[1].

Advantages and Limitations

Advantages

  • Increased Accuracy: Significantly improves results on tasks that require complex multi-step reasoning.
  • Reliability: The method is more robust against errors that might occur in a single reasoning chain.
  • Simplicity of Implementation: It does not require additional training or changes to the model's architecture. The method can be implemented as a simple "wrapper" around an existing model.

Limitations

  • High Computational Cost: The main drawback is the need to generate a solution many times (e.g., 10, 20, or 40 times) for a single prompt, which increases inference cost and latency roughly in proportion to the number of samples.
  • Limited Applicability: The standard method is most effective for tasks with a clearly defined answer format (e.g., a number, "yes/no," a multiple-choice option), where majority voting is straightforward. It is less applicable to open-ended generation tasks (like writing essays or summarization), where answers are unique in their form.
  • Risk of Systematic Error: If the model systematically generates incorrect reasoning paths that happen to converge on the same wrong answer, Self-Consistency will not only fail to correct the error but will also reinforce confidence in it.

Further Development: Universal Self-Consistency

The limitation of the basic method on open-ended tasks was addressed in subsequent research. In late 2023, a group of researchers from Google DeepMind proposed the Universal Self-Consistency (USC) approach[4].

In USC, instead of simple majority voting on final answers, the LLM itself is used as a "judge" for aggregation. The model generates several complete solution variants and is then given a new prompt asking it to select the "most consistent" or "highest quality" option among them. This approach allows the principles of self-consistency to be applied to tasks with open-ended and creative answer formats[5].
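The USC aggregation step can be sketched as follows. This is a simplified illustration under stated assumptions: `sample_fn` and `select_fn` are hypothetical callables wrapping LLM calls (the second acting as the "judge"), and the selection-prompt wording is a paraphrase, not the exact prompt from the paper.

```python
import re


def universal_self_consistency(sample_fn, select_fn, prompt: str,
                               n_samples: int = 8) -> str:
    """Sketch of Universal Self-Consistency (USC) aggregation.

    `sample_fn(prompt)` returns one full sampled response (temperature > 0);
    `select_fn(selection_prompt)` is a second LLM call that replies with the
    number of the most consistent response.
    """
    responses = [sample_fn(prompt) for _ in range(n_samples)]
    numbered = "\n\n".join(
        f"Response {i}:\n{r}" for i, r in enumerate(responses)
    )
    selection_prompt = (
        f"I have generated the following responses to a question:\n\n"
        f"{numbered}\n\n"
        "Evaluate these responses and select the most consistent one. "
        "Reply with the response number only."
    )
    # The model itself acts as the judge; parse the index it returns.
    choice = select_fn(selection_prompt)
    index = int(re.search(r"\d+", choice).group())
    return responses[index]
```

Because the judge compares whole responses rather than extracted answers, this scheme works even when outputs are free-form text (summaries, code, essays) and exact-match voting is impossible.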

Further Reading

  • Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
  • Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  • Aggarwal, P. et al. (2023). Let’s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs. arXiv:2305.11860.
  • Chen, X. et al. (2023). Universal Self-Consistency with Large Language Models. arXiv:2311.17311.
  • Knappe, T. et al. (2024). Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting. arXiv:2410.07839.
  • Liang, X. et al. (2024). Internal Consistency and Self-Feedback in Large Language Models: A Survey. arXiv:2407.14507.
  • Li, T. et al. (2024). Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency. arXiv:2407.21443.
  • Byerly, A.; Khashabi, D. (2024). How Effective Is Self-Consistency for Long-Context Problems?. arXiv:2411.01101.
  • Novikova, J. et al. (2025). Consistency in Language Models: Current Landscape, Challenges, and Future Directions. arXiv:2505.00268.
  • Admoni, S. et al. (2025). Towards Large Language Models with Self-Consistent Natural Language Explanations. arXiv:2506.07523.

Notes

  1. Wang, X., Wei, J., Schuurmans, D., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models". arXiv:2203.11171.
  2. Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". NeurIPS 2022.
  3. "Self-Consistency Improves Chain of Thought Reasoning in Language Models – Summary". Portkey.
  4. Chen, X., et al. (2023). "Universal Self-Consistency with Large Language Models". arXiv:2311.17311.
  5. "Universal Self-Consistency with Large Language Models". Google DeepMind Publications.