Program of Thoughts Prompting
Program of Thoughts Prompting (PoT) is a prompt engineering method for large language models (LLMs) where the model generates program code as intermediate steps for solving a problem, instead of a textual explanation[1]. This approach allows for the separation of logical reasoning from mathematical computations: the language model formulates a solution plan as a program (e.g., in Python), and the calculations are executed by an external, deterministic code interpreter.
The method was proposed in 2022 by a group of researchers led by Wenhu Chen and primarily targets numerical and logical tasks (mathematical problems, financial calculations), where traditional reasoning methods such as Chain-of-Thought struggle with computational accuracy[1].
Background and Concept
Limitations of Chain-of-Thought
The PoT method is an evolution of the Chain-of-Thought (CoT) idea, which was previously the primary approach for improving the logical inference of LLMs[2]. In the CoT method, the model generates a sequence of intermediate steps in natural language. Although CoT significantly improves reasoning quality, it has a fundamental limitation: the model carries out both the logic and the calculations themselves in textual form. This often leads to arithmetic mistakes, rounding errors, and other numerical slips, as language models are not inherently precise calculators.
Core Idea of Program of Thoughts
The core idea of PoT is to delegate computations to an external system (a code interpreter), while requiring the language model only to formalize the solution plan as an executable program[1]. The model acts as a "programmer" rather than a "calculator."
The process works as follows:
- The model receives a task as input (e.g., a math word problem).
- Instead of textual reasoning, it generates a script in a programming language (e.g., Python) that solves the task.
- The generated code is passed to an external interpreter, which executes it.
- The result of the code execution is the final answer.
Thus, complex and precise calculations (operations with large numbers, calls to specialized libraries) are performed not by the model itself but by the program, ensuring determinism and high accuracy[3].
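The steps above can be sketched in a few lines of Python. The `generate_program` function below is a hard-coded stand-in for the actual model call, and storing the result in a variable named `ans` is an illustrative convention, not the paper's exact interface.

```python
# Sketch of the PoT loop: the LLM writes a program, an interpreter runs it.

def generate_program(question: str) -> str:
    # In practice this string would be produced by the language model.
    return (
        "loaves_baked = 200\n"
        "loaves_sold_morning = 93\n"
        "loaves_sold_afternoon = 39\n"
        "ans = loaves_baked - loaves_sold_morning - loaves_sold_afternoon\n"
    )

def solve_with_pot(question: str):
    code = generate_program(question)
    namespace = {}
    exec(code, namespace)    # deterministic execution by the Python interpreter
    return namespace["ans"]

print(solve_with_pot(
    "A baker made 200 loaves, sold 93 in the morning and 39 in the afternoon. "
    "How many loaves are left?"
))  # 68
```

Note that the model never performs the subtraction itself; it only names the quantities and states the relationship between them, and the interpreter does the rest.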
Implementation and Use of Libraries
In the implementation of PoT, the LLM's ability to generate correct and efficient code is key. The authors of the approach used the OpenAI Codex model, which was specifically trained on programming tasks. The PoT approach allows the model to leverage external libraries, significantly expanding the class of problems it can solve. For example, when solving symbolic mathematics problems, the model can generate code that uses the SymPy library to solve equations analytically, which is beyond the capabilities of purely language-based methods[1].
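The original experiments pair the model with SymPy for symbolic algebra; as a self-contained stand-in, the sketch below shows the same library-delegation pattern using only the standard-library `fractions` module, which gives exact rational arithmetic that a purely textual chain of thought often fumbles. The question and variable names are invented for illustration.

```python
# A PoT-style generated program that delegates exact arithmetic to a library.
from fractions import Fraction

# Question: "Pipe A fills 1/3 of a tank per hour and pipe B fills 1/4 per hour.
# What fraction of the tank is filled after one hour with both pipes open?"
rate_a = Fraction(1, 3)
rate_b = Fraction(1, 4)
ans = rate_a + rate_b
print(ans)  # 7/12 -- exact, with no floating-point rounding
```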
A prompt for PoT can be provided in two modes:
- Few-shot: The prompt contains several examples of "question-solution program" pairs.
- Zero-shot: The prompt provides only an instruction describing the task, without examples.
Even in zero-shot mode, PoT demonstrates high effectiveness due to the explicit structure that the model is required to generate[4].
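The two modes can be sketched as plain string templates; the exemplar and instruction wording below are hypothetical, not the exact prompts used in the paper.

```python
# Sketch of few-shot and zero-shot PoT prompt construction.

FEW_SHOT_EXEMPLAR = """Question: Alice has 3 boxes of 12 pens and gives away 7 pens. How many pens remain?
# Python solution
pens = 3 * 12
given_away = 7
ans = pens - given_away
"""

def few_shot_prompt(question: str) -> str:
    # Exemplar(s) first, then the new question in the same format.
    return f"{FEW_SHOT_EXEMPLAR}\nQuestion: {question}\n# Python solution\n"

def zero_shot_prompt(question: str) -> str:
    # No examples: only an instruction that fixes the expected output format.
    return (
        f"Question: {question}\n"
        "Write a Python program that solves the question and stores the "
        "final answer in a variable named `ans`.\n"
    )

print(few_shot_prompt("What is 17% of 240?"))
```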
Results and Effectiveness
The PoT method has demonstrated a significant improvement in the quality of solutions for tasks requiring multi-step numerical reasoning. In the original paper, it was tested on eight datasets of mathematical and financial problems, including GSM8K, AQuA, SVAMP, and FinQA.
- Improved Accuracy: PoT outperformed the baseline CoT approach on every tested dataset, with an average relative accuracy gain of about 12%.
- On the popular GSM8K math dataset, the model's accuracy with PoT reached 71.6%, compared to 63.1% with CoT.
- In financial tasks, the improvement was even more substantial: on the FinQA dataset, accuracy increased from 40.4% (CoT) to 64.5% (PoT)[1].
- Combination with Self-Consistency: The effectiveness of PoT can be further enhanced when combined with the self-consistency method. In this case, the model generates several independent solution programs, and the final answer is chosen by a "majority vote" from their execution results. Combined with self-consistency, PoT established a new state-of-the-art at the time of publication for all tested mathematical and financial benchmarks[1].
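The majority-vote step of PoT with self-consistency can be sketched as follows; the three sampled "programs" are toy stand-ins for independent LLM generations.

```python
# Majority vote over the execution results of several sampled programs.
from collections import Counter

def majority_vote(candidate_programs):
    results = []
    for code in candidate_programs:
        namespace = {}
        try:
            exec(code, namespace)
            results.append(namespace["ans"])
        except Exception:
            continue  # discard samples that fail to execute
    answer, _count = Counter(results).most_common(1)[0]
    return answer

samples = [
    "ans = (8 + 4) * 3",  # 36
    "ans = 8 + 4 * 3",    # 20 -- a faulty sample
    "ans = 3 * (8 + 4)",  # 36
]
print(majority_vote(samples))  # 36
```

Because the vote is taken over execution results rather than over raw text, two programs that are written differently but compute the same value still count as agreeing.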
Advantages and Limitations
Advantages
- Computational Accuracy: The main advantage. Executing arithmetic operations with an external interpreter eliminates rounding errors and inaccuracies inherent in LLMs.
- Ability to Use Libraries: The model can leverage powerful external libraries (e.g., for symbolic computation, statistical analysis, or date manipulation), solving problems that were previously inaccessible.
- Interpretability and Debugging: Program code provides a formal and structured representation of the solution logic, making it easier to verify and debug compared to natural language reasoning.
- Versatility: The approach is effective in both few-shot and zero-shot modes and is applicable across different domains (mathematics, finance, science).
Limitations
- Security: Executing generated code in an external interpreter creates security risks. The model could theoretically generate malicious code (e.g., to delete files). Therefore, practical application of PoT requires an isolated execution environment (a sandbox) and careful code filtering[4].
- Limited Scope of Applicability: The method is most effective for problems that can be clearly formalized as an algorithm. For tasks requiring an understanding of language nuances, common sense, or a creative approach, the direct application of PoT is challenging.
- Dependence on Code Quality: The method's effectiveness directly depends on the LLM's ability to generate syntactically correct and logically sound code.
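As a toy illustration of the code-filtering step mentioned above, the snippet below rejects generated code that imports modules or references `exec`, `eval`, or `open`. This is not a real sandbox: static checks like this are easy to evade, and production use still needs process-level isolation.

```python
# Naive pre-execution filter for LLM-generated code (illustrative only).
import ast

FORBIDDEN_NAMES = {"exec", "eval", "open", "__import__"}

def looks_safe(code: str) -> bool:
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            return False
    return True

print(looks_safe("ans = 2 + 2"))                 # True
print(looks_safe("import os\nos.remove('x')"))   # False
```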
Related Approaches
The idea of using code to improve LLM reasoning has been developed in other similar approaches as well.
- Program-Aided Language Models (PAL): A method proposed almost concurrently with PoT, which also uses Python code generation to solve problems[5]. Conceptually, PAL and PoT are very similar and confirm the effectiveness of the "reasoning via code" strategy.
- Tree of Thoughts (ToT): A more complex method that involves generating and exploring a "tree" of possible solution steps, which is an extension of the linear "chain" of thought concept. PoT can be used within the nodes of this tree to test hypotheses.
External Links
- Original scientific paper on Program of Thoughts Prompting
- Guide to PoT on the Learn Prompting portal
Bibliography
- Chen, W. et al. (2023). Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv:2211.12588.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Gao, L. et al. (2022). PAL: Program-Aided Language Models. arXiv:2211.10435.
- Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
- Chen, Z. et al. (2021). FinQA: A Dataset of Numerical Reasoning over Financial Data. arXiv:2109.00122.
- Zhu, F. et al. (2021). TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. arXiv:2105.07624.
- Patel, A. et al. (2021). Are NLP Models Really Able to Solve Simple Math Word Problems? (Introducing SVAMP). arXiv:2103.07191.
References
- ↑ Chen, W. et al. "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks". arXiv:2211.12588, 2023.
- ↑ Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". arXiv:2201.11903, 2022.
- ↑ "Program of Thoughts: Everything You Need to Know". The Ministry of AI.
- ↑ "Program of Thoughts Prompting: Enhancing Accuracy in Reasoning and Computation". Learn Prompting.
- ↑ "PAL (Program-Aided Language Models)". Prompt Engineering Guide.