Tree of Thoughts (ToT)
Tree of Thoughts (ToT) is an innovative framework for guiding the reasoning of Large Language Models (LLMs), enabling them to perform deliberate problem-solving by systematically exploring multiple reasoning paths. The concept was introduced in 2023 by researchers from Princeton University and Google DeepMind[1].
ToT is an extension and generalization of the popular "chain of thought" (CoT) prompting technique. Unlike CoT, where reasoning follows a single linear sequence of steps, ToT organizes the thought process as a tree, where each node is an intermediate state ("thought"), and the branches are possible paths for reasoning development. This allows the model to explore multiple options in parallel, evaluate their potential, return to previous steps when dead ends are encountered (backtracking), and make deliberate choices[1][2].
How It Works
The ToT framework organizes the problem-solving process as a search through a tree of states. Its operation is based on the cyclical interaction of four key components[1]:
1. Decomposing the problem into "thoughts": The initial problem is broken down into smaller sub-problems or steps, called "thoughts." Unlike ordinary generation, which proceeds token by token, a "thought" in ToT is a coherent, semantically meaningful unit (e.g., an intermediate equation in a math problem or a paragraph in a writing plan) that moves the model a step closer to a solution.
2. Thought generation: At each step, for the current state (a node in the tree), the model generates several potential next "thoughts" (branches). Two strategies are used for this:
- Sampling: The model independently generates several continuation options. This is suitable for creative tasks where a wide range of ideas is beneficial.
- Proposing: The model generates several distinct options sequentially within a single prompt, which avoids duplicates and is more effective for tasks with a constrained solution space.
3. State evaluation: The generated "thoughts" are evaluated by the LLM itself to estimate their promise, either by scoring each state independently — numerically (e.g., a value from 1 to 10) or categorically ("sure," "likely," "impossible") — or by voting over several candidate states at once. This evaluation acts as a heuristic that steers the search toward promising branches.
4. Search algorithm: To systematically explore the tree of thoughts, classic search algorithms are used:
- Breadth-First Search (BFS): Expands the tree level by level, keeping only the b most promising states at each step (similar to beam search). It explores shallow trees systematically but must hold more states in memory.
- Depth-First Search (DFS): Explores one branch to its end before backtracking to try another. It is more memory-efficient and suitable for tasks with a deep but not overly wide search space.
This framework mimics human problem-solving by combining intuitive idea generation (using an LLM) with deliberate, systematic planning and exploration of alternatives[2].
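The four components above can be combined into a breadth-first search loop. The sketch below is a minimal illustration, not the authors' implementation: `generate` and `evaluate` are toy stand-ins for what would be LLM calls (thought generation and state evaluation), applied here to a trivial numeric task.

```python
def tot_bfs(root, generate, evaluate, beam_width=5, depth=3):
    """Breadth-first Tree-of-Thoughts search: expand every state in the
    frontier, score the children, and keep only the best `beam_width`."""
    frontier = [root]
    for _ in range(depth):
        children = [child for state in frontier for child in generate(state)]
        if not children:
            break
        children.sort(key=evaluate, reverse=True)  # state evaluation as heuristic
        frontier = children[:beam_width]
    return max(frontier, key=evaluate)

# Toy stand-ins for the two LLM calls: states are digit sequences, and the
# goal is a 4-digit sequence (digits 1-3) whose sum is as close to 10 as possible.
generate = lambda state: [state + [d] for d in (1, 2, 3)]  # thought generation
evaluate = lambda state: -abs(10 - sum(state))             # value heuristic

best = tot_bfs([], generate, evaluate, beam_width=3, depth=4)
print(best, sum(best))  # a length-4 sequence summing to 10
```

In a real ToT system, both stand-ins would be prompt templates sent to the model, and `beam_width` corresponds to the b parameter reported in the experiments below.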
Comparison with Other Reasoning Methods
ToT vs. Chain of Thought (CoT)
ToT is a direct generalization of CoT. If CoT can be envisioned as a tree with a branching factor of 1, ToT allows for exploring a tree with an arbitrary branching factor. This provides key advantages[3]:
- Exploration of alternatives: ToT can consider multiple solution paths, whereas CoT is confined to a single linear path.
- Backtracking capability: ToT allows the model to "go back" if a line of reasoning reaches a dead end, which is not possible in CoT.
- Global planning: ToT enables strategic choices based on the evaluation of multiple future steps.
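The backtracking behavior that distinguishes ToT from CoT can be sketched as a depth-first search that prunes thoughts the evaluator rejects and retreats from dead ends. Here `generate`, `evaluate`, and `is_solution` are hypothetical stand-ins for LLM calls, demonstrated on a toy task.

```python
def tot_dfs(state, generate, evaluate, is_solution, max_depth, depth=0):
    """Depth-first Tree-of-Thoughts search: follow one branch as deep as it
    goes, prune thoughts the evaluator rejects, and backtrack on dead ends."""
    if is_solution(state):
        return state
    if depth == max_depth:
        return None                      # branch exhausted
    for child in generate(state):
        if not evaluate(child):          # evaluator says "impossible": prune
            continue
        found = tot_dfs(child, generate, evaluate, is_solution, max_depth, depth + 1)
        if found is not None:
            return found
    return None                          # all children failed: caller backtracks

# Toy stand-ins for the LLM calls: build a digit sequence (digits 1-3)
# that sums to exactly 7, pruning any prefix that already overshoots.
generate = lambda s: [s + [d] for d in (1, 2, 3)]
evaluate = lambda s: sum(s) <= 7         # "possible" iff we have not overshot
is_solution = lambda s: sum(s) == 7

print(tot_dfs([], generate, evaluate, is_solution, max_depth=4))  # → [1, 1, 2, 3]
```

A plain CoT run corresponds to taking the first branch at every step with no `return None` path: once a prefix proves unworkable, there is no mechanism to try a sibling.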
ToT vs. Self-Consistency
Self-Consistency generates multiple independent "chains of thought" and selects the most frequent final answer by majority vote. This improves the reliability of CoT, but the chains never interact: there is no exploration of a branched solution structure, no evaluation of intermediate steps, and no backtracking. ToT, in contrast, shows more significant improvements on complex planning tasks, where steering the search between partial attempts matters as much as the attempts themselves[1].
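The voting step at the heart of Self-Consistency is simple to illustrate. In this sketch the sampled chains are replaced by a fixed list of hypothetical final answers; a real implementation would sample each chain independently from an LLM.

```python
from collections import Counter

def self_consistency(final_answers):
    """Self-consistency: take the final answers of several independently
    sampled chains of thought and return the majority-vote winner."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers from 10 sampled chains: individual chains are
# noisy, but the correct answer (24) dominates the vote.
sampled_answers = [24, 24, 18, 24, 12, 24, 24, 30, 24, 18]
print(self_consistency(sampled_answers))  # → 24
```

Note that the vote only compares end results; nothing in this procedure inspects or reuses the intermediate reasoning steps, which is exactly the gap ToT addresses.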
Experimental Results
The authors of ToT demonstrated its effectiveness on three tasks requiring non-trivial planning or search.
- Game of 24: A mathematical puzzle where the goal is to obtain the number 24 from four given numbers using basic arithmetic operations. Standard prompting with GPT-4 achieved a 7.3% success rate, while Chain of Thought achieved 4%. ToT with Breadth-First Search (b=5) reached a 74% success rate, an 18.5-fold improvement over CoT[1][4].
- Creative Writing: In a task involving the generation of a coherent four-paragraph text with predefined last sentences, texts created using ToT received an average coherence score of 7.56 out of 10, compared to 6.15 for CoT. In 41 out of 100 comparisons, humans preferred the text generated by ToT, versus 21 for CoT[5].
- Mini Crosswords (5x5): Using depth-first search, ToT correctly filled 60% of the words and completely solved 20% of the puzzles, whereas CoT completely solved only 1%[6].
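For context, the Game of 24 task mentioned above is small enough to solve by exhaustive search. The brute-force checker below is not part of ToT; it merely illustrates what a successful reasoning path must find by trying every ordering, operator choice, and parenthesization of the four numbers.

```python
from itertools import permutations, product

def solve_24(nums, target=24):
    """Exhaustive search over arithmetic expressions on four numbers;
    returns one expression reaching the target, or None."""
    ops = ["+", "-", "*", "/"]
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            # The five possible parenthesizations of three binary operations.
            for expr in (f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                         f"({a}{o1}{b}){o2}({c}{o3}{d})",
                         f"({a}{o1}({b}{o2}{c})){o3}{d}",
                         f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                         f"{a}{o1}({b}{o2}({c}{o3}{d}))"):
                try:
                    if abs(eval(expr) - target) < 1e-6:
                        return expr
                except ZeroDivisionError:
                    continue
    return None

print(solve_24([4, 9, 10, 13]))  # e.g. an expression like (10-4)*(13-9)
```

ToT does not search this space exhaustively; instead the LLM proposes a handful of promising intermediate equations per step and the evaluator prunes the rest, which is why the token cost, while high, stays far below brute force.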
Limitations and Future Directions
Despite its impressive results, the ToT framework has several limitations:
- Computational Complexity: ToT requires significantly more computational resources (5–100 times more tokens) than standard methods due to the need to generate and evaluate numerous "thoughts"[1].
- Implementation Complexity: Implementing ToT requires significant engineering effort to create and configure all components: the thought generator, the state evaluator, and the search algorithm.
- Dependence on Evaluation Quality: The effectiveness of the entire framework heavily depends on the LLM's ability to adequately evaluate intermediate states, which is not always guaranteed.
Future research is focused on improving efficiency, automating optimization, and integrating ToT with other methods, such as reinforcement learning, to create smarter and more autonomous agents.
Links
- Official Tree of Thoughts repository on GitHub.
- Tree of Thoughts (ToT) — a guide on Prompt Engineering Guide.
Further Reading
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
- Zhang, Z. et al. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493.
- Lyu, Q. et al. (2023). Faithful Chain-of-Thought Reasoning. arXiv:2301.13379.
- Ling, Z. et al. (2023). Deductive Verification of Chain of Thought Reasoning. arXiv:2306.03872.
- Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
- Besta, M. et al. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv:2308.09687.
- Lightman, H. et al. (2023). Let’s Verify Step by Step. arXiv:2305.20050.
- Lanham, T. et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702.
- Yang, B. et al. (2025). Hallucination Detection in Large Language Models with Metamorphic Relations. arXiv:2502.15844.
References
- [1] Yao, S., Yu, D., Zhao, J., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models". arXiv:2305.10601.
- [2] "What is Tree of Thoughts Prompting?". IBM.
- [3] "Tree of Thoughts vs Chain of Thought". Substack.
- [4] "...18.5 times improvement...". arXiv.
- [5] "...41 out of 100 comparisons...". OpenReview.
- [6] "...CoT: 1% success rate...". arXiv.