Least-to-Most Prompting
Least-to-Most Prompting (LtM) is a prompting method for large language models (LLMs) that solves complex problems by decomposing them into simpler steps and then sequentially solving these subproblems[1]. This approach was proposed in 2022 by a group of Google Brain researchers led by Denny Zhou and presented at the ICLR 2023 conference[2]. The method's primary goal is to overcome the limitations of Chain-of-Thought prompting, which struggles with problems more complex than the examples shown to the model during in-context learning[2]. Least-to-Most Prompting allows the model to generalize to problems of increased complexity while remaining interpretable and not requiring additional neural network training[2]. The method's name is borrowed from educational psychology, where "least to most prompting" refers to giving a student a series of prompts with an increasing level of assistance to master a new skill[3].
Description of the Method
The Least-to-Most Prompting method is implemented in two stages[2], each presented to the language model through carefully crafted prompts (without any additional model fine-tuning):
- Problem Decomposition. In the first stage, the model receives an instruction and examples demonstrating how to break a complex problem into a sequence of simpler subproblems. The model is then presented with a specific complex question and must output a list of simpler intermediate questions[2]. For instance, given a word problem, the model might formulate a clarifying subquestion whose answer resolves one part of the original problem.
- Sequential Solving of Subproblems. In the second stage, the model solves the resulting subproblems one by one, from the simplest to the most complex. Each subproblem is preceded by context: examples of solutions to similar subproblems and (if available) previously solved subproblems along with their answers[4]. After solving the first subproblem, its answer is appended to the prompt text, and the next subproblem is posed with the previous solutions as context[4]. This continues until the final and most complex subproblem, the one that directly answers the original question, is solved.
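The two stages above can be sketched as a simple driver loop. Here `call_llm` is a hypothetical stand-in for any LLM completion API (not part of the original paper), and for illustration the decomposition call is assumed to return a list of subquestions:

```python
# Minimal sketch of the two-stage Least-to-Most loop. `call_llm` is a
# hypothetical stand-in for an LLM completion API; for the decomposition
# prompt it is assumed to return a list of subquestions (an assumption
# made for illustration, not the paper's exact interface).

DECOMPOSE_PROMPT = "Q: {question}\nA: To answer this question, we need to know:"

def least_to_most(question: str, call_llm) -> str:
    # Stage 1: ask the model to decompose the problem into subquestions.
    sub_questions = call_llm(DECOMPOSE_PROMPT.format(question=question))

    # Stage 2: solve the subquestions in order, ending with the original
    # question; each answer is appended to the context for the next step.
    context = ""
    answer = ""
    for sub_q in list(sub_questions) + [question]:
        prompt = context + f"Q: {sub_q}\nA:"
        answer = call_llm(prompt)
        context += f"Q: {sub_q}\nA: {answer}\n"
    return answer
```

With a real model behind `call_llm`, the accumulated `context` is what lets each later subproblem build on the earlier answers.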
Example: an original word problem is solved in two stages with the Least-to-Most method. First, the model formulates and solves an intermediate question ("How long does each trip take?"), obtaining the answer "each trip takes 5 minutes". This answer is included in a new prompt together with the next subproblem, which is the original question ("How many times can she slide before it closes?"). Using the previous result, the model computes the final answer (in this example: 3 times).
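The arithmetic behind this example works out as follows; the specific figures (4 minutes to climb, 1 minute to slide down, the slide closes in 15 minutes) follow the paper's running water-slide example:

```python
# Figures as in the paper's running water-slide example (an assumption
# for illustration): 4 min to climb, 1 min to slide down, closes in 15 min.
climb_min, slide_min, closes_in_min = 4, 1, 15

trip_min = climb_min + slide_min         # subproblem: each trip takes 5 minutes
num_slides = closes_in_min // trip_min   # final answer: 3 times
```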
Fundamentally, Least-to-Most Prompting differs from the standard chain-of-thought approach by breaking the reasoning process into separate queries with accumulated knowledge, instead of generating a single continuous "chain of thought" within one response[3]. This step-by-step, recursive approach allows the model to gradually move to more complex aspects of the problem, effectively addressing the easy-to-hard generalization issue (when a model must solve a problem harder than those shown in its in-context examples)[2][3]. It is worth noting that both stages of the LtM method are implemented via few-shot prompting (demonstrating several examples) and do not require additional training or fine-tuning of the model on new data[2]. Furthermore, the method is compatible with other LLM reasoning enhancement techniques; for example, it can be combined with chain-of-thought and self-consistency (sampling multiple solutions) during response generation, although this is not necessary[1].
Experimental Results and Applications
The paper that introduced Least-to-Most Prompting showed that this method outperforms standard prompting methods (including chain-of-thought) on a range of tasks requiring complex multi-step reasoning[1]. It successfully demonstrated its advantages in three key task categories:
- Symbolic and Algorithmic Tasks. For example, in the task of concatenating the last letters of words (sequentially taking the last letter of each word in a list to form a new word), the LtM method significantly improved the model's ability to generalize to longer word sequences. Without special training, the GPT-3 model (code-davinci-002) with chain-of-thought prompts correctly solved such tasks in only about 32% of cases for a list of 12 words, whereas with Least-to-Most Prompting, the accuracy reached ~74%[1]. For short lists (of lengths seen in the examples), both strategies performed well, but as the sequence length increased, the quality of chain-of-thought dropped sharply, while Least-to-Most ensured a more gradual decline and maintained high accuracy[1]. This demonstrates the LtM method's ability to generalize the solution logic to more complex (longer) input data.
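The last-letter task itself is easy to state in code, and the decomposition LtM induces is to solve ever-longer prefixes of the list, reusing each intermediate answer. A small illustrative implementation of the ground truth and that prefix decomposition:

```python
def last_letter_concat(words):
    """Ground truth for the last-letter concatenation task."""
    return "".join(w[-1] for w in words)

def ltm_prefix_steps(words):
    """The subproblems LtM induces: each prefix extends the previous answer."""
    answer, steps = "", []
    for w in words:
        answer += w[-1]              # previous answer + last letter of the new word
        steps.append((w, answer))
    return steps
```

Because each step only appends one letter to an already-computed answer, the per-step difficulty stays constant as the list grows, which is why accuracy degrades much more gracefully than with a single monolithic chain of thought.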
- Compositional Generalization. This category of tasks includes, for example, translating text instructions into a sequence of actions (as in the SCAN benchmark, which requires executing commands like "jump twice and run" and generalizing to longer combinations)[4]. The LtM method enabled LLMs to successfully solve even the most challenging variants of such tasks. Specifically, the GPT-3 model with LtM prompts achieved 99% accuracy on all data splits of the SCAN dataset (including the hardest length split, where test sequences are longer than training ones), using only 14 examples in the prompt[2]. For comparison, the standard chain-of-thought approach yielded only about 16% accuracy under similar conditions[2]. Moreover, this was achieved without any fine-tuning on SCAN's training data, whereas previous top results on SCAN relied on specialized neuro-symbolic architectures or data-augmentation methods that used the entire training set of more than 15,000 examples[2]. Least-to-Most Prompting thus demonstrated an unprecedented ability for compositional generalization in models without fine-tuning.
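To make the compositional structure concrete, here is a toy interpreter covering only a tiny fragment of SCAN-style commands (single verbs, "twice"/"thrice", and "and"); the real SCAN grammar is considerably richer, with modifiers like "around", "opposite", directions, and "after":

```python
# Toy interpreter for a tiny fragment of SCAN-style commands; the actual
# SCAN grammar also includes "around", "opposite", "left"/"right", "after", etc.
ACTIONS = {"jump": "JUMP", "run": "RUN", "walk": "WALK", "look": "LOOK"}
REPEATS = {"twice": 2, "thrice": 3}

def execute(command: str) -> str:
    out = []
    for clause in command.split(" and "):
        words = clause.split()
        repeat = REPEATS.get(words[-1], 1)   # "twice"/"thrice" modifier, if present
        out += [ACTIONS[words[0]]] * repeat
    return " ".join(out)
```

LtM's decomposition mirrors this structure: it first translates the short clauses, then composes their translations, which is exactly what longer test commands require.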
- Mathematical Word Problems. The method was tested on arithmetic word problems, for instance, from the GSM8K dataset (complex word problems involving addition/subtraction and logic)[2], as well as on a series of questions from the DROP dataset (which test the ability to extract and count numerical information from text)[2]. Here too, Least-to-Most Prompting showed an improvement in accuracy compared to chain-of-thought. For GSM8K, using the code-davinci-002 model, answer accuracy increased from ~60.9% to ~62.4%[2]. On DROP subtasks, the gain was even more noticeable: for example, on a subset of "football" fact questions, accuracy rose from ~59.6% (chain-of-thought) to ~73.4% when applying LtM[2]. Although the quality improvement on mathematical tasks was less dramatic than in SCAN, the authors note an important point: almost any GSM8K problem can be solved correctly if the model is given the right problem decomposition[2]. This indicates that the key to successful solutions is well-formulated intermediate questions; the LtM approach is aimed at automatically generating such questions and solving them sequentially.
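For word problems, a decomposition exemplar in the prompt might look like the following few-shot snippet; the wording here is illustrative, not the paper's verbatim prompt, and the apples problem is a made-up stand-in:

```python
# A hedged sketch of one few-shot decomposition exemplar for math word
# problems; the wording is illustrative, not the paper's verbatim prompt.
DECOMPOSITION_EXEMPLAR = (
    'Q: Elsa has 5 apples. Anna has 2 more apples than Elsa. '
    'How many apples do they have together?\n'
    'A: To answer the question "How many apples do they have together?", '
    'we need to know: "How many apples does Anna have?"\n'
)

# Solving the induced subproblems in order then yields the final answer.
anna = 5 + 2          # subproblem: Anna has 7 apples
together = 5 + anna   # final question: 12 apples together
```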
In summary, the experiments confirm that Least-to-Most Prompting significantly surpasses both naive few-shot prompting without reasoning and the chain-of-thought method on many types of tasks requiring multi-step inference[1]. The method allows LLMs to solve problems more complex than those initially encountered through examples, pushing the boundaries of in-context learning (learning on the fly via prompts).
Limitations and Future Directions
Despite its successes, the Least-to-Most Prompting method has several limitations. First and foremost, different types of problems require different decomposition approaches: a prompt template that effectively breaks down a mathematical problem may be entirely unsuitable for a logical or common-sense reasoning task[2]. For example, prompts that taught the model to break math word problems into steps were useless for a common-sense question like "Did Aristotle use a laptop?", which requires a completely different decomposition strategy[2]. Therefore, for each new domain or problem type, one must select new examples of problem decomposition and craft a corresponding prompt that illustrates the solution structure[3]. In other words, the knowledge of how to properly decompose a problem is not universally generalized by the LLM itself; it must be supplied through examples for each specific class of tasks.
Moreover, the effectiveness of LtM significantly depends on how well the problem lends itself to being broken down into independent subgoals. If the model fails to correctly formulate the intermediate steps or if some necessary subproblems are missed, the final solution will also be incorrect. Nevertheless, the developers themselves note that in many cases, a failure can be turned into a success if a human manually provides the correct decomposition—the model then easily solves each part and successfully combines the answers[2]. This highlights the potential for further development of the approach: improving the quality of automatic subproblem generation and, possibly, interactive model training. In conclusion, the authors of LtM suggest that the future of prompting methods may lie in a full-fledged two-way dialogue with the model, where the model receives instant feedback and correction on its intermediate steps[2]. The Least-to-Most Prompting method can be seen as a step in this direction, showing that sequential interaction with the model through decomposition and step-by-step problem-solving can significantly expand its reasoning capabilities without training on new data[1].
Links
- Original paper “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models” on arXiv
- HTML version of the original paper
- What is Least-to-Most Prompting? — article by AI Safety Info
- Overview of the method on Medium
- A comprehensive survey of prompt engineering methods on arXiv
Notes
- [1] Zhou, Denny et al. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models". ar5iv.org.
- [2] Zhou, Denny et al. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models". arXiv.
- [3] "What is least-to-most prompting?". AI Safety Info.
- [4] OXEN AI. "Arxiv Dives Toolformer: Language models can teach themselves to use tools". Medium.