GSM8K (Grade School Math 8K)
GSM8K (Grade School Math 8K) is a benchmark dataset containing approximately 8,500 grade-school-level math word problems. It was created in 2021 by researchers at OpenAI to evaluate and advance the multi-step mathematical reasoning capabilities of large language models (LLMs)[1]. GSM8K has become one of the key benchmarks for measuring progress in mathematical reasoning by AI systems.
Each problem in the dataset is a short word problem that requires 2 to 8 sequential arithmetic operations (addition, subtraction, multiplication, division) to solve. Despite their apparent simplicity, the problems demand a deep understanding of the text and logical reasoning, making them challenging for many LLMs[2].
Key Characteristics
Volume and Structure
The GSM8K dataset contains approximately 8,500 problems, divided into two parts:
- Training set: ~7,500 problems intended for fine-tuning models. Each problem is accompanied by a detailed step-by-step solution.
- Test set: ~1,000 problems used for independent evaluation of model performance[1].
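Each training example stores its rationale and final answer in a single string: intermediate calculations are wrapped in calculator annotations such as <<48/2=24>>, and the final numeric answer follows a "#### " delimiter. A minimal Python sketch of parsing that format (the helper name is ours, not part of the dataset):

```python
import re

def parse_gsm8k_answer(answer_text: str) -> tuple[str, str]:
    """Split a GSM8K answer string into its step-by-step rationale
    and the final numeric answer after the '#### ' delimiter."""
    rationale, _, final = answer_text.rpartition("#### ")
    # Calculator annotations like <<48/2=24>> are embedded in the
    # rationale; strip them to leave plain prose steps.
    rationale = re.sub(r"<<[^>]*>>", "", rationale).strip()
    return rationale, final.strip()

# A record in the dataset's format (abridged from the first
# training example):
example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether.\n"
    "#### 72"
)
steps, answer = parse_gsm8k_answer(example)
print(answer)  # -> 72
```

The "#### " convention is what makes automated scoring on the test set straightforward: an evaluator only needs to compare final answers, regardless of how the reasoning was worded.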
Complexity and Content
The problems are intentionally designed to be solvable by a capable middle school student but require multi-step reasoning. This allows for testing not just a model's mathematical knowledge, but its ability to decompose a problem and perform logical operations sequentially.
Linguistic Diversity
The problem statements in GSM8K feature a wide variety of styles and linguistic constructions. This is done to test a model's ability to understand problem conditions expressed in different ways and to avoid "memorizing" specific templates[3].
History and Evolution of Model Evaluation
Early Models and Baseline Results
In the original 2021 paper, the authors demonstrated that even large models of that time, such as GPT-3 (175 billion parameters), struggled significantly with the dataset. After fine-tuning and using a supplementary verifier model, the solution accuracy reached only about 55%[1]. This result showed that a single small error in the reasoning chain could lead to a completely incorrect answer.
Breakthrough Techniques: Chain-of-Thought
A breakthrough in solving GSM8K problems came with chain-of-thought (CoT) prompting. In 2022, researchers at Google showed that prompting a model to explicitly write out its solution steps before giving the final answer significantly increases accuracy: the PaLM model (540 billion parameters) achieved 58% accuracy using CoT[4]. Applying the further technique of self-consistency (sampling multiple solution paths and choosing the most frequent final answer) raised accuracy to 74%[4].
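The self-consistency step is just majority voting over sampled answers. A minimal sketch, where `sample_solution` and the toy model are illustrative stand-ins for stochastic LLM calls, not part of any published implementation:

```python
import random
from collections import Counter

def self_consistency(sample_solution, problem: str, n: int = 10) -> str:
    """Self-consistency decoding: sample several independent
    chain-of-thought solutions and return the most frequent final
    answer. `sample_solution` stands in for one stochastic model call
    that returns a final answer string."""
    answers = [sample_solution(problem) for _ in range(n)]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer

# Toy stand-in "model": answers correctly 70% of the time, so
# majority voting over many samples usually recovers the right answer.
random.seed(0)
def noisy_model(problem):
    return "72" if random.random() < 0.7 else "68"

print(self_consistency(noisy_model, "Natalia sold 48 clips...", n=15))
```

The design intuition: an incorrect reasoning chain can derail at many different points and produce scattered wrong answers, while correct chains tend to converge on the same final answer, so the mode of the samples is more reliable than any single sample.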
Surpassing Human-Level Performance
Starting in 2023, state-of-the-art generative models began to surpass typical human performance on this benchmark.
- GPT-4 from OpenAI, using a few-shot CoT setting (where a few solved examples are provided in the prompt), achieved an accuracy of about 92%[5], and up to 97% with additional strategies[6].
- Anthropic's Claude 2 showed a result of 88%, while the newer Claude 3 achieved about 95%[3].
Such high scores indicate significant progress in the reasoning abilities of LLMs, but they also mean that GSM8K is nearly saturated for state-of-the-art models, which has driven the development of more challenging benchmarks such as MATH.
Role in Model Training and Development
Beyond evaluation, GSM8K is actively used for training and improving models.
- Fine-tuning: The training set with its step-by-step solutions is a valuable resource for fine-tuning models on mathematical logic.
- Training verifiers: In the original OpenAI paper, a portion of the GSM8K data was used to train a separate verifier model, which evaluated the correctness of the generated solutions. This approach of separately training a generator and a critic proved to be effective[1].
- Prompt engineering: The large number of examples has allowed researchers to develop and refine prompting techniques, such as chain-of-thought and tree-of-thoughts, which elicit step-by-step reasoning without changing a model's weights.
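The generator-plus-verifier approach from the original paper amounts to best-of-n reranking: sample several candidate solutions, score each with the verifier, and keep the highest-scoring one. A minimal sketch, with toy stand-ins (our own, purely illustrative) in place of the actual generator and trained verifier models:

```python
import random

def best_of_n(generate, verify, problem: str, n: int = 8) -> str:
    """Generate-then-verify reranking in the style of the original
    GSM8K paper: sample n candidate solutions from a generator, score
    each with a verifier, and return the highest-scoring candidate.
    `generate` and `verify` are stand-ins for real model calls."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: verify(problem, solution))

# Toy stand-ins so the sketch runs without a model:
def toy_generate(problem):
    return random.choice(["#### 72", "#### 68", "#### 70"])

def toy_verify(problem, solution):
    # A real verifier is trained on solutions labeled correct or
    # incorrect; this one simply prefers the known-correct answer.
    return 1.0 if solution.endswith("72") else 0.0

random.seed(0)
print(best_of_n(toy_generate, toy_verify, "Natalia sold 48 clips..."))
```

Unlike self-consistency, which needs no extra model, this approach spends training effort on a separate critic, trading compute at training time for higher accuracy at sampling time.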
Further Reading
- Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
- Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
- Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
- Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
- Ma, Z. et al. (2021). Dynaboard: An Evaluation‑As‑A‑Service Platform for Holistic Next‑Generation Benchmarking. arXiv:2106.06052.
- Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
- Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
- Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
- Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
- Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
Notes
- [1] Cobbe, Karl et al. (2021). "Training Verifiers to Solve Math Word Problems". arXiv:2110.14168.
- [2] "GSM8K Dataset". Papers With Code.
- [3] "GSM8K Benchmark". Klu.ai.
- [4] Wei, Jason et al. (2022). "Language Models Perform Reasoning via Chain of Thought". Google Research Blog.
- [5] Yu, L. et al. (2023). "Solving Challenging Math Word Problems Using GPT-4". EMNLP 2023.
- [6] "Achieving >97% on GSM8K" (2024). arXiv:2404.14963.