MMLU Benchmark
MMLU (Measuring Massive Multitask Language Understanding) is a benchmark designed to evaluate the capabilities of large language models (LLMs) across a wide range of subject areas. It was developed in 2020 by a team of researchers led by Dan Hendrycks at UC Berkeley and presented at the ICLR conference in 2021[1].
The goal of MMLU is to test how well a model has absorbed the diverse knowledge and skills acquired during pre-training by evaluating it in a zero-shot or few-shot setting, without additional fine-tuning. MMLU was created as a more challenging alternative to existing tests such as GLUE and SuperGLUE, on which many models had already reached human-level performance by 2020[2].
Description and Content
MMLU consists of 15,908 multiple-choice questions covering 57 different disciplines. The subjects include:
- STEM subjects (mathematics, physics, biology, computer science).
- Humanities and social sciences (history, literature, law, management).
- Applied and professional fields (medicine, accounting, business)[1].
The difficulty ranges from an elementary level to an advanced professional level. The questions are drawn from real exam materials from schools, universities, and professional tests such as the GRE and USMLE[1]. Each question offers four answer choices, so random guessing yields an expected accuracy of 25%. Achieving a high score requires both broad encyclopedic knowledge and reasoning ability.
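The four-option format and the resulting 25% chance baseline can be illustrated with a short sketch. The prompt layout below follows the common multiple-choice style used by MMLU evaluation harnesses, but the exact template wording and the sample question are illustrative assumptions, not taken from the dataset:

```python
import random

# An illustrative MMLU-style item (not an actual dataset question).
question = {
    "subject": "astronomy",
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
    "answer": 1,  # index of the correct choice ("Mars")
}

LETTERS = "ABCD"

def format_prompt(item):
    """Render one item in the usual lettered multiple-choice layout."""
    lines = [f"The following is a multiple choice question about {item['subject']}.",
             "",
             item["question"]]
    for letter, choice in zip(LETTERS, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def random_guess_accuracy(n_items=100_000, seed=0):
    """Empirically confirm that uniform guessing over 4 options scores ~25%."""
    rng = random.Random(seed)
    correct = sum(rng.randrange(4) == 1 for _ in range(n_items))  # true answer fixed at index 1
    return correct / n_items

print(format_prompt(question))
print(f"random-guess accuracy: {random_guess_accuracy():.3f}")
```

In a few-shot setting, several solved items in this same format are simply concatenated before the test question.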
Results and Development
When MMLU was released in 2020, most LLMs performed only slightly better than random guessing. The best result came from GPT-3 (175B parameters), which answered ~43.9% of questions correctly; by comparison, human experts were estimated at ~90% accuracy[1]. This gap confirmed the difficulty and high standard of the new benchmark.
Over time, MMLU became one of the most popular tests for LLMs, achieving "gold standard" status in the reports of leading AI companies[3]. By 2023-2024, the latest models, such as GPT-4, Google's Gemini Ultra, and Anthropic's Claude 3.5, had approached human-level performance, achieving ~85-90% accuracy[2][3].
This rapid progress led to a gradual "saturation" of the benchmark: leading models began to achieve near-maximum scores, which diminished MMLU's ability to differentiate between their intellectual capabilities. This has spurred the community to develop new, more difficult tests[3].
Limitations and Criticism
Despite its widespread use, MMLU has several significant limitations.
Data Quality and Correctness
In June 2024, researchers manually re-annotated 3,000 MMLU questions sampled across 30 subjects and found a significant number of errors[4].
- Approximately 6.5% of all MMLU questions contain errors in their labeling or wording.
- In certain categories, the proportion of incorrect questions is very high. For example, in the "Virology" section, 57% of the questions contained errors (multiple correct answers, incorrect wording, or the wrong reference answer).
This means that even a perfect model cannot score 100% on the original dataset, and some of the improvements in metrics may be due to the model memorizing systematic errors in the set[4].
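The ceiling effect from mislabeled items can be quantified directly: if a fraction e of questions carries a wrong reference answer, a model that answers every question correctly can still only be *scored* correct on the remaining 1 − e. The sketch below applies the ~6.5% estimate reported for MMLU (a simplification that ignores questions with multiple valid answers, where a model might still match the reference label by luck):

```python
def max_measurable_accuracy(error_rate):
    """Upper bound on measured accuracy when a fraction of reference labels is wrong.

    Assumes every mislabeled question is scored against the wrong label,
    so even a perfect model loses those points.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction in [0, 1]")
    return 1.0 - error_rate

# Using the ~6.5% error estimate reported for MMLU:
print(f"{max_measurable_accuracy(0.065):.1%}")  # 93.5%
```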
Evaluation Methodology and Data Leakage
- Lack of a standardized testing protocol. Different developers use different prompts and few-shot settings, making direct comparison of model results difficult.
- Data contamination. There is a risk that questions and answers from public benchmarks might be included in the training datasets of LLMs. In such cases, the model effectively "knows" the correct answers, making the evaluation unfair[3].
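A common, though imperfect, heuristic for spotting such contamination is to check whether long word n-grams from benchmark questions appear verbatim in a training corpus. The function below is a minimal sketch of that idea; the corpus and questions are made-up strings standing in for real data, and real contamination checks must also handle paraphrases that this exact-match test misses:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text, case-folded."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, corpus, n=8):
    """Flag a question if any of its n-grams occurs verbatim in the corpus."""
    return bool(ngrams(question, n) & ngrams(corpus, n))

# Toy corpus that happens to contain a benchmark question verbatim.
corpus = ("unrelated web text what is the time complexity of binary "
          "search on a sorted array of n elements more unrelated text")
leaked = "What is the time complexity of binary search on a sorted array of n elements?"
fresh = "Which sorting algorithm has the best average-case running time?"

print(is_contaminated(leaked, corpus))  # overlap found
print(is_contaminated(fresh, corpus))  # no overlap
```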
Derivative Versions and Extensions
To address the problems of the original MMLU, several variants have been created.
- MMLU-Redux. A corrected and refined version of the dataset, introduced in June 2024. It includes 3,000 relabeled questions from 30 categories and is designed for more reliable model evaluation without distortions caused by data errors[4].
- MMLU-Pro. An expanded and more difficult version of the test, introduced in 2024. It contains over 12,000 questions, each with ten answer choices instead of four, reducing the probability of a correct random guess to 10%. The questions were expert-verified and include new tasks drawn from more challenging sources[5].
- MMMLU (Multilingual MMLU). A multilingual version released by OpenAI in 2024. The entire MMLU test set was professionally translated into 14 languages, including both widely spoken ones (Spanish, French, Chinese) and low-resource ones (e.g., Yoruba). This allows model capabilities to be evaluated and compared across languages[6].
Further Reading
- Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
- Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
- Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
- Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
- Ma, Z. et al. (2021). Dynaboard: An Evaluation‑As‑A‑Service Platform for Holistic Next‑Generation Benchmarking. arXiv:2106.06052.
- Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
- Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
- Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
- Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
- Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
References
- [1] Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding". ICLR. arXiv:2009.03300.
- [2] "MMLU". Wikipedia.
- [3] "The AI industry lacks useful ways of measuring performance". New Savanna Blog, 2024.
- [4] Gema, A. P. et al. (2024). "Are We Done with MMLU?". arXiv:2406.04127.
- [5] "MMLU Pro". Vals.ai, 2025.
- [6] "openai/MMMLU". Hugging Face Datasets.