Humanity's Last Exam (benchmark)

Humanity's Last Exam (HLE) is a benchmark designed to evaluate the capabilities of advanced artificial intelligence (AI) systems on tasks that demand knowledge and reasoning comparable to that of top human experts. The benchmark was developed in 2024–2025 by the non-profit organization Center for AI Safety (CAIS) in collaboration with Scale AI[1].

The HLE project was conceived as a "final academic exam" for AI models: an exceptionally difficult test intended to show how close modern models are to expert level and where gaps in their abilities remain[1]. The benchmark comprises 2,500 extremely challenging questions covering more than one hundred disciplines[2].

Background

By the mid-2020s, major language models such as GPT-4 and Claude had demonstrated such high performance on popular test suites (e.g., MMLU) that many benchmarks ceased to be a reliable measure of progress. Standard undergraduate-level exams were effectively 'solved' by the models, making it impossible to objectively assess further improvements[3].

In this context, Dan Hendrycks, director of CAIS and a prominent AI researcher, proposed the concept of "Humanity's Last Exam": a set of questions of maximum difficulty that could still distinguish the capabilities of AI systems from those of a true expert. The impetus came from a conversation with entrepreneur Elon Musk, who argued that existing tests had become too easy[2].

To implement the idea, CAIS joined forces with Scale AI. On September 15, 2024, the organizers announced a global call for questions for the future exam: scientists and specialists from around the world were invited to submit problems capable of stumping even the most advanced AI models, with a $500,000 prize pool offered to motivate participants[3].

The selection of problems occurred in several stages. First, the submitted questions were filtered using advanced AI models: if the algorithms confidently solved a problem, it was discarded as not being difficult enough. Tasks that the AI failed to solve then underwent expert review to assess their correctness and ensure they had a single, verifiable correct answer. Ultimately, nearly 1,000 experts from over 500 academic and educational institutions contributed to the creation of the dataset[4].
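
The multi-stage selection described above can be pictured as a two-stage filtering pipeline: an automated pre-screen by frontier models, followed by expert review. The sketch below is a minimal illustration of that idea only; the ask_model and review_by_experts callables and the three-trial confidence check are assumptions made for the example, not the organizers' actual procedure.

  # Illustrative sketch of a two-stage question-selection pipeline.
  # ask_model() and review_by_experts() are hypothetical stand-ins supplied
  # by the caller; the real CAIS/Scale AI tooling is not public in this form.
  from typing import Callable, Dict, List

  def confidently_solved(question: Dict, ask_model: Callable[[str, str], str],
                         models: List[str], trials: int = 3) -> bool:
      """True if some frontier model gives the reference answer on every trial."""
      for model in models:
          answers = [ask_model(model, question["text"]) for _ in range(trials)]
          if all(a.strip() == question["answer"].strip() for a in answers):
              return True  # solved confidently, so not hard enough for HLE
      return False

  def select_questions(submissions: List[Dict],
                       ask_model: Callable[[str, str], str],
                       review_by_experts: Callable[[Dict], bool],
                       models: List[str]) -> List[Dict]:
      # Stage 1: drop anything current models already solve reliably.
      survivors = [q for q in submissions
                   if not confidently_solved(q, ask_model, models)]
      # Stage 2: expert review for correctness and a single verifiable answer.
      return [q for q in survivors if review_by_experts(q)]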

The final version of the benchmark, comprising 2,500 questions, was presented in early 2025. A portion of the questions is kept in a private holdout set for control testing and to prevent models from overfitting to a fixed set[2].

Structure and Content of the Benchmark

The HLE question set covers a vast range of academic disciplines. The questions are distributed by subject area as follows:

  • Mathematics: ~41%
  • Biology and Medicine: ~11%
  • Computer Science and AI: ~10%
  • Physics: ~9%
  • Humanities and Social Sciences: ~9%
  • Chemistry: ~7%
  • Engineering: ~4%
  • Other fields: ~9%

Approximately 14% of all tasks are multimodal, meaning that solving them requires analyzing images (drawings, diagrams, inscriptions)[2]. The majority (about three-quarters) of the tasks are open-ended short-answer questions, in which the model must produce a precise answer (a number, term, or name) on its own; the rest are multiple-choice questions.
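
Translated into rough absolute numbers, these shares imply the following question counts (a back-of-the-envelope calculation from the approximate percentages above and the 2,500-question total, not official per-subject figures):

  # Rough per-subject counts implied by the approximate shares listed above.
  TOTAL_QUESTIONS = 2500
  shares = {
      "Mathematics": 0.41,
      "Biology and Medicine": 0.11,
      "Computer Science and AI": 0.10,
      "Physics": 0.09,
      "Humanities and Social Sciences": 0.09,
      "Chemistry": 0.07,
      "Engineering": 0.04,
      "Other fields": 0.09,
  }
  for subject, share in shares.items():
      print(f"{subject}: ~{round(share * TOTAL_QUESTIONS)} questions")
  # Mathematics works out to roughly 1,025 questions, and the ~14% multimodal
  # share corresponds to roughly 350 image-based tasks.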

All tasks in HLE share common properties:

  • Extremely high difficulty: Each problem requires a level of knowledge and skill comparable to that of a qualified specialist in the field[5].
  • Verifiable answer: Each question has a specific, unambiguously correct answer against which a model's output can be checked (see the grading sketch after this list).
  • Resistance to search: The tasks are designed so that the answer cannot be found with a simple search query; success requires a deep understanding of the subject and reasoning[1].
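
Because every answer is verifiable, scoring reduces in principle to comparing a model's output against the reference answer. The snippet below is a deliberately simplified grading sketch (normalized string comparison for short answers, option-letter comparison for multiple choice); the official HLE evaluation reportedly relies on a stronger, model-assisted judge to handle equivalent phrasings, so this illustrates the idea rather than the actual harness.

  # Simplified grading sketch: normalized comparison against a reference answer.
  # The official HLE harness reportedly uses an LLM-based judge for short-answer
  # questions; this shows only the underlying principle of verifiable answers.
  import re

  def normalize(text: str) -> str:
      """Lowercase, trim, and collapse whitespace and light punctuation."""
      return re.sub(r"[\s\.,;:]+", " ", text.strip().lower()).strip()

  def grade(prediction: str, reference: str, question_type: str) -> bool:
      if question_type == "multiple_choice":
          # Compare only the chosen option letter, e.g. "C".
          return prediction.strip().upper()[:1] == reference.strip().upper()[:1]
      # Short-answer: normalized exact match against the reference.
      return normalize(prediction) == normalize(reference)

  def accuracy(records) -> float:
      """records: iterable of (prediction, reference, question_type) tuples."""
      records = list(records)
      correct = sum(grade(p, r, t) for p, r, t in records)
      return correct / len(records) if records else 0.0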

Model Performance Results

Humanity's Last Exam immediately confirmed its reputation as an extremely challenging test: no contemporary AI model was able to achieve a score close to human performance. The best language models of 2025 demonstrated very low accuracy.

  • Various versions of GPT-4 from OpenAI and Claude from Anthropic scored less than 10%[4].
  • The highest score among models answering without external tools was achieved by Gemini 2.5 Pro (Google DeepMind), with an accuracy of about 21.6%[4].
  • Even the best models failed about 4/5 of the HLE questions, highlighting the scale of the gap between current AI capabilities and the level of a human expert[1].

Of particular interest is the result of OpenAI's experimental ChatGPT Deep Research agent, which was allowed to issue search queries automatically. By simulating the workflow of a human researcher, the agent correctly solved 26.6% of the tasks, the highest HLE score recorded at the time of its release and well above what models without such tools had achieved, yet still very far from a passing grade[6].
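
The general pattern behind such a tool-using agent is a loop that interleaves web searches with reasoning before committing to a final answer. The sketch below illustrates only that pattern under assumed interfaces; search_web and ask_llm are hypothetical placeholders, and OpenAI's actual Deep Research system is far more elaborate and not public in this form.

  # Minimal sketch of a search-augmented answering loop. search_web() and
  # ask_llm() are hypothetical callables; this does not describe OpenAI's
  # actual Deep Research implementation.
  from typing import Callable, List

  def answer_with_search(question: str,
                         ask_llm: Callable[[str], str],
                         search_web: Callable[[str], List[str]],
                         max_rounds: int = 3) -> str:
      notes: List[str] = []
      for _ in range(max_rounds):
          # Ask the model what it still needs to look up.
          query = ask_llm(
              f"Question: {question}\nNotes so far: {notes}\n"
              "Reply with one web search query, or DONE if you can answer."
          )
          if query.strip().upper() == "DONE":
              break
          notes.extend(search_web(query))  # collect retrieved snippets
      # Produce the final answer conditioned on everything gathered.
      return ask_llm(f"Question: {question}\nEvidence: {notes}\nGive the final answer.")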

Significance and Outlook

The emergence of HLE was a significant event in the AI community, as the benchmark filled a pressing need for a new, more challenging measure of progress.

  • A common baseline. HLE offers researchers and policymakers an objective tool for assessing AI capabilities, allowing them to track the dynamics of improvement and understand how close machines are to the human level.
  • A tool to inform policy. The existence of such a reference test facilitates more substantive discussions about the directions of AI development, potential risks, and necessary regulatory measures.
  • The final frontier of academic testing. The very name "Last Exam" reflects the idea that this set of problems could become the final closed-book exam for evaluating AI. Successfully passing HLE would mean that, in terms of formal knowledge and rigorously verifiable reasoning skills, a machine has reached the level of the best human experts[4].

It is important to note that even a perfect score on HLE would not signify the achievement of artificial general intelligence (AGI), as the test does not evaluate creative abilities, initiative, or the skill of posing new scientific questions[4].

Given the rapid pace of progress, researchers anticipate that models could exceed 50% accuracy on HLE by the end of 2025. This would mean that machines have come very close to the human level on a narrow but important metric of academic knowledge[4].

Further Reading

  • Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
  • Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
  • Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
  • Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
  • Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
  • Ma, Z. et al. (2021). Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking. arXiv:2106.06052.
  • Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
  • Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
  • Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
  • Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
  • Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
  • Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.

References

  1. Phan, L. et al. "Humanity's Last Exam". arXiv:2501.14249, 2025.
  2. "Humanity's Last Exam". In Wikipedia.
  3. Dastin, J. & Paul, K. "AI experts ready 'Humanity's Last Exam' to stump powerful tech". Reuters, 2024.
  4. "Humanity's Last Exam". Center for AI Safety.
  5. "Could you pass 'Humanity's Last Exam'? Probably not, but neither can AI". TechRadar.
  6. "OpenAI's deep research can complete 26% of 'Humanity's Last Exam': What is it and what does it mean?". Hindustan Times.