HumanEval Benchmark
HumanEval is a benchmark dataset designed to objectively evaluate the quality of code generated by artificial intelligence models from a natural-language problem description[1]. It was introduced in July 2021 by OpenAI researchers led by Mark Chen and has become one of the key standards for measuring the functional correctness of generated programs.
The development of HumanEval was driven by the need for a reliable method to evaluate code generation. Previously, the quality of code generated by language models was often assessed using indirect metrics (such as BLEU) or manual inspection, which did not guarantee the actual functional correctness of the programs. HumanEval addresses this problem by focusing on functional correctness: generated code is evaluated not on its syntactic similarity to a reference solution, but on its ability to successfully pass a set of automated unit tests[1].
Structure of the Task Set
The benchmark consists of 164 programming problems, handwritten specifically for this dataset to ensure they were not present in the models' training sets. All problems are formulated in Python and presented as code snippets with descriptions[2].
Each problem includes:
- Function signature: The function name and its parameters.
- Text description: An English-language docstring specifying the required functionality.
- Function body: Left blank; the model must complete it with generated code.
For each problem, there are also elements hidden from the model:
- Canonical solution: A correct (reference) implementation of the function.
- A set of unit tests: Used to automatically verify the correctness of the generated code. The tests cover both standard and edge cases.
The problems cover a wide range of topics, from basic language constructs and algorithms to simple mathematics, making the dataset diverse and challenging for models.
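The structure above can be sketched as follows. The problem below is invented purely for illustration and is not an actual dataset entry:

```python
# --- Visible to the model: signature + docstring; the body is left blank ---
def sum_of_evens(nums: list) -> int:
    """Return the sum of all even numbers in nums.
    >>> sum_of_evens([1, 2, 3, 4])
    6
    """
    ...  # the model generates this part

# --- Hidden from the model: canonical solution and unit tests ---
def canonical_sum_of_evens(nums: list) -> int:
    return sum(x for x in nums if x % 2 == 0)

def check(candidate):
    assert candidate([1, 2, 3, 4]) == 6
    assert candidate([]) == 0          # edge case: empty list
    assert candidate([-2, 2, 3]) == 0  # edge case: negative numbers

check(canonical_sum_of_evens)  # the canonical solution passes its own tests
```

A model's output is substituted for the function body and judged solely by whether the hidden `check` function runs without raising an assertion.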
Model Evaluation Methodology
The primary success metric on HumanEval is pass@k, which measures the fraction of problems a model solves functionally correctly[1]. A solution is considered successful if the generated code passes all automated tests for that problem.
- pass@1: The main and most commonly used metric. It represents the percentage of problems solved by the model on the first attempt (i.e., when generating a single solution candidate per problem).
- pass@k: In the general case, this metric indicates the fraction of problems for which at least one of k generated solution candidates passes all tests. For example, pass@10 shows the proportion of problems the model can solve if given up to 10 attempts for each.
Since directly computing pass@k requires a large number of samples, the authors proposed a statistically sound method to calculate this metric, which provides an unbiased estimate[1].
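Their estimator is compact: for each problem, generate n ≥ k samples, count the number c that pass all tests, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch (the reference implementation expresses the same quantity as a numerically stable product):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.
    n: total samples generated, c: samples that passed all tests,
    k: evaluation budget. Averaging over all problems gives pass@k."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct sample out of 2, the estimated pass@1 is 0.5
print(pass_at_k(2, 1, 1))  # → 0.5
```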
Practical testing is conducted in a sandboxed environment: the generated code is executed against the problem's unit tests. This approach measures the model's ability to produce executable, correct code, rather than code that is merely syntactically similar to a reference solution.
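The core of such a check can be sketched as below. This is an illustration only, with hypothetical function names: the official harness additionally isolates execution in a separate process with timeouts and resource limits, which is omitted here.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate solution, then its unit tests, in a fresh
    namespace; the solution passes iff nothing raises an exception."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the candidate function
        exec(test_src, env)       # run the hidden unit tests against it
        return True
    except Exception:
        return False

correct = "def inc(x):\n    return x + 1"
buggy = "def inc(x):\n    return x"
tests = "assert inc(1) == 2\nassert inc(-1) == 0"
print(passes_tests(correct, tests), passes_tests(buggy, tests))  # → True False
```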
Results and Impact on the Industry
Initial experiments on HumanEval revealed a significant gap between general-purpose models and models specifically trained on code.
- In 2021, the OpenAI Codex model (12 billion parameters, trained on code from GitHub) managed to solve ~28.8% of the problems on its first attempt (pass@1).
- At the same time, the larger language model GPT-3 (175 billion), which was not trained on code, failed to correctly solve a single problem[1].
These results highlighted the necessity of specialized training on programming data for successful code generation. After its introduction, HumanEval quickly became the standard test for comparing the progress of new models.
- Models in the GPT-3.5 series (early 2023) achieved a pass@1 score of ~72%.
- The GPT-4 model (2023) demonstrated even higher results, reaching ~67% in its base version and exceeding 85% after additional fine-tuning[3].
- Open-source models, such as Code Llama (Meta, 2023) and WizardCoder (2023), also showed strong performance (~53% and ~57% pass@1 respectively), surpassing early proprietary models[4].
Extensions and Variants of the Benchmark
The success of HumanEval inspired the creation of several derivative versions designed to evaluate models under a broader range of conditions.
- CL-HumanEval (Cross-Lingual HumanEval): A cross-lingual variant (2024) where the original HumanEval problems are adapted to test a model's ability to understand descriptions in different languages (besides English) while generating code in Python[5].
- Multilingual HumanEval: An extension (2023) aimed at evaluating code generation in 12 different programming languages (including Java, C#, JavaScript, etc.) based on English descriptions[6].
- HumanEval-XL: A large-scale benchmark (2024) that combines both approaches. It contains problems in 12 programming languages and text descriptions translated into 23 world languages, including Russian, Chinese, Arabic, and others. In total, HumanEval-XL comprises over 22,000 "description-code" pairs[7].
Notes
- [1] Chen, M. et al. "Evaluating Large Language Models Trained on Code". arXiv:2107.03374, 2021.
- [2] "openai/openai_humaneval". Hugging Face Datasets.
- [3] Niu, C. et al. "On Evaluating the Efficiency of Source Code Generated by LLMs". Proceedings of the 2nd International Conference on AI-generated Content, 2024.
- [4] Lutfullaev, J. "HumanEval on LLMs Revisited in Late 2023". arXiv:2402.14852, 2024.
- [5] Sato, M. et al. "CL-HumanEval: A Benchmark for Evaluating Cross-Lingual Transfer through Code Generation". Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, 2024.
- [6] Athiwaratkun, B. et al. "Multilingual HumanEval: A Multilingual Code Generation Benchmark". arXiv:2307.11892, 2023.
- [7] Liu, J. et al. "HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization". LREC-COLING 2024.