HellaSwag Benchmark
HellaSwag is a benchmark dataset introduced in 2019 to evaluate how well artificial intelligence models perform commonsense reasoning in natural language[1]. The benchmark was developed by a team of researchers from the University of Washington and the Allen Institute for Artificial Intelligence.
The task in HellaSwag is to choose the most plausible completion for a given text context. The key feature of the dataset is that it is trivial for humans but challenging even for advanced language models that rely on superficial statistical patterns[2].
History and Background
HellaSwag extends the ideas behind the SWAG (Situations With Adversarial Generations) dataset, proposed by the same group of authors in 2018. In the SWAG task, models had to select the most likely continuation for a description of a simple situation. SWAG was initially difficult for algorithms, but the BERT model soon reached about 86% accuracy on it, nearly matching human performance[2].
This success raised doubts: does BERT truly "understand" the text, or has it merely learned to recognize statistical artifacts and patterns present in the dataset? The authors of HellaSwag hypothesized that BERT's high score was due not to genuine understanding but to overfitting to the specifics of the dataset. They showed that even with a slight change in the data distribution, BERT's accuracy dropped sharply. This meant that to objectively evaluate progress in NLP, a new, more difficult and "tricky" benchmark was needed[2].
Dataset Description and Goals
HellaSwag was created as a test designed to reveal the limitations of modern models in understanding cause-and-effect relationships and everyday scenarios.
Task Structure
Each example in HellaSwag consists of two parts:
- Context: A short paragraph (up to three sentences) describing the beginning of a situation.
- Four completion options: Four possible endings to the story, also consisting of several sentences.
Only one of these endings is correct (the real one), while the other three are false, generated specifically to confuse the model.
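The item structure described above can be sketched as a small data type with an accuracy function. This is an illustrative sketch only: the class name `HellaSwagItem`, its field names, and the example text are assumptions made for this illustration, not the dataset's official schema or content.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HellaSwagItem:
    """One multiple-choice item (hypothetical field names)."""
    context: str             # short paragraph that starts the situation
    endings: Sequence[str]   # exactly four candidate endings
    label: int               # index (0-3) of the real ending

    def __post_init__(self) -> None:
        if len(self.endings) != 4:
            raise ValueError("expected exactly four endings")
        if not 0 <= self.label < 4:
            raise ValueError("label must index one of the four endings")


def accuracy(items: Sequence[HellaSwagItem], predictions: Sequence[int]) -> float:
    """Fraction of items whose predicted index equals the label."""
    if len(items) != len(predictions):
        raise ValueError("one prediction per item is required")
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for it, pred in zip(items, predictions) if pred == it.label)
    return correct / len(items)


# Invented example text, not taken from the dataset.
item = HellaSwagItem(
    context="A person picks up a jar of pickles and grips the lid.",
    endings=(
        "They twist the lid until it pops open.",
        "They pour the jar into the car engine.",
        "They fold the jar and put it in an envelope.",
        "They paint the lid while swimming laps.",
    ),
    label=0,
)
```

A model is scored by having it predict one index per item; `accuracy([item], [0])` evaluates to `1.0`, and a model guessing uniformly at random would be expected to score about 0.25.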
Data Sources
The situations were sourced from two datasets covering a wide range of everyday scenarios:
- ActivityNet Captions: Descriptions of actions from video clips (e.g., "a person opens a jar of pickles").
- WikiHow: Instructions from articles (e.g., "how to change a car tire").
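The goal of HellaSwag is to provide a benchmark that humans solve easily and intuitively but that is as difficult as possible for models lacking genuine commonsense reasoning. The authors describe the target level of difficulty as a "Goldilocks" zone, in which generated endings look ridiculous to humans yet are often misclassified by state-of-the-art models[1].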
Adversarial Filtering (AF) Method
The central technique in creating HellaSwag was Adversarial Filtering (AF), a method first introduced with SWAG. AF is an iterative process of selecting "traps" designed for a specific "victim" model. It produces false options that, from the perspective of statistical models, are deceptively similar to the correct ones.
The AF process works as follows:
- Generation. Based on the initial context, a generator language model (e.g., GPT) creates numerous potential incorrect endings.
- Discrimination. A classifier model (e.g., BERT), acting as the "victim," attempts to distinguish the generated continuations from the real (correct) one.
- Selection. The false options that the classifier rated as most plausible, that is, the ones most likely to fool it, are kept.
- Iteration. The process is repeated many times, until the classifier can no longer reliably tell the incorrect endings from the correct one.
- Human Verification. In the final stage, the resulting sets (context + 1 correct ending + 3 best false endings) are evaluated by humans. The evaluators confirm that the correct option is unambiguously the most natural one and that all alternatives contain some form of illogicality noticeable to a person[2].
Thanks to AF, each example in HellaSwag is constructed from the outset to mislead the model while remaining transparent to humans.
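The generation, discrimination, selection, and iteration steps above can be sketched as a short loop. This is a toy sketch under stated assumptions, not the authors' pipeline: `toy_generate` stands in for a generator language model, `toy_score` (simple word overlap) stands in for the "victim" classifier and is held fixed across rounds, and all function names and parameters are hypothetical. The human verification step is not modeled.

```python
import random
import re
from typing import Callable, List, Set

Generator = Callable[[str, random.Random], str]
Scorer = Callable[[str, str], float]


def adversarial_filter(
    context: str,
    real_ending: str,
    generate: Generator,
    score: Scorer,
    n_distractors: int = 3,
    n_rounds: int = 5,
    pool_size: int = 20,
    seed: int = 0,
) -> List[str]:
    """Return the distractors that the scorer rates as most plausible."""
    rng = random.Random(seed)
    distractors: List[str] = []
    for _ in range(n_rounds):
        # Generation: sample fresh candidate endings for this context.
        fresh = [generate(context, rng) for _ in range(pool_size)]
        # Deduplicate in order; the real ending is never kept as a distractor.
        pool = [e for e in dict.fromkeys(distractors + fresh) if e != real_ending]
        # Discrimination + selection: keep what the "victim" finds most plausible.
        pool.sort(key=lambda e: score(context, e), reverse=True)
        distractors = pool[:n_distractors]
    return distractors


def _words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_generate(context: str, rng: random.Random) -> str:
    """Stand-in generator: random verb/object templates (assumption)."""
    verbs = ["opens", "paints", "eats", "throws", "washes"]
    objects = ["the jar", "the car", "the lid", "a tire", "the pickles"]
    return f"The person {rng.choice(verbs)} {rng.choice(objects)}."


def toy_score(context: str, ending: str) -> float:
    """Stand-in 'victim': share of ending words that also occur in the context."""
    ending_words = _words(ending)
    return len(_words(context) & ending_words) / max(len(ending_words), 1)


CONTEXT = "A person picks up a jar of pickles and grips the lid."
REAL_ENDING = "The person opens the jar."
distractors = adversarial_filter(CONTEXT, REAL_ENDING, toy_generate, toy_score)
```

Because the stand-in scorer rewards surface word overlap, the surviving distractors are the ones that reuse words from the context, which mirrors on a small scale how AF selects endings that exploit a model's reliance on superficial patterns.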
Results and Significance
HellaSwag became a rigorous test for text-understanding models. The initial results revealed a large gap between machine and human performance:
- Humans solve HellaSwag tasks almost flawlessly, with an accuracy of about 95-96%[2].
- The best model at the time of its creation, BERT-Large, achieved only ~47% accuracy. Simpler methods did little better than random guessing (25%)[2].
This gap of more than 45 percentage points supported the hypothesis that high scores on earlier benchmarks did not signify genuine understanding. HellaSwag showed that even after training on vast amounts of data, the models of the time had not developed commonsense reasoning that generalized to new situations.
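The size of the gap follows directly from the figures quoted above. The short calculation below is only a worked illustration: it uses the lower end of the human range and the approximate BERT-Large score, and the variable names are arbitrary.

```python
# Accuracies in percent, taken from the figures quoted above.
HUMAN_PCT = 95          # lower end of the 95-96% human range
BERT_LARGE_PCT = 47     # approximate BERT-Large accuracy
CHANCE_PCT = 100 // 4   # uniform guessing among four endings

human_gap = HUMAN_PCT - BERT_LARGE_PCT          # distance from human level
margin_over_chance = BERT_LARGE_PCT - CHANCE_PCT  # distance from guessing

print(human_gap)           # 48
print(margin_over_chance)  # 22
```

On these numbers, BERT-Large sat closer to random guessing than to human performance.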
In subsequent years, HellaSwag became one of the standard benchmarks for new language models, and the progress of AI systems could be tracked by their performance on it:
- In 2020, the GPT-3 model (175 billion parameters) showed an accuracy of ~79% in few-shot mode, surpassing many specialized models of that era but still significantly lagging behind human performance[3].
- It wasn't until 2023 that new-generation models like GPT-4 were able to achieve human-comparable results on HellaSwag (around 95% accuracy)[4].
The creation of HellaSwag marked a new approach to evaluating progress in NLP, based on the idea of evolving benchmarks: as models improve, it is necessary to create new, more challenging tests that expose their weaknesses.
Links
- Official HellaSwag project website
- Research paper "HellaSwag: Can a Machine Really Finish Your Sentence?"
Further Reading
- Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
- Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
- Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
- Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
- Ma, Z. et al. (2021). Dynaboard: An Evaluation‑As‑A‑Service Platform for Holistic Next‑Generation Benchmarking. arXiv:2106.06052.
- Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
- Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
- Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
- Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
- Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
References
- [1] Zellers, R. et al. "HellaSwag: Can a Machine Really Finish Your Sentence?". Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- [2] Zellers, R. et al. "HellaSwag: Can a Machine Really Finish Your Sentence?". arXiv:1905.07830, 2019.
- [3] Brown, T. B. et al. "Language Models are Few-Shot Learners". arXiv:2005.14165, 2020.
- [4] Zellers, R. et al. "HellaSwag Project Page".