HellaSwag Benchmark

HellaSwag is a benchmark dataset introduced in 2019 to evaluate the ability of artificial intelligence models to perform commonsense natural language inference, that is, to choose the most plausible continuation of an everyday situation described in natural language^[1]. It was developed by Rowan Zellers, Yejin Choi and colleagues at the University of Washington and the Allen Institute for Artificial Intelligence (AI2). The name is a backronym for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations; SWAG (Situations With Adversarial Generations) is the earlier dataset that it extends^[2].

The key feature of the dataset is that it is trivial for humans yet was challenging for the strong pretrained models of its time. The authors argued that such models scored well on earlier benchmarks by exploiting dataset-specific statistical patterns rather than by performing robust commonsense inference^[2].

History and Background

HellaSwag is a follow-up to the SWAG (Situations With Adversarial Generations) dataset, created in 2018 by an overlapping group of authors (Zellers, Bisk, Schwartz and Choi). In the SWAG task, models had to select the most likely continuation for a description of a simple situation. Initially, SWAG was hard for algorithms: then-state-of-the-art models scored below 60%, against human accuracy of about 88%. With the advent of the BERT model, however, performance on SWAG exceeded 86%, nearly matching humans^[2].

This success raised a question: does BERT truly "understand" the text, or has it merely learned to recognize statistical artifacts present in the dataset? The authors of HellaSwag hypothesized that BERT's high score reflected sensitivity to dataset-specific biases rather than general understanding. They tested this by building a harder dataset and showed that performance drops sharply when the data distribution shifts even slightly, even within the same domain: a BERT-Large model trained on SWAG scored only 34.6% when transferred to HellaSwag. The authors concluded that tracking progress in NLP objectively required a new and more difficult benchmark^[2].

Dataset Description and Goals

HellaSwag was created to expose the limitations of contemporary models on everyday, physically grounded situations that are simple for people.

Task Structure

Each example in HellaSwag consists of two parts:

Context: A short passage (about three sentences) describing the beginning of a situation.
Four completion options: Four candidate endings, each about two sentences long.

Only one of these endings is the real (correct) continuation; the other three are machine-generated distractors, selected specifically to fool models.
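The structure described above can be pictured as a small record type. The following is a minimal sketch, not the dataset's actual schema: the class name, field names, the sample item, and the `toy_score` function are all hypothetical and written only for illustration. A real evaluation would replace `toy_score` with a model-based score.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class HellaSwagItem:
    """One multiple-choice item: a context, four endings, one gold label."""
    context: str
    endings: Sequence[str]  # exactly four candidate endings
    label: int              # index (0-3) of the real continuation

    def __post_init__(self) -> None:
        if len(self.endings) != 4:
            raise ValueError("expected exactly four endings")
        if not 0 <= self.label < 4:
            raise ValueError("label must index one of the four endings")


def predict(item: HellaSwagItem, score: Callable[[str, str], float]) -> int:
    """Return the index of the ending that score(context, ending) rates highest."""
    scores = [score(item.context, ending) for ending in item.endings]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical item invented for this sketch; it is not taken from the dataset.
item = HellaSwagItem(
    context="A man sets a jar of pickles on the counter. He grips the lid.",
    endings=[
        "He pours the counter into the jar.",
        "He twists the lid until it pops open.",
        "He paints the lid with a toothbrush.",
        "He folds the jar and puts it in his pocket.",
    ],
    label=1,
)


def toy_score(context: str, ending: str) -> float:
    """Stand-in for a model: rewards endings that mention opening the lid."""
    return 1.0 if "lid" in ending and "open" in ending else 0.0


prediction = predict(item, toy_score)
print(prediction == item.label)
```

Keeping the scorer as a plain function argument separates the data format from the model, so the same `predict` helper works with any scoring rule.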

Data Sources

The situations were drawn from two datasets covering a wide range of everyday scenarios:

ActivityNet Captions: Descriptions of actions from video clips (for example, a person opening a jar of pickles). Endings from this source are relatively short.
WikiHow: Instructions from how-to articles (for example, how to change a car tire). This source provides longer and more varied contexts and continuations.

The goal of HellaSwag is to build a benchmark that humans solve intuitively but that maximally challenges models lacking commonsense inference. To make the distractors deceptive, the authors targeted a Goldilocks zone of text length and complexity: endings long enough that their implausibility is obvious to a human, yet not so long that the discriminator model can easily detect the machine-generated text from statistical cues. In practice this corresponds to roughly three sentences of context followed by two-sentence endings^[2].

Size and splits

The released dataset contains about 70,000 examples. After filtering, the authors kept the 25,000 best ActivityNet contexts and the 45,000 best WikiHow contexts. The validation and test sets each contain 10,000 examples, and each is split evenly into an in-domain subset and a zero-shot subset of 5,000 examples; the zero-shot subsets measure generalization to activity or how-to categories not seen during training^[2].
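As a quick consistency check, the rounded figures quoted above can be added up. This sketch uses only the numbers stated in the text; the variable names are arbitrary, and the exact counts in the released files may differ slightly from these rounded values.

```python
# Rounded figures quoted in the text above.
activitynet_contexts = 25_000
wikihow_contexts = 45_000
total = activitynet_contexts + wikihow_contexts

in_domain = 5_000
zero_shot = 5_000
eval_split = in_domain + zero_shot  # size of the validation set, and of the test set

held_out = 2 * eval_split           # validation and test together
wikihow_share = wikihow_contexts / total

print(total, eval_split, held_out, f"{wikihow_share:.1%}")
# prints: 70000 10000 20000 64.3%
```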

Adversarial Filtering (AF) Method

The key technique behind HellaSwag is Adversarial Filtering (AF), an iterative procedure first introduced with SWAG that selects machine-generated "traps" which are hard for a discriminator model while remaining obvious to humans. In the original work, the incorrect endings were produced by a fine-tuned version of the first OpenAI GPT model, and a BERT-Large classifier served as the discriminator (the "victim").

The AF process works as follows:

Generation. Given a context, the generator language model produces many candidate incorrect endings.
Discrimination. The discriminator model (BERT-Large) attempts to distinguish the generated continuations from the real one.
Selection. The false options that the discriminator finds most plausible (the ones it is most likely to accept as real) are kept.
Iteration. The process is repeated, with new discriminators trained on fresh splits, so that the resulting dataset stays hard regardless of the final train/test split.
Human Verification. Finally, the resulting sets (context + one correct ending + the best distractors) are checked by human annotators, who confirm that the correct option is clearly the most natural and that the alternatives contain some human-noticeable implausibility^[2].
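The generation, discrimination, selection and iteration steps above can be sketched as a loop. This is a toy illustration under strong assumptions, not the authors' implementation: a bag-of-words scorer stands in for BERT-Large, a random sampler over a fixed pool stands in for the GPT generator, the human verification step is not modeled, and every name and data item below is hypothetical.

```python
import math
import random
from collections import Counter


def train_discriminator(items):
    """Fit a toy bag-of-words scorer; a higher score means 'looks more real'."""
    real, fake = Counter(), Counter()
    for item in items:
        real.update(item["real"].split())
        for ending in item["distractors"]:
            fake.update(ending.split())
    n_real, n_fake = sum(real.values()), sum(fake.values())
    vocab = max(1, len(set(real) | set(fake)))

    def score(ending):
        # Sum of add-one-smoothed log-odds (real vs. fake) over the words.
        return sum(
            math.log(((real[w] + 1) / (n_real + vocab))
                     / ((fake[w] + 1) / (n_fake + vocab)))
            for w in ending.split()
        )

    return score


def adversarial_filter(items, generate, rounds=3, n_candidates=6, seed=0):
    """Return copies of `items` whose distractors were hardened over `rounds`."""
    rng = random.Random(seed)
    items = [dict(item, distractors=list(item["distractors"])) for item in items]
    for _ in range(rounds):
        # Iteration: a fresh random split and a freshly trained discriminator.
        rng.shuffle(items)
        half = len(items) // 2
        score = train_discriminator(items[:half])
        for item in items[half:]:
            # Generation: propose new wrong endings for this context.
            candidates = generate(item["context"], n_candidates, rng)
            pool = (set(item["distractors"]) | set(candidates)) - {item["real"]}
            # Discrimination and selection: keep the endings rated most real.
            ranked = sorted(pool, key=lambda e: (-score(e), e))
            item["distractors"] = ranked[: len(item["distractors"])]
    return items


# Hypothetical toy data, invented for this sketch.
POOL = [
    "he eats the ladder",
    "she opens the jar slowly",
    "the dog flies away",
    "he tightens the bolts again",
    "she sings to the tire",
    "he wipes the counter clean",
]


def toy_generate(context, n, rng):
    """Stand-in generator: samples wrong endings from a fixed pool."""
    return [rng.choice(POOL) for _ in range(n)]


items = [
    {"context": f"context {i}", "real": f"the real ending number {i}",
     "distractors": POOL[i:i + 3]}
    for i in range(4)
]
filtered = adversarial_filter(items, toy_generate)
```

Retraining the discriminator on a new split in every round is what keeps the selected distractors from being tuned to a single train/test partition.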

Thanks to AF, each example in HellaSwag is constructed to mislead models while remaining transparent to humans.

Results and Significance

HellaSwag became a demanding test for text-understanding models, revealing a large gap between machine and human performance:

Humans solve HellaSwag almost flawlessly, with accuracy of about 95–96% (95.6% on the test set)^[2].
The best model at the time of release, BERT-Large, reached only 47.3% overall. Models without strong pretraining performed close to random guessing (25%)^[2].

The gap of about 48 percentage points (95.6% for humans versus 47.3% for BERT-Large) supported the hypothesis that high scores on earlier tests did not necessarily reflect robust commonsense inference. HellaSwag showed that even after training on large amounts of data, models did not reliably generalize commonsense reasoning to new situations.
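The numbers above are plain multiple-choice accuracies, so the chance level and the human-model gap follow from simple arithmetic. The sketch below assumes predictions and labels are given as integer indices; the function name and the sample lists are hypothetical, while the two reported accuracies are the figures quoted in the text.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of items where the predicted ending index equals the gold index."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


NUM_ENDINGS = 4
chance = 1 / NUM_ENDINGS  # uniform random guessing over four endings

human_pct = 95.6          # reported human accuracy, in percent
bert_large_pct = 47.3     # reported BERT-Large accuracy, in percent
gap_points = round(human_pct - bert_large_pct, 1)

# Hypothetical predictions for four items, three of them correct.
example = accuracy([1, 0, 3, 2], [1, 2, 3, 2])

print(chance, gap_points, example)
# prints: 0.25 48.3 0.75
```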

In the following years, HellaSwag became one of the standard benchmarks for new language models, and progress in the field could be tracked by performance on it:

In 2020, GPT-3 (175 billion parameters) reached 79.3% accuracy in the few-shot setting, surpassing many specialized models of the era but remaining well below human performance^[3].
In 2023, OpenAI reported human-comparable performance for GPT-4 on HellaSwag: 95.3% in a 10-shot setting, close to the human baseline of 95.6%^[4].

The creation of HellaSwag illustrated an approach to evaluation based on evolving benchmarks: as models improve, new and harder tests are needed to expose their remaining weaknesses.

Limitations and current status

HellaSwag was highly discriminative in 2019, but frontier models have since approached the human baseline: GPT-4 scores 95.3% against a human baseline of 95.6%, and strong open models such as Llama 3 and Qwen routinely exceed 85–90%. As a result, HellaSwag is now used less as a probe of the limits of top-tier models and more as a basic sanity check, while remaining a standard component of evaluation frameworks (such as the Language Model Evaluation Harness) and leaderboards (such as the Hugging Face Open LLM Leaderboard)^[5].

This near-saturation reflects broader concerns in language-model evaluation, including benchmark saturation and the risk that test data leaks into training corpora (data contamination). Later studies have also questioned the quality of some HellaSwag items, pointing to annotation and label errors that can affect measured accuracy. These issues motivate the continued development of newer and more robust benchmarks.

Literature

Zellers, R.; Bisk, Y.; Schwartz, R.; Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. arXiv:1808.05326.
Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
Biderman, S. et al. (2024). Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv:2405.14782.
Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.

References

[1] Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. "HellaSwag: Can a Machine Really Finish Your Sentence?". Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800.
[2] Zellers, R. et al. "HellaSwag: Can a Machine Really Finish Your Sentence?". arXiv:1905.07830, 2019.
[3] Brown, T. B. et al. "Language Models are Few-Shot Learners". arXiv:2005.14165, 2020.
[4] OpenAI. "GPT-4 Technical Report". arXiv:2303.08774, 2023.
[5] Zellers, R. et al. "HellaSwag Project Page".
