WinoGrande Benchmark

From Systems Analysis Wiki

WinoGrande is a large-scale benchmark dataset designed to evaluate the common-sense reasoning abilities of artificial intelligence systems. It contains approximately 44,000 problems based on the Winograd Schema Challenge (WSC) format, but significantly expanded and made more difficult using an "adversarial" filtering method to eliminate statistical cues[1].

The dataset was developed in 2019 by researchers at the Allen Institute for AI and the University of Washington. Each problem presents a sentence with a blank and two candidate answers; picking the correct one requires understanding the context and the situation described. WinoGrande has become one of the key benchmarks in the field of Natural Language Processing (NLP)[2].
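The two-choice format can be illustrated with a short sketch. The field names below follow the publicly released dataset (a sentence containing a "_" blank, two options, and a gold answer); the example sentence is adapted from the AfLite filtering example discussed later in this article, and the toy scorer merely stands in for a language model's plausibility estimate.

```python
# A WinoGrande problem pairs a sentence containing a blank ("_") with two
# candidate fillers, exactly one of which is correct.
item = {
    "sentence": "The lions ate the zebras because _ are predators.",
    "option1": "the lions",
    "option2": "the zebras",
    "answer": "1",  # "1" means option1 is correct
}

def resolve(item, score):
    """Pick the option whose filled-in sentence the scorer prefers.

    `score` stands in for a language model's plausibility estimate,
    e.g. the total log-likelihood of the completed sentence.
    """
    filled = [item["sentence"].replace("_", item[key])
              for key in ("option1", "option2")]
    best = max(range(2), key=lambda i: score(filled[i]))
    return str(best + 1)  # "1" or "2", matching the answer field

# Toy scorer (an assumption for this sketch): prefer the reading in which
# "lions" fills the predator slot. A real evaluation would use an LM.
def toy_score(sentence):
    return 1.0 if "lions are predators" in sentence else 0.0

assert resolve(item, toy_score) == item["answer"]
```

Benchmark accuracy is simply the fraction of problems for which the model's preferred option matches the gold answer; with only two options, chance performance is 50%.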

Background: The Obsolescence of WSC

The original Winograd Schema Challenge (WSC), proposed in 2011, contained only 273 problems and was long considered a reliable test of common sense. Its problems were designed to require an understanding of the world, not just simple word matching[3].

However, by 2018–2019, with the advent of large Transformer-based language models such as BERT, the situation changed. Models learned to "hack" the test, reaching accuracies of around 90% by exploiting unintentional statistical patterns (artifacts) in the data rather than through genuine understanding[4]. WSC was no longer a reliable indicator, prompting the creation of a new, harder, and larger-scale benchmark: WinoGrande.

Development and the Adversarial Filtering Method

The creation of WinoGrande involved two main stages: mass generation of problems and their subsequent filtering.

Crowdsourcing

In the first stage, a large database of over 47,000 sentences was collected using the Amazon Mechanical Turk platform. Crowdworkers created pairs of sentences following the Winograd schema, which provided linguistic diversity and the "noise" characteristic of natural speech, unlike problems written by a small group of experts[1].

The AfLite Algorithm

The key innovation of WinoGrande was the AfLite (lightweight adversarial filtering) algorithm, developed to automatically discard problems that can be solved from simple statistical cues without requiring common sense. AfLite repeatedly trains an ensemble of simple linear classifiers on random subsets of the data and removes the instances these weak models answer correctly too often, i.e. those where one of the answers is too obviously associated with other words in the sentence. For example, the problem "The lions ate the zebras because they are predators" would be filtered out, as the word "predators" is statistically strongly associated with "lions."

As a result of this filtering, about 14% of the collected data was discarded. The final version of the dataset includes 43,972 problems, making it a significantly more reliable and challenging test[1].

Model Results and Progress

When WinoGrande was released, the best models of the time scored well below human performance.

  • RoBERTa (an improved version of BERT) achieved an accuracy of ~79%.
  • Humans, on average, solve the problems with an accuracy of ~94%[1].

This gap confirmed that the AfLite filtering had successfully eliminated many of the "easy" paths for the models. However, with the development of LLMs, this gap began to shrink.

  • GPT-3 (2020) scored about 77.7% in the few-shot setting[6].
  • By 2022, the fine-tuned ST-MoE-32B model reached 96.1% accuracy, surpassing the human level[5].
  • GPT-4 (2023), without task-specific fine-tuning, solves the problems with an accuracy of ~87.5% (5-shot)[7].

Impact and Criticism

WinoGrande has become one of the key benchmarks for evaluating common sense and is regularly used to test new models. Its results are published in the technical reports of leading AI companies and on model comparison platforms[8].

At the same time, the dataset's creation methodology has become a subject of academic debate. Some researchers note that mass crowdsourcing may have led to the creation of unnatural or ambiguous phrases. Doubts have also been raised as to whether the automated AfLite filtering can completely eliminate all hidden artifacts[5]. Nevertheless, WinoGrande has stimulated not only progress in metrics but also an important discussion about creating more robust and reliable methods for AI evaluation.

Further Reading

  • Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
  • Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
  • Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
  • Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
  • Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
  • Ma, Z. et al. (2021). Dynaboard: An Evaluation‑As‑A‑Service Platform for Holistic Next‑Generation Benchmarking. arXiv:2106.06052.
  • Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
  • Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
  • Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
  • Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
  • Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
  • Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.

References

  1. Sakaguchi, K., Le Bras, R., Bhagavatula, C., Choi, Y. "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". arXiv:1907.10641.
  2. "allenai/winogrande". Hugging Face.
  3. "Winograd schema challenge". Wikipedia.
  4. Kocijan, V. et al. "The Defeat of the Winograd Schema Challenge". Artificial Intelligence.
  5. Lepore, J. "AI Has Been Surprising for Years". Carnegie Endowment for International Peace.
  6. Brown, T. et al. "Language Models are Few-Shot Learners". arXiv:2005.14165.
  7. OpenAI. "GPT-4 Technical Report". arXiv:2303.08774.
  8. "Common Sense Reasoning On Winogrande". HyperAI.