GLUE Benchmark
GLUE (an acronym for General Language Understanding Evaluation) is a multi-task benchmark for evaluating the performance of Natural Language Understanding (NLU) models. The benchmark was proposed in 2018 by a group of researchers from New York University, the University of Washington, and DeepMind, including Alex Wang and Samuel Bowman, and it has gained widespread adoption in the research community[1].
The primary goal of GLUE is to provide a single, neutral, and challenging test suite for comparing NLU models across a diverse set of tasks that go beyond any single domain. The benchmark includes an online platform with a leaderboard, which ensures objective comparison and prevents overfitting to the test data: the true labels for the test sets are not made public and are accessible only through the evaluation server. The premise is that to achieve high scores, a model must learn general-purpose language representations and transfer knowledge effectively across different types of tasks.
Benchmark Composition and Tasks
The GLUE benchmark combines nine language understanding tasks built on existing datasets, selected to be difficult for the models of the time. All tasks are formulated as classification or regression over a single sentence or a pair of sentences[1].
- CoLA (Corpus of Linguistic Acceptability) — A task to determine the grammatical acceptability of a sentence. The evaluation metric is the Matthews correlation coefficient.
- SST-2 (Stanford Sentiment Treebank) — A sentiment analysis task (positive/negative) for movie reviews. The metric is accuracy.
- MRPC (Microsoft Research Paraphrase Corpus) — A task to identify paraphrases in pairs of sentences from news sources. The metrics are accuracy and the F1-score.
- QQP (Quora Question Pairs) — A task to determine if a pair of questions from Quora are duplicates. The metrics are accuracy and the F1-score.
- STS-B (Semantic Textual Similarity Benchmark) — A semantic similarity task for pairs of sentences. The model must predict the degree of semantic closeness on a scale from 0 to 5. The metrics are the Pearson and Spearman correlation coefficients.
- MNLI (Multi-Genre Natural Language Inference) — A natural language inference task on pairs of sentences from various genres (entailment, contradiction, neutral). Results are evaluated separately on matched and mismatched subsets.
- QNLI (Question Natural Language Inference) — A task derived from the SQuAD dataset. It requires determining whether a sentence from a paragraph contains the answer to a given question.
- RTE (Recognizing Textual Entailment) — A combined textual entailment dataset aggregating several smaller collections. The task is binary classification: entailment or not entailment.
- WNLI (Winograd NLI) — A modified version of the Winograd Schema Challenge, adapted to the NLI format. This is an anaphora resolution task: given a sentence with an ambiguous pronoun and a second sentence in which the pronoun is replaced by a candidate referent, the system must decide whether the second sentence is entailed by the first.
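The Matthews correlation coefficient used for CoLA can be computed from the binary confusion matrix. The sketch below is a minimal pure-Python illustration of the metric, not the benchmark's official scoring code (which relies on standard implementations such as scikit-learn's `matthews_corrcoef`):

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels in {0, 1}; returns a value in [-1, 1]."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Degenerate case (e.g. only one class predicted): conventionally 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Perfect agreement yields 1.0; perfect disagreement yields -1.0.
print(matthews_corrcoef([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]))  # 1.0
```

Unlike plain accuracy, MCC stays near zero for a degenerate classifier on CoLA's imbalanced labels, which is why the benchmark uses it for this task.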
Evaluation Methodology
To be evaluated on GLUE, researchers submit their model's predictions to a dedicated server, after which they receive an automated calculation of metrics for each task and an aggregate score.
- GLUE-score — The final metric, computed as the macro-average of per-task scores across all nine core tasks; for tasks that report more than one metric, those metrics are averaged within the task first.
- Leaderboard — A public table that reflects the current state-of-the-art and shows which models perform best on NLU tasks. The use of hidden test sets ensures fair comparison.
- Diagnostic Set — A special set of 1100 examples, manually annotated by experts for fine-grained linguistic analysis. It does not affect the rankings but serves as a qualitative analysis tool to test which linguistic phenomena (lexical semantics, logic, common sense) a model can handle and where it struggles[1].
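The aggregation described above can be sketched in a few lines. The task scores below are illustrative placeholders, not real leaderboard results, and the within-task averaging (including averaging MNLI's matched/mismatched accuracies into one task score) is an assumption based on the macro-average scheme described in the GLUE paper:

```python
def glue_score(per_task_metrics):
    """Macro-average: mean over tasks of the mean of each task's metrics."""
    task_scores = [sum(m) / len(m) for m in per_task_metrics.values()]
    return sum(task_scores) / len(task_scores)

# Illustrative (made-up) scores for the nine tasks, on a 0-100 scale.
scores = {
    "CoLA":  [60.5],         # Matthews correlation
    "SST-2": [94.9],         # accuracy
    "MRPC":  [85.4, 89.3],   # accuracy, F1
    "QQP":   [89.3, 72.1],   # accuracy, F1
    "STS-B": [87.6, 86.5],   # Pearson, Spearman
    "MNLI":  [86.7, 85.9],   # matched, mismatched accuracy
    "QNLI":  [92.7],         # accuracy
    "RTE":   [70.1],         # accuracy
    "WNLI":  [65.1],         # accuracy
}
print(round(glue_score(scores), 1))  # 80.5
```

Note that each task contributes equally to the final score regardless of dataset size, so small tasks such as RTE and WNLI carry as much weight as MNLI.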
Results and Industry Impact
When GLUE was launched in 2018, the best models at the time (e.g., BiLSTM with ELMo) achieved an aggregate score of around 70 points (on a scale of 0–100), which was significantly below human-level performance (around 87 points)[2].
The introduction of GLUE and its public leaderboard spurred rapid progress in the field of transfer learning in NLP.
- By May 2019, in less than a year, a new generation of Transformer-based models (primarily BERT) had raised the state-of-the-art bar to 83.9 points.
- In the second half of 2019, the GLUE benchmark was effectively "solved": the top systems came very close to human-level performance, and even surpassed it on some tasks[3].
GLUE played a major role as a single point of reference in the development of language understanding models. It allowed researchers to directly compare architectures on a comprehensive set of tasks, identify the strengths and weaknesses of different approaches, and quickly share results via the public leaderboard.
SuperGLUE: Subsequent Development
The rapid success of GLUE prompted the same group of authors, in collaboration with colleagues from Facebook AI, to introduce a new, more difficult benchmark called SuperGLUE just a year later[4].
SuperGLUE was introduced in 2019 as a "stickier" set of tests, designed to re-open the gap between the capabilities of state-of-the-art models and humans. It comprises eight tasks requiring even deeper language understanding, along with improved tooling and rules for participants. Although GLUE is still used as a baseline test, the main focus of competitive improvement has shifted to SuperGLUE and other, more specialized, benchmarks.
Notes
- [1] Wang, A. et al. "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461, 2019.
- [2] Nangia, N. and Bowman, S. R. "Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark". arXiv:1905.10425, 2019.
- [3] Wang, A. et al. "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". NeurIPS 2019.
- [4] "AI models from Microsoft and Google already surpass human performance on the SuperGLUE language benchmark". VentureBeat.