Perplexity (metric)
Perplexity (PPL) in information theory and machine learning is a measurement of the uncertainty or "surprise" of a language model when predicting a sample of text. Low perplexity indicates that the model's probability distribution is a good fit for the test data, whereas high perplexity means the model predicts the sequence poorly.
Formally, the perplexity of a probability distribution is defined as the exponentiation of its entropy. For a discrete probability distribution p, it is PPL(p) = 2^{H(p)}, where H(p) = -\sum_x p(x) \log_2 p(x) is the entropy[1]. Intuitively, perplexity can be understood as the "effective" number of choices a model has to pick from at each step. If the perplexity is 100, it means the model's uncertainty is equivalent to choosing from 100 equally probable outcomes[2].
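The definition above can be checked with a small sketch in plain Python (no libraries): perplexity computed as 2 raised to the entropy, where a uniform distribution over 100 outcomes gives a perplexity of exactly 100.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def perplexity(p):
    """Perplexity = 2 ** entropy (both taken in base 2)."""
    return 2 ** entropy_bits(p)

uniform = [1 / 100] * 100
print(perplexity(uniform))   # 100 equally likely outcomes -> perplexity 100.0

skewed = [0.9, 0.05, 0.05]
print(perplexity(skewed))    # far fewer "effective" choices than 3
```

A heavily skewed distribution over 3 outcomes has a perplexity well below 3, matching the "effective number of choices" reading.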
The term was first introduced in 1977 by a group of IBM researchers led by Frederick Jelinek in the context of statistical speech recognition to quantify the "difficulty" of a task[3].
Perplexity in Language Models
In the field of Natural Language Processing (NLP), perplexity has become a standard intrinsic metric for evaluating the quality of language models. It measures how well a model predicts a sequence of words or tokens in a test dataset.
Formal Definition
For a test corpus W = w_1 w_2 … w_N and a language model P, perplexity is calculated as the inverse probability of the test corpus, normalized (as a geometric mean) by the number of words N:

PPL(W) = P(w_1 w_2 … w_N)^{-1/N}

This formula is equivalent to the exponentiation of the cross-entropy, or the average negative log-likelihood:

PPL(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1}) \right)
Minimizing perplexity is equivalent to maximizing the likelihood of the model on the test data. Therefore, a model with lower perplexity is considered statistically more accurate[4].
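A minimal sketch of both formulations, using hypothetical per-token conditional probabilities (the values are illustrative, not taken from any real model):

```python
import math

def perplexity(token_probs):
    """Perplexity as exp of the average negative log-likelihood,
    given the model's probability P(w_i | w_1..w_{i-1}) for each token."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# Hypothetical per-token probabilities for a 4-token test string.
probs = [0.2, 0.1, 0.5, 0.05]
ppl = perplexity(probs)

# Equivalent "inverse probability, normalized by N" form: (prod p_i) ** (-1/N)
ppl_geo = math.prod(probs) ** (-1 / len(probs))
print(ppl, ppl_geo)  # the two formulations agree
```

The two printed values coincide up to floating-point error, since exp(mean of -log p_i) is exactly the inverse geometric mean of the p_i.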
Historical Application and Modern LLMs
Historically, perplexity was widely used to evaluate statistical n-gram models. For example, on the Wall Street Journal corpus, a unigram model (which only considers word frequencies) has a perplexity of about 962, while a trigram model (which conditions on the two preceding words) has a perplexity of about 109[5]. This sharp decrease demonstrates how much better the context-aware model captures linguistic patterns.
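The same effect can be reproduced in miniature with a toy corpus and add-one smoothing (the corpus and the resulting numbers are invented for illustration): even a bigram model achieves lower perplexity than a unigram model on in-domain text.

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat ate the fish".split()
test = "the cat sat on the mat".split()
vocab = set(train)

# Unigram model with add-one (Laplace) smoothing.
uni = Counter(train)
def p_uni(w):
    return (uni[w] + 1) / (len(train) + len(vocab))

# Bigram model with add-one smoothing.
bi = Counter(zip(train, train[1:]))
def p_bi(prev, w):
    return (bi[(prev, w)] + 1) / (uni[prev] + len(vocab))

def ppl(logps):
    """Perplexity from a list of per-token log-probabilities."""
    return math.exp(-sum(logps) / len(logps))

uni_ppl = ppl([math.log(p_uni(w)) for w in test])
bi_ppl = ppl([math.log(p_bi(prev, w)) for prev, w in zip(test, test[1:])])
print(uni_ppl, bi_ppl)  # bigram perplexity is lower: context helps
```

The gap between the two models here is modest because the corpus is tiny; on real corpora such as WSJ the reduction from added context is far more dramatic.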
With the development of large language models (LLMs), perplexity has maintained its role as a fundamental benchmark. Researchers report perplexity on standard test sets (e.g., WikiText) as a measure of the model's "fluency." For instance, the OpenAI paper on GPT-2 states that the model with ~117M parameters achieves a perplexity of about 37 on the WikiText-103 corpus[6]. A decrease in perplexity generally correlates with an improvement in model quality, making the metric a convenient indicator of progress during training and optimization.
Limitations and Interpretation of the Metric
Although low perplexity indicates a high likelihood of the data according to the model, this metric has several significant limitations and does not always correlate with the actual quality of the generated text.
- Low perplexity ≠ high quality. Perplexity measures a model's confidence in its predictions, not their correctness or factuality. A model can be confidently wrong, generating nonsensical but statistically probable text (for example, by repeating very common words and phrases)[2].
- Sensitivity to data and tokenization. Perplexity is not well-suited for directly comparing models with different architectures, vocabularies, or tokenization methods. For example, a character-level model might have a numerically lower perplexity than a word-level model, but this does not mean it is better at solving language tasks[2].
- Inability to evaluate semantics and long context. Perplexity is a local metric that evaluates the prediction of the next token. It correlates poorly with a model's ability to capture long-range dependencies and semantic context over large distances. A 2024 study (Hu et al.) showed that an LLM's ability to understand long texts (up to 100k tokens) is barely reflected in the perplexity metric[7].
- Susceptibility to manipulation. The metric can be "gamed." An overfitted model will show an artificially low perplexity on data it has "memorized." Research (Wang et al., 2022) has also shown that duplicating text fragments, or even omitting a period at the end of a sentence, can substantially lower or raise perplexity without any corresponding change in the actual quality of the text[8].
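The first limitation above can be demonstrated with a toy unigram model (the corpus is invented): a degenerate repetition of the single most frequent word receives lower perplexity than a fluent sentence, even though it is nonsense.

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat ate the fish".split()
vocab = set(train)
uni = Counter(train)

def p_uni(w):
    # Unigram probability with add-one smoothing.
    return (uni[w] + 1) / (len(train) + len(vocab))

def ppl(words):
    """Unigram perplexity of a word sequence."""
    return math.exp(-sum(math.log(p_uni(w)) for w in words) / len(words))

fluent = "the cat sat on the mat".split()
degenerate = ["the"] * 6  # nonsense, but built from the most common word

print(ppl(fluent), ppl(degenerate))  # the degenerate text scores LOWER perplexity
```

The model is maximally "confident" about the repetitive text precisely because it is statistically probable, illustrating why low perplexity does not imply high-quality output.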
Conclusion: The Role of Perplexity Today
Given these limitations, in modern practice, perplexity is considered an auxiliary, preliminary indicator of a language model's quality. It remains a valuable tool for quick evaluation and debugging of models, as it is task-agnostic and easy to compute[2].
However, for a comprehensive evaluation of an LLM, perplexity alone is not sufficient. Today, it is typically supplemented with extrinsic metrics tied to specific tasks, such as:
- Accuracy on question-answering tasks;
- Human evaluation;
- BLEU/ROUGE for machine translation and summarization.
In conjunction with these methods, perplexity continues to play an important role as an objective measure of a model's "surprise," but its results must always be interpreted with its limitations in mind[2].
Links
Literature
- Jelinek, F., Bahl, L. R., & Mercer, R. L. (1977). Perplexity — a Measure of the Difficulty of Speech Recognition Tasks. The Journal of the Acoustical Society of America, 62(S1), S63.
- Jurafsky, D., & Martin, J. H. (2021). Speech and Language Processing (3rd ed. draft, Ch. 3: N-gram Language Models).
- Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI white paper.
- Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
- Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
- Wang, C. et al. (2022). Perplexity by PLM Is Unreliable for Evaluating Text Quality. arXiv:2210.05892.
- Meister, C., & Cotterell, R. (2021). Language Model Evaluation Beyond Perplexity. arXiv:2106.00085.
- Hu, Y. et al. (2024). Can Perplexity Reflect Large Language Model's Ability in Long Text Understanding?. arXiv:2405.06105.
- Lazaridou, A. et al. (2021). Mind the Gap: Assessing Temporal Generalization in Neural Language Models. NeurIPS 2021.
Notes
1. "Perplexity". Wikipedia.
2. Morgan, Abby. "Perplexity for LLM Evaluation". Comet AI Blog, 21 Nov 2023.
3. "README.md · evaluate-measurement/perplexity". Hugging Face.
4. Jurafsky, Dan, and James H. Martin. Speech and Language Processing, 3rd ed., Chapter 3: N-gram Language Models, draft (2021).
5. Jurafsky, Dan, and James H. Martin. Speech and Language Processing, 3rd ed., Chapter 3: N-gram Language Models, draft (2021).
6. "Perplexity number of wikitext-103 on gpt-2 don't match the paper". GitHub, huggingface/transformers, Issue #483.
7. Hu, H., et al. "Can Perplexity Reflect Large Language Model's Ability in Long Text Understanding?". arXiv:2405.06105 [cs.CL], May 10, 2024.
8. Wang, C., et al. "Perplexity by PLM Is Unreliable for Evaluating Text Quality". arXiv:2210.05892 [cs.CL], Oct 12, 2022.