BLEU (Bilingual Evaluation Understudy)

From Systems Analysis Wiki

BLEU (Bilingual Evaluation Understudy) is an algorithm for automatically evaluating the quality of machine-translated text. The evaluation is performed by comparing a candidate translation with one or more reference human translations[1]. Quality is determined by the degree of lexical similarity between the machine translation and a professional translation. As the authors noted, "the closer a machine translation is to a professional human translation, the better it is"[2].

The method was proposed in 2002 by a group of IBM researchers led by Kishore Papineni and became one of the first metrics to show a high correlation with the judgments of expert human translators. BLEU quickly gained popularity due to its ease of calculation, language independence, and good correlation with human judgment at the text corpus level[1].

How BLEU is Calculated

BLEU evaluates a translation by counting matching n-grams (sequences of n words) between the candidate translation and the reference translations.

1. Modified n-gram Precision

First, for n-grams of different lengths (usually from 1 to 4), their precision (pₙ) is calculated. This is the fraction of n-grams in the candidate translation that also appear in the reference translations[3]. The count of each candidate n-gram is clipped to the maximum number of times it occurs in any single reference, to avoid inflating the score by repeating the same word.
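The clipped counting described above can be sketched in a few lines of Python (a minimal illustration, not a reference implementation; the names `ngrams` and `modified_precision` are chosen here for clarity):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum count over all references (the clip).
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

The original paper's degenerate example shows why clipping matters: the candidate "the the the the the the the" scored against the reference "the cat is on the mat" gets a unigram precision of only 2/7, because "the" is clipped to its two occurrences in the reference.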

2. Aggregation and Geometric Mean

To obtain a single score, the precisions for 1-, 2-, 3-, and 4-grams are aggregated using a geometric mean: (p₁ · p₂ · p₃ · p₄)^(1/4). This is done so that low precision for one type of n-gram (e.g., 4-grams) significantly impacts the final score, reflecting poor quality in longer phrases.
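Computed in log space, the equal-weight geometric mean looks like this (a small sketch; the zero guard reflects that log 0 is undefined, so any single zero precision drives the whole score to zero):

```python
import math

def geometric_mean(precisions):
    """Equal-weight geometric mean of n-gram precisions.

    Returns 0.0 if any precision is zero, since the product of the
    precisions (and hence the mean) collapses to zero in that case.
    """
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))
```

This collapse-to-zero behavior is exactly why smoothed BLEU variants exist for sentence-level scoring, where 4-gram matches are often absent.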

3. Brevity Penalty

To prevent inflated scores for translations that are too short but precise, BLEU introduces a Brevity Penalty (BP). If the candidate translation (of length c) is shorter than the reference translation (of length r), the final BLEU score is reduced. The penalty is calculated using the formula:

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r.
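A direct transcription of this formula (a sketch under one assumption worth noting: with multiple references, the original paper uses the "effective reference length", i.e. the reference closest in length to the candidate, as r):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is longer than the reference,
    otherwise exp(1 - r/c). Equal lengths give exp(0) = 1."""
    if candidate_len > reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0  # avoid division by zero for an empty candidate
    return math.exp(1 - reference_len / candidate_len)
```

For example, a 5-word candidate against a 10-word reference is penalized by e^(1 − 2) = e⁻¹ ≈ 0.368, while any candidate at least as long as the reference is not penalized at all.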

4. Final BLEU Formula

The final BLEU score is calculated as the product of the brevity penalty and the geometric mean of the n-gram precisions[4]:

BLEU = BP · exp(∑ₙ₌₁ᴺ wₙ log pₙ)

where N is the maximum n-gram length (usually 4), and wₙ are the weights (usually 1/N).
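Putting the pieces together, a self-contained sentence-level sketch might look as follows (illustrative only; production use should rely on a standardized implementation such as SacreBLEU, and practical sentence-level scoring usually adds smoothing so that one zero precision does not zero the entire score):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/N.

    `candidate` is a token list; `references` is a list of token lists.
    """
    # Modified n-gram precisions p_n with clipped counts.
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0  # candidate too short to contain any n-grams
        max_ref = Counter()
        for ref in references:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses to zero
    # Brevity penalty, using the reference closest in length to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # BLEU = BP * exp(sum of (1/N) * log p_n)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores exactly 1.0, and a candidate sharing no words with any reference scores 0.0, matching the score range described below.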

The BLEU score ranges from 0 to 1 (often multiplied by 100 and expressed as a percentage). The closer the result is to 1 (or 100%), the more "human-like" the translation is considered to be.

Application and Significance

Since its publication, the BLEU metric has become the de facto standard for evaluating machine translation (MT) systems. It helped overcome a bottleneck in the development of MT systems: the time-consuming and expensive process of manual evaluation. Developers gained the ability to quickly measure the impact of changes to their models and promptly discard unsuccessful solutions[2].

BLEU correlates well with human judgments at the corpus level but is unreliable for evaluating individual sentences[3]. Therefore, the metric has been widely used in standardized MT competitions (such as NIST and WMT) to compare systems.

Limitations and Criticism

Despite its widespread adoption, BLEU has several significant limitations:

  • Lack of Semantic Evaluation: BLEU only measures surface-level word overlap and cannot assess whether the meaning of the source text has been correctly conveyed. A translation can receive a high score but be grammatically incorrect or distort the meaning[5].
  • Ignores Synonyms and Paraphrasing: The algorithm penalizes translations that use synonyms or different phrasing than the reference, even if they are perfectly correct. Using multiple references mitigates but does not completely solve this problem.
  • Sensitivity to Tokenization: BLEU scores are highly dependent on how the text is split into tokens. Different tokenizer implementations can yield different values, making comparisons between models unreliable. The SacreBLEU tool was proposed to standardize the metric's computation and address this issue[1].
  • Difficulty with Certain Languages: BLEU performs poorly with languages that do not have clear word delimiters (such as Chinese or Japanese) without prior segmentation.

Alternatives and Modern Approaches

Over time, new automatic metrics were proposed to overcome the shortcomings of BLEU:

  • METEOR: Accounts for synonym matches, stemming, and word order.
  • ROUGE: Used for evaluating text summarization, focusing on recall rather than precision.
  • Learned Metrics: Modern approaches that use machine learning models to account for semantic similarity. Metrics such as BLEURT and COMET show a significantly higher correlation with human judgments than classic BLEU.

By the 2020s, BLEU had lost its status as the undisputed standard, giving way to more accurate methods[6]. Nevertheless, it remains an important milestone in the history of MT evaluation and continues to be used as a baseline for measuring progress.

Notes

  1. "BLEU". Wikipedia. [1]
  2. Papineni, Kishore, et al. "Bleu: a Method for Automatic Evaluation of Machine Translation". Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002. [2]
  3. "BLEU". MT Companion 4.0 documentation. [3]
  4. Callison-Burch, Chris, et al. "Re-evaluating the Role of BLEU in Machine Translation Research". Proceedings of the EACL 2006 Workshop on Statistical Machine Translation, 2006. [4]
  5. Cardete, Jorge. "Beyond BLEU Score. When it comes to the nuanced world of...". The Deep Hub | Medium. [5]
  6. "Chief Digital and Artificial Intelligence Office > Lexicon". ai.mil. [6]