BERTScore (metric)

From Systems Analysis Wiki

BERTScore is an automatic metric for evaluating the quality of generated text, based on measuring semantic similarity using contextual embeddings from pre-trained language models like BERT. The metric was proposed in 2019 by a group of researchers led by Tianyi Zhang in the paper "BERTScore: Evaluating Text Generation with BERT".[1]

Unlike traditional metrics such as BLEU and ROUGE, which are based on exact n-gram matching, BERTScore can identify semantic equivalence even when words and phrasing differ, taking into account synonyms and paraphrases.[2]
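The difference can be seen on a toy paraphrase pair. The snippet below contrasts exact unigram overlap (the BLEU-1 idea, without brevity penalty) with greedy embedding matching; the embedding values are hypothetical hand-picked vectors used only for illustration, whereas real BERTScore uses contextual vectors from a transformer:

```python
import numpy as np

# Two paraphrases whose content words differ.
reference = "the movie was excellent".split()
candidate = "the film was great".split()

# Exact-match unigram precision: only "the" and "was" overlap.
overlap = sum(tok in reference for tok in candidate) / len(candidate)

# Hypothetical static embeddings in which synonyms point the same way
# (real BERTScore extracts contextual vectors from BERT instead).
emb = {
    "the": [1.0, 0.0, 0.0], "was": [0.0, 1.0, 0.0],
    "movie": [0.0, 0.0, 1.0], "film": [0.0, 0.0, 1.0],
    "excellent": [0.6, 0.0, 0.8], "great": [0.6, 0.0, 0.8],
}
ref_emb = np.array([emb[t] for t in reference])
cand_emb = np.array([emb[t] for t in candidate])

# Vectors are unit-length, so a dot product is the cosine similarity.
sim = cand_emb @ ref_emb.T
soft_precision = sim.max(axis=1).mean()  # greedy matching, as in BERTScore
```

Exact matching scores this pair 0.5 because "film"/"movie" and "great"/"excellent" share no surface form, while the embedding-based score treats the synonyms as perfect matches.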

Calculation Method

The BERTScore method consists of several steps:

  1. Obtaining Contextual Embeddings: Both texts (the reference and the generated text) are tokenized and passed through a pre-trained transformer model (e.g., BERT or RoBERTa). A contextual vector representation (embedding) is extracted for each token.
  2. Calculating Cosine Similarity: Cosine similarity is calculated for all pairs of tokens from the two texts, forming a token similarity matrix.[3]
  3. Calculating Precision, Recall, and F1-Score: From the similarity matrix, each token in the generated text is greedily matched to its most similar token in the reference, which gives precision; symmetrically, each reference token is matched to its closest token in the generated text, which gives recall. Writing x for the tokenized reference and y for the tokenized candidate, with each token represented by a unit-length embedding, the final BERTScore value is the balanced F₁-score combining the two:

   R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} x_i^\top y_j   (Recall)

   P_{BERT} = \frac{1}{|y|} \sum_{y_j \in y} \max_{x_i \in x} x_i^\top y_j   (Precision)

   F_{BERT} = 2 \cdot \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}
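The three steps above can be sketched in a few lines of NumPy. This is an illustrative implementation of the greedy-matching formulas, not the reference code; it assumes the token embeddings have already been extracted and L2-normalized:

```python
import numpy as np

def bertscore_from_embeddings(ref_emb, cand_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    ref_emb:  (n_ref, d) array, one L2-normalized embedding per reference token.
    cand_emb: (n_cand, d) array, the same for the candidate (generated) text.
    """
    # For unit-length vectors, cosine similarity reduces to a dot product.
    sim = ref_emb @ cand_emb.T             # token similarity matrix (n_ref, n_cand)
    recall = sim.max(axis=1).mean()        # best candidate match per reference token
    precision = sim.max(axis=0).mean()     # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-D "embeddings" (hypothetical values, not real BERT vectors):
# the candidate covers one of the two reference tokens perfectly.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
cand = np.array([[1.0, 0.0]])
p, r, f = bertscore_from_embeddings(ref, cand)  # p = 1.0, r = 0.5
```

In the toy example every candidate token has a perfect reference match (precision 1.0), but only half of the reference is covered (recall 0.5), and the F₁-score balances the two at 2/3.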

The metric is flexible: different pre-trained models can be chosen, tokens can be weighted by their importance (using IDF weights), and the scores can be linearly transformed for better interpretability.[3]
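Both refinements are small modifications of the basic formulas. As a sketch (the IDF weights and baseline value below are made-up illustration numbers, not values from the paper), IDF weighting replaces the plain mean over reference tokens with a weighted average, and the linear transformation maps raw scores onto a more readable range:

```python
import numpy as np

def idf_weighted_recall(sim, ref_idf):
    """Recall where each reference token's best match is weighted by its IDF.

    sim:     (n_ref, n_cand) token cosine-similarity matrix.
    ref_idf: (n_ref,) IDF weight per reference token (rare tokens weigh more).
    """
    best = sim.max(axis=1)                       # best candidate match per reference token
    return float((ref_idf * best).sum() / ref_idf.sum())

def rescale(score, baseline):
    """Linear rescaling: the empirical baseline b maps to 0, a perfect match to 1."""
    return (score - baseline) / (1.0 - baseline)

# Hypothetical similarity matrix and IDF weights for a 2-token reference.
sim = np.array([[0.9, 0.2],
                [0.3, 0.8]])
idf = np.array([2.0, 1.0])                 # the rarer first token counts double
r_w = idf_weighted_recall(sim, idf)        # (2*0.9 + 1*0.8) / 3
```

The rescaling does not change how texts are ranked; it only spreads the typically narrow raw-score range (cosine similarities of unrelated sentences are well above zero) into a more interpretable scale.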

Application and Effectiveness

BERTScore is used to evaluate quality in various text generation tasks:

  • Machine Translation: It captures the preservation of meaning, even if the translation's phrasing differs from the reference.
  • Automatic Summarization: It can determine that different phrasings convey the same key facts, making it more flexible than ROUGE.
  • Dialogue Systems: It helps measure the appropriateness of a response by comparing it to a reference at the semantic level.

A large-scale evaluation conducted by the authors showed that BERTScore's correlation coefficient with human judgments is significantly higher than that of metrics like BLEU and ROUGE. Additionally, the metric demonstrated enhanced robustness to complex cases of paraphrasing.[1]

Advantages

  • Semantic Awareness: Compares texts at the meaning level, accounting for synonyms and paraphrases.
  • High Correlation with Human Judgment: BERTScore's evaluations align better with human assessments of text quality than traditional metrics.
  • Versatility and Portability: The metric is not tied to a specific language or task; one only needs to select an appropriate pre-trained model.
  • No Training Required: BERTScore is a non-trainable metric, unlike learned metrics (e.g., BLEURT) that must first be fine-tuned on corpora of human quality judgments.
  • Integration of Modern Models: It leverages the power of transformers to extract deep contextual features.

Limitations and Criticism

  • High Computational Cost: Calculation based on embeddings requires significantly more resources than n-gram counting and often necessitates the use of a GPU.[2]
  • Model Dependency: The quality of the evaluation is directly linked to the quality of the pre-trained model. The choice of model and the layer from which embeddings are extracted affects the result, which can lead to reproducibility issues.[4]
  • Lack of Factual and Structural Awareness: BERTScore focuses on local semantic similarity and does not guarantee an understanding of text structure or factual accuracy. A text with rearranged phrases or factual errors can still receive a high score.[3]
  • Low Interpretability: Compared to BLEU/ROUGE, BERTScore is less transparent: its scores are harder to trace back to specific token matches, which complicates error analysis.
  • Social Biases: The metric inherits the stereotypes and biases embedded in the pre-trained models. A 2022 study showed that language-model-based metrics (including BERTScore) exhibit significantly more social bias than traditional n-gram metrics.[5]

Significance and Role in Evaluation

BERTScore represents an important step in the evolution of text generation evaluation methods, as it allows for the consideration of semantic equivalence rather than just lexical overlap. Although no single automatic metric can perfectly measure text quality, BERTScore has established itself as a reliable tool that complements, rather than completely replaces, classic approaches like BLEU and ROUGE.

In practice, BERTScore is often used in combination with manual expert evaluation and other metrics to gain a more complete and in-depth understanding of how successfully models generate coherent and semantically appropriate texts.[2]

References

  1. Zhang, Tianyi, et al. "BERTScore: Evaluating Text Generation with BERT." arXiv:1904.09675 [cs.CL], 22 Apr. 2019.
  2. "BERTScore: New Metrics for Language Models." Analytics Vidhya.
  3. Sojasingarayar, Abonia. "BERTScore Explained in 5 minutes." Medium.
  4. Alakulju, D., et al. "Reproducibility of BERTScore." Theseus.fi.
  5. Sun, Tianxiang, et al. "BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation." arXiv:2210.07626 [cs.CL], 14 Oct. 2022.