METEOR (metric)
METEOR is a name used in the field of natural language processing (NLP) for several unrelated concepts that happen to share it. Primarily, it denotes a well-known automatic metric for evaluating the quality of machine translation. In addition, two independent research projects related to large language models (LLMs) were introduced under the same name in 2024: an evolutionary training method and a multimodal language model.
METEOR as a machine translation evaluation metric
METEOR (an acronym for Metric for Evaluation of Translation with Explicit ORdering) is an automatic metric for evaluating the quality of machine translation, proposed in 2005 by Carnegie Mellon University researchers Satanjeev Banerjee and Alon Lavie[1]. Its goal was to improve the correlation of automatic evaluations with human judgments, especially at the sentence level, by addressing some of the shortcomings of the earlier BLEU metric.
Key features of the METEOR metric:
- Considers both precision and recall: Unlike BLEU, which is precision-oriented, METEOR computes a recall-weighted harmonic mean of unigram precision and recall, so translations are penalized for omitting important words.
- Flexible word matching: METEOR matches the candidate translation against the reference using linguistic knowledge, counting not only exact matches but also different word forms (through stemming) and synonyms (using WordNet).
- Penalty for incorrect word order: The metric applies a fragmentation penalty that grows as matched words become scattered out of order, so a candidate can lose points even when every word matches the reference.
These improvements allow the METEOR metric to correlate significantly better with human judgments than BLEU does[2]. The metric is widely used in research on machine translation, automatic summarization, and image captioning evaluation[3].
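The components above combine, in the original 2005 formulation, as F_mean = 10PR/(R + 9P) and a fragmentation penalty 0.5 · (chunks/matches)³, giving score = F_mean · (1 − penalty). The following is a minimal sketch using exact unigram matches and a greedy alignment only; the full metric also matches stems and WordNet synonyms, and searches for the alignment with the fewest chunks:

```python
def meteor_exact(candidate: str, reference: str) -> float:
    """Simplified METEOR score: exact unigram matches, greedy
    alignment, original 2005 parameters (no stems or synonyms)."""
    cand = candidate.split()
    ref = reference.split()
    # Greedy one-to-one alignment: each candidate word takes the
    # first unused identical reference word.
    used = set()
    alignment = []  # (candidate_index, reference_index) pairs
    for i, word in enumerate(cand):
        for j, ref_word in enumerate(ref):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)  # number of matched unigrams
    if m == 0:
        return 0.0
    precision = m / len(cand)
    recall = m / len(ref)
    # Recall-weighted harmonic mean (recall weighted 9:1).
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A "chunk" is a maximal run of matches that is contiguous and
    # in identical order in both sentences.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

Note that even a perfect match incurs a tiny penalty (one chunk over m matches), while a candidate with the right words in scrambled order keeps perfect precision and recall but loses up to half its score to fragmentation.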
METEOR as an evolutionary training method for LLMs
In 2024, a group of Chinese researchers introduced a method called METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth[4]. This method is designed for efficiently training LLMs that specialize in narrow subject domains (e.g., finance, medicine) without needing to train the model from scratch.
The authors describe a three-phase "evolution" scheme for the LLM:
- Weak-to-strong data distillation: A more powerful "teacher" model (e.g., GPT-4) is used to generate the training corpus. The domain-specific model first generates a solution plan, and the stronger model creates the answer following this plan. This aligns the knowledge distribution and allows the target model to absorb it more effectively.
- Guided iterative training: The model trained in the first phase solves tasks independently, while the strong model acts as a "referee," evaluating the answers and pointing out errors. This reflective cycle develops the domain model's ability for self-correction.
- Self-evolution: The model continues to improve without an external supervisor, using its accumulated skills to generate and correct new data.
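The three phases can be sketched as a training loop. Everything below is an illustrative stand-in invented for this example (the `ToyModel` class and its method names are not the authors' API); "finetuning" here just records training pairs:

```python
class ToyModel:
    """Stands in for an LLM: answers are canned strings and
    'finetuning' simply accumulates training pairs."""
    def __init__(self, name):
        self.name = name
        self.training_data = []
    def generate_plan(self, task):
        return f"plan({task})"
    def answer(self, task, plan=None):
        return f"answer({task})"
    def critique(self, task, answer):
        return f"corrected({answer})"   # referee's corrected answer
    def self_correct(self, task, answer):
        return f"revised({answer})"
    def finetune(self, pairs):
        self.training_data.extend(pairs)

def evolve(student, teacher, tasks):
    # Phase 1: weak-to-strong distillation -- the student drafts a
    # solution plan, the stronger teacher writes the answer to it.
    student.finetune([(t, teacher.answer(t, plan=student.generate_plan(t)))
                      for t in tasks])
    # Phase 2: guided iterative training -- the student answers on
    # its own; the teacher acts as a referee supplying corrections.
    student.finetune([(t, teacher.critique(t, student.answer(t)))
                      for t in tasks])
    # Phase 3: self-evolution -- no external supervisor; the student
    # revises its own outputs and trains on the revisions.
    student.finetune([(t, student.self_correct(t, student.answer(t)))
                      for t in tasks])
    return student

expert = evolve(ToyModel("student"), ToyModel("teacher"), ["task-1", "task-2"])
```

The structural point the sketch captures is the shrinking role of the teacher: it produces the data in phase 1, only critiques in phase 2, and is absent in phase 3.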
This method offers a practical approach to creating compact and cost-effective LLM experts for specific industries[5].
METEOR as a multimodal LLM
Also in 2024, a team of researchers from KAIST introduced a large multimodal language model named METEOR: Mamba-based Traversal of Rationales[6]. The model is designed for comprehensive understanding of visual information and generating answers to visual questions.
A key feature of METEOR is its use of detailed rationales. The model does not just provide a final answer; it generates and relies on a hidden "chain of thought"—a sequential explanation of how to arrive at the answer, similar to how a human would reason.
The METEOR architecture employs a special module based on the Mamba model—an efficient architecture for processing very long sequences. This module encodes long chains of reasoning, which can include descriptions of objects in an image, their spatial relationships, and the steps required to solve the task[7].
The model was successfully tested on complex multimodal benchmarks such as MME, AI2D (diagram understanding), and MathVista (solving mathematical problems in a visual context). It demonstrated high performance without requiring additional external computer vision modules, indicating efficient use of its own parameters[7].
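The inference flow this section describes can be caricatured in a few lines. The toy below is not Meteor's real architecture: a constant-memory recurrent scan (the role the Mamba module plays) folds an arbitrarily long rationale into a fixed-size state, and the answer is conditioned on that state rather than on the raw chain of reasoning; all names and the scalar "state" are invented for illustration:

```python
def scan(tokens, decay=0.9):
    """Fold a token sequence into a single fixed-size state, one
    token per step, mimicking a state-space model's linear-time,
    constant-memory scan (here the state is just one float)."""
    state = 0.0
    for tok in tokens:
        state = decay * state + (1 - decay) * (sum(map(ord, tok)) % 97)
    return state

def answer(question, rationale):
    """Condition the final answer on the compressed rationale state,
    not on the full (possibly very long) reasoning text."""
    state = scan(rationale.split())
    return f"{question} -> state={state:.2f}"
```

The design point being illustrated: because the scan's memory cost does not grow with rationale length, the rationale can spell out objects, spatial relations, and solution steps at length without inflating the context the answer generator must attend over.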
References
- [1] Banerjee, S., and A. Lavie. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT, 2005.
- [2] Lavie, A., and A. Agarwal. "METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments". ACL Workshop on Statistical Machine Translation, 2007.
- [3] "Evaluating Large Language Models: Powerful Insights Ahead". DataScienceDojo.
- [4] Li, J., X. Xu, and Y. Gao. "METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth". arXiv preprint arXiv:2411.11933, 2024.
- [5] Li, J., X. Xu, and Y. Gao. "METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth". ar5iv.org.
- [6] Lee, B.-K., et al. "Meteor: Mamba-based Traversal of Rationales for Large Language and Vision Models". NeurIPS 2024 (poster).
- [7] Lee, B.-K., et al. "Meteor: Mamba-based Traversal of Rationales for Large Language and Vision Models". arXiv preprint arXiv:2405.15574, 2024.