MT-Bench (benchmark)
MT-Bench (short for Multi-Turn Benchmark) is a benchmark for evaluating large language models (LLMs) in multi-turn conversational settings. It was proposed in 2023 by a team of researchers from LMSYS (led by Lianmin Zheng) as part of the LLM-as-a-Judge method for systematically comparing the quality of chatbots[1].
Unlike traditional single-turn tests (such as MMLU), MT-Bench evaluates a model's ability to conduct a multi-step conversation, sequentially process new inputs, and accurately follow user instructions. The goal is to provide a more realistic evaluation of chatbot performance in complex scenarios, focusing on alignment with human preferences and the practical requirements of conversational systems[2].
Rationale
The development of conversational LLMs like ChatGPT, GPT-4, and Vicuna revealed a gap between traditional quality metrics and actual user perception of their responses. It became clear that improving a model's alignment with human instructions (through RLHF) did not always lead to higher scores on older, single-turn benchmarks. Tests like MMLU or HELM often failed to distinguish between improved ("aligned") chatbots and their base models. This highlighted the limitations of previous methods, which did not capture the quality of multi-turn interactions and open-ended instructions.
MT-Bench was created in response to this problem, offering a set of open-ended, dialogue-based questions that focus on two aspects: 1. The model's ability to maintain a coherent conversation over several steps (turns). 2. The model's ability to accurately follow complex user instructions[1].
Benchmark Structure and Content
MT-Bench consists of 80 carefully selected multi-turn conversational scenarios covering various types of tasks. Each scenario includes a series of exchanges between the user and the model, testing the model's ability to maintain context and adapt to new inputs. The dialogues are grouped into 8 task categories:
- Writing — testing creative skills (e.g., writing a blog post).
- Roleplay — simulating dialogues in specific roles.
- Extraction — the ability to extract facts from a given context.
- Reasoning — solving logical thinking problems.
- Math — solving mathematical problems.
- Coding — writing or debugging code.
- STEM — questions from science, technology, engineering, and mathematics.
- Humanities — questions related to history, literature, and social sciences.
Each category contains 10 dialogue tasks. The tasks intentionally include tricky follow-ups (such as sudden clarifying questions) to stress-test the model under conditions that approximate a real conversation[3].
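The structure above can be sketched as a simple data model. This is an illustrative sketch, not the official loader: the field names (`question_id`, `category`, `turns`) follow the question format released with the benchmark, but the example question and the validation helper are assumptions made here for clarity.

```python
from dataclasses import dataclass

@dataclass
class MTBenchQuestion:
    question_id: int
    category: str     # one of the 8 task categories listed above
    turns: list[str]  # the user's messages, in order (two per scenario)

# The 8 task categories, 10 scenarios each (80 total).
CATEGORIES = ["writing", "roleplay", "extraction", "reasoning",
              "math", "coding", "stem", "humanities"]

# Hypothetical example in the spirit of the "writing" category.
example = MTBenchQuestion(
    question_id=81,
    category="writing",
    turns=[
        "Compose a short blog post about a recent trip to Hawaii.",
        "Rewrite your previous response, starting every sentence with the letter A.",
    ],
)

def is_valid(q: MTBenchQuestion) -> bool:
    """Sanity-check a record against the structure described above."""
    return q.category in CATEGORIES and len(q.turns) == 2
```

The second turn deliberately depends on the first, which is what forces the model to retain conversational context rather than answer each prompt in isolation.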
Evaluation Method: LLM-as-a-Judge
A key feature of MT-Bench is the use of a powerful language model as a judge for automated response evaluation (LLM-as-a-Judge). In the original paper, the GPT-4 model was used for this role[1].
The evaluation procedure is structured as follows: 1. For each dialogue scenario, the participating models generate responses. 2. The judge model (GPT-4) evaluates these responses, either by pairwise comparison (determining the preferred answer) or by grading each answer on a 10-point scale.
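The pairwise mode, combined with the position-swap mitigation discussed below, can be sketched as follows. This is a minimal illustration, not the official implementation: `call_judge` is a placeholder for a real API call to the judge model, and the prompt wording is an assumption.

```python
# Hedged sketch of pairwise LLM-as-a-Judge with position swapping.
# `call_judge(prompt)` is a hypothetical function that sends the prompt to
# the judge model and returns "A", "B", or "tie".

JUDGE_TEMPLATE = (
    "You are an impartial judge. Given the user question and two answers, "
    "reply with 'A' if Assistant A's answer is better, 'B' if Assistant B's "
    "answer is better, or 'tie'.\n\n"
    "Question: {question}\n\nAssistant A: {answer_a}\n\nAssistant B: {answer_b}"
)

def judge_pair(question, answer_1, answer_2, call_judge):
    """Query the judge twice with the answers in both orders; only a
    preference that survives the swap counts, otherwise it is a tie."""
    first = call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2))
    second = call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1))
    if first == "A" and second == "B":
        return "model_1"  # preferred in both orderings
    if first == "B" and second == "A":
        return "model_2"
    return "tie"          # inconsistent verdicts or an explicit tie
```

Note what the swap buys: a judge that always prefers whichever answer appears first would return "A" in both calls, and the inconsistency is collapsed into a tie instead of rewarding positional bias.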
This automated judging replaces laborious manual annotation. The researchers demonstrated that GPT-4's judgments agree with those of human experts more than 80% of the time, which is comparable to the level of agreement among the human annotators themselves. This indicates the method's reliability and its potential for scaling evaluations without direct human involvement. To improve objectivity, potential biases of the judge model were identified and mitigated, such as positional bias (preferring the response shown first), verbosity bias (preferring longer responses), and self-enhancement bias (favoring its own responses)[1].
Results and Application
MT-Bench revealed significant differences in the capabilities of modern models. In the reasoning, math, and coding categories, GPT-4 substantially outperformed earlier models such as GPT-3.5, and stronger models also proved better at maintaining context across multiple dialogue turns.
Alongside the benchmark, the LMSYS team launched a public leaderboard where models are ranked by their average MT-Bench score and their Elo rating from the Chatbot Arena. The leaderboard is regularly updated to reflect progress in the field. The dataset and the code for running the benchmark have been made publicly available, allowing independent developers to evaluate their own models[2].
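The "average MT-Bench score" used for ranking can be illustrated with a short sketch. This is not the official scoring code; the helper and the sample grades are hypothetical, assuming the 10-point single-answer grading mode in which a model's score is the mean of the judge's grades over all evaluated turns.

```python
from statistics import mean

def mt_bench_score(grades: dict[str, list[float]]) -> float:
    """grades maps each task category to the judge's 1-10 scores for that
    category's turns; the overall score averages across all of them."""
    all_scores = [s for scores in grades.values() for s in scores]
    return round(mean(all_scores), 2)

# Hypothetical per-category grades for one model (two graded turns each).
example_grades = {
    "writing":   [9.0, 8.5],
    "reasoning": [6.0, 5.5],
    "math":      [7.0, 4.5],
}
```

Here `mt_bench_score(example_grades)` averages the six grades to 6.75, matching the intuition that strong writing performance can mask weak math and reasoning in a single aggregate number.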
Limitations and Criticism
Despite its successful application, MT-Bench and the LLM-as-a-Judge approach have several limitations:
- Imperfect Judge. The judge model (e.g., GPT-4) is not infallible: it cannot always recognize factual errors or hallucinations in the responses of the models being tested.
- Difficulty with Logic and Math. An LLM judge may not be able to fully follow a complex line of reasoning or verify a proof, which can lead to evaluation errors.
- Biases. Despite mitigation efforts, the judge model may retain a bias toward a particular style or format of response.
These aspects mean that human oversight or combined evaluation methods are still desirable for mission-critical applications.
Development and Extensions
The success of MT-Bench has spurred the development of extended versions. In 2024, MT-Bench-101 was proposed, a benchmark aimed at an even more fine-grained analysis of models' conversational abilities. Its authors created a three-level taxonomy of skills and compiled a much larger dataset, which allowed them to identify subtle differences in model behavior at various stages of a dialogue[4].
Further Reading
- Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
- Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
- Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
- Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
- Ma, Z. et al. (2021). Dynaboard: An Evaluation-as-a-Service Platform for Holistic Next-Generation Benchmarking. arXiv:2106.06052.
- Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
- Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
- Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
- Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
- Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
Notes
- [1] Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685.
- [2] "MT-Bench (Multi-turn Benchmark)." Klu.ai Glossary.
- [3] "MT-Bench - GM-RKB." GaborMelli.com.
- [4] Bai, G. et al. (2024). "MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues." arXiv:2402.14762.