MT-Bench (benchmark)
MT-Bench (short for Multi-Turn Benchmark) is a benchmark for evaluating large language models (LLMs) in multi-turn conversational settings. It was proposed in 2023 by a team of researchers from LMSYS (led by Lianmin Zheng) as part of the LLM-as-a-Judge method for systematically comparing the quality of chatbots[1].
Unlike traditional single-turn tests (such as MMLU), MT-Bench evaluates a model's ability to conduct a multi-step conversation, sequentially process new inputs, and accurately follow user instructions. The goal is to provide a more realistic evaluation of chatbot performance in complex scenarios, focusing on alignment with human preferences and the practical requirements of conversational systems[2].
Rationale
The development of conversational LLMs like ChatGPT, GPT-4, and Vicuna revealed a gap between traditional quality metrics and actual user perception of their responses. It became clear that improving a model's alignment with human instructions (through RLHF) did not always lead to higher scores on older, single-turn benchmarks. Tests like MMLU or HELM often failed to distinguish between improved ("aligned") chatbots and their base models. This highlighted the limitations of previous methods, which did not capture the quality of multi-turn interactions and open-ended instructions.
MT-Bench was created in response to this problem, offering a set of open-ended, dialogue-based questions that focus on two aspects: 1. The model's ability to maintain a coherent conversation over several steps (turns). 2. The model's ability to accurately follow complex user instructions[1].
Benchmark Structure and Content
MT-Bench consists of 80 carefully selected multi-turn conversational scenarios covering various types of tasks. Each scenario includes a series of exchanges between the user and the model, testing the model's ability to maintain context and adapt to new inputs. The dialogues are grouped into 8 task categories:
- Writing — testing creative skills (e.g., writing a blog post).
- Roleplay — simulating dialogues in specific roles.
- Extraction — the ability to extract facts from a given context.
- Reasoning — solving logical thinking problems.
- Math — solving mathematical problems.
- Coding — writing or debugging code.
- STEM — questions from science, technology, engineering, and mathematics.
- Humanities — questions related to history, literature, and social sciences.
Each category contains 10 dialogue tasks. The tasks intentionally include tricky follow-ups (such as sudden clarifying questions) to stress-test the model under conditions that approximate a real conversation[3].
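The structure above can be sketched as a simple data model. This is an illustrative sketch, not the official loader: the field names (`question_id`, `category`, `turns`) follow the question format released with the benchmark, but the example question and the validation helper are assumptions made here for clarity.

```python
from dataclasses import dataclass

@dataclass
class MTBenchQuestion:
    question_id: int
    category: str     # one of the 8 task categories listed above
    turns: list[str]  # the user's messages, in order (two per scenario)

# The 8 task categories, 10 scenarios each (80 total).
CATEGORIES = ["writing", "roleplay", "extraction", "reasoning",
              "math", "coding", "stem", "humanities"]

# Hypothetical example in the spirit of the "writing" category.
example = MTBenchQuestion(
    question_id=81,
    category="writing",
    turns=[
        "Compose a short blog post about a recent trip to Hawaii.",
        "Rewrite your previous response, starting every sentence with the letter A.",
    ],
)

def is_valid(q: MTBenchQuestion) -> bool:
    """Sanity-check a record against the structure described above."""
    return q.category in CATEGORIES and len(q.turns) == 2
```

The second turn deliberately depends on the first, which is what forces the model to retain conversational context rather than answer each prompt in isolation.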
Evaluation Method: LLM-as-a-Judge
A key feature of MT-Bench is the use of a powerful language model as a judge for automated response evaluation (LLM-as-a-Judge). In the original paper, the GPT-4 model was used for this role[1].
The evaluation procedure is structured as follows: 1. For each dialogue scenario, the participating models generate responses. 2. The judge model (GPT-4) evaluates these responses, either by pairwise comparison (determining the preferred answer) or by grading each answer on a 10-point scale.
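The pairwise mode, combined with the position-swap mitigation discussed below, can be sketched as follows. This is a minimal illustration, not the official implementation: `call_judge` is a placeholder for a real API call to the judge model, and the prompt wording is an assumption.

```python
# Hedged sketch of pairwise LLM-as-a-Judge with position swapping.
# `call_judge(prompt)` is a hypothetical function that sends the prompt to
# the judge model and returns "A", "B", or "tie".

JUDGE_TEMPLATE = (
    "You are an impartial judge. Given the user question and two answers, "
    "reply with 'A' if Assistant A's answer is better, 'B' if Assistant B's "
    "answer is better, or 'tie'.\n\n"
    "Question: {question}\n\nAssistant A: {answer_a}\n\nAssistant B: {answer_b}"
)

def judge_pair(question, answer_1, answer_2, call_judge):
    """Query the judge twice with the answers in both orders; only a
    preference that survives the swap counts, otherwise it is a tie."""
    first = call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2))
    second = call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1))
    if first == "A" and second == "B":
        return "model_1"  # preferred in both orderings
    if first == "B" and second == "A":
        return "model_2"
    return "tie"          # inconsistent verdicts or an explicit tie
```

Note what the swap buys: a judge that always prefers whichever answer appears first would return "A" in both calls, and the inconsistency is collapsed into a tie instead of rewarding positional bias.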
This automated judging replaces laborious manual annotation. The researchers demonstrated that GPT-4's judgments agree with those of human experts more than 80% of the time, which is comparable to the level of agreement among the human annotators themselves. This indicates the method's reliability and its potential for scaling evaluations without direct human involvement. To improve objectivity, potential biases of the judge model were identified and mitigated, such as positional bias (preferring the response shown first), verbosity bias (preferring longer responses), and self-enhancement bias (favoring its own responses)[1].
Results and Application
MT-Bench revealed significant differences in the capabilities of modern models. In the reasoning, math, and coding categories, GPT-4 substantially outperformed earlier models such as GPT-3.5, and stronger models also proved better at maintaining context across multiple dialogue turns.
Alongside the benchmark, the LMSYS team launched a public leaderboard where models are ranked by their average MT-Bench score and their Elo rating from the Chatbot Arena. The leaderboard is regularly updated to reflect progress in the field. The dataset and the code for running the benchmark have been made publicly available, allowing independent developers to evaluate their own models[2].
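The "average MT-Bench score" used for ranking can be illustrated with a short sketch. This is not the official scoring code; the helper and the sample grades are hypothetical, assuming the 10-point single-answer grading mode in which a model's score is the mean of the judge's grades over all evaluated turns.

```python
from statistics import mean

def mt_bench_score(grades: dict[str, list[float]]) -> float:
    """grades maps each task category to the judge's 1-10 scores for that
    category's turns; the overall score averages across all of them."""
    all_scores = [s for scores in grades.values() for s in scores]
    return round(mean(all_scores), 2)

# Hypothetical per-category grades for one model (two graded turns each).
example_grades = {
    "writing":   [9.0, 8.5],
    "reasoning": [6.0, 5.5],
    "math":      [7.0, 4.5],
}
```

Here `mt_bench_score(example_grades)` averages the six grades to 6.75, matching the intuition that strong writing performance can mask weak math and reasoning in a single aggregate number.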
Limitations and Criticism
Despite its successful application, MT-Bench and the LLM-as-a-Judge approach have several limitations:
- Imperfect Judge. The judge model (e.g., GPT-4) is not infallible: it cannot always recognize factual errors or hallucinations in the responses of the models being tested.
- Difficulty with Logic and Math. An LLM judge may not be able to fully follow a complex line of reasoning or verify a proof, which can lead to evaluation errors.
- Biases. Despite mitigation efforts, the judge model may retain a bias toward a particular style or format of response.
These aspects mean that human oversight or combined evaluation methods are still desirable for mission-critical applications.
Development and Extensions
The success of MT-Bench has spurred the development of extended versions. In 2024, MT-Bench-101 was proposed, a benchmark aimed at an even more fine-grained analysis of models' conversational abilities. Its authors created a three-level taxonomy of skills and compiled a much larger dataset, which allowed them to identify subtle differences in model behavior at various stages of a dialogue[4].
Further Reading
- Liang, P. et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Chang, Y. et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
- Ni, S. et al. (2025). A Survey on Large Language Model Benchmarks. arXiv:2508.15361.
- Biderman, S. et al. (2024). The Language Model Evaluation Harness (lm-eval): Guidance and Lessons Learned. arXiv:2405.14782.
- Kiela, D. et al. (2021). Dynabench: Rethinking Benchmarking in NLP. arXiv:2104.14337.
- Ma, Z. et al. (2021). Dynaboard: An Evaluation-as-a-Service Platform for Holistic Next-Generation Benchmarking. arXiv:2106.06052.
- Goel, K. et al. (2021). Robustness Gym: Unifying the NLP Evaluation Landscape. arXiv:2101.04840.
- Xu, C. et al. (2024). Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244.
- Liu, S. et al. (2025). A Comprehensive Survey on Safety Evaluation of LLMs. arXiv:2506.11094.
- Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
- Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
Notes
- [1] Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685.
- [2] "MT-Bench (Multi-turn Benchmark)." Klu.ai Glossary.
- [3] "MT-Bench - GM-RKB." GaborMelli.com.
- [4] Bai, G. et al. (2024). "MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues." arXiv:2402.14762.