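LMArena (Chatbot Arena): the large-model arena

LMArena (Large Model Arena, formerly Chatbot Arena) is an open web platform that uses crowdsourcing to evaluate and compare large models (LLMs and multimodal models) according to human preference. Its hallmarks are anonymous pairwise comparisons and public leaderboards^[1]^[2].

The platform grew out of LMSYS, a research initiative launched jointly by UC Berkeley, Carnegie Mellon University, and UC San Diego [41]. In September 2024 it "graduated" and launched a standalone site, lmarena.ai^[3]. In May 2025 it was formally incorporated as a company and raised a $100 million seed round from investors including a16z and UC Investments to develop its open evaluation infrastructure^[4]^[5].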

History

The platform launched in May 2023 under the name Chatbot Arena. In spring 2025 it was officially renamed LMArena (Large Model Arena) and became an independent organization.

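May 3, 2023: Chatbot Arena launches and publishes its first leaderboard based on anonymous "battles"^[6].
2023: Dataset releases: 33K pairwise conversations (July) and LMSYS-Chat-1M (September, one million real-world conversations)^[7]^[8].
September 20, 2024: "Graduation": the platform moves to its own domain, lmarena.ai^[3].
2024–2025: Evaluation methods and arena types are expanded (Arena-Hard, Style/Sentiment Control, WebDev/RepoChat, and others)^[9]^[10]^[11]^[12].
April 27, 2025: More than 3 million votes collected in total, with over 400 public models and more than 300 private preview models evaluated^[13].
May 21, 2025: LMArena announces its incorporation and a $100 million seed round^[4]^[5].
July 31, 2025: An open dataset of 140K recent Text Arena conversations is released^[14].
August 26–27, 2025: Gemini 2.5 Flash Image is tested anonymously under the codename "nano-banana"; the model then tops the Text-to-Image and Image Edit leaderboards^[15]^[16].
August 28, 2025: Microsoft MAI-1-preview is added to the text leaderboard (see the Changelog)^[17].
Current status: the Text Arena tab shows 4,075,191 votes (updated September 8, 2025)^[18].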

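How evaluation works

A user enters a prompt and receives two responses from two randomly selected anonymous models ("A" and "B"), then votes for the better one (or marks a tie, or both as bad). Rankings are based on the Bradley-Terry model, a logistic regression over pairwise preferences that is intuitively similar to the Elo rating system^[1]. The platform publishes an Arena Score with confidence intervals and applies a sample re-weighting correction so that estimates stay unbiased when models are sampled unevenly^[19].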
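To make the ranking step concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise votes by plain gradient ascent. It is an illustration, not the platform's actual pipeline: the function names `fit_bradley_terry` and `to_elo_scale`, the treatment of a tie as half a win, and the Elo-style base of 1000 and scale of 400 are all assumptions made for this example.

```python
import math

def fit_bradley_terry(battles, steps=2000, lr=1.0):
    """Fit log-strengths theta by gradient ascent on the Bradley-Terry likelihood.

    battles: iterable of (model_a, model_b, outcome), where outcome is 1.0 if A won,
    0.0 if B won, and 0.5 for a tie (a simplifying assumption of this sketch).
    The model is P(A beats B) = 1 / (1 + exp(-(theta_a - theta_b))).
    """
    battles = list(battles)
    theta = {m: 0.0 for a, b, _ in battles for m in (a, b)}
    for _ in range(steps):
        grad = dict.fromkeys(theta, 0.0)
        for a, b, y in battles:
            p = 1.0 / (1.0 + math.exp(-(theta[a] - theta[b])))
            grad[a] += y - p   # observed minus predicted outcome for A
            grad[b] -= y - p
        for m in theta:
            theta[m] += lr * grad[m] / len(battles)
    # Only differences are identified, so centre the log-strengths at zero.
    mean = sum(theta.values()) / len(theta)
    return {m: t - mean for m, t in theta.items()}

def to_elo_scale(theta, base=1000.0, scale=400.0):
    """Map log-strengths to an Elo-like scale (base and scale are conventions)."""
    return {m: base + scale * t / math.log(10) for m, t in theta.items()}

# Toy data: model A beats model B in 3 of 4 battles.
battles = [("A", "B", 1.0)] * 3 + [("A", "B", 0.0)]
theta = fit_bradley_terry(battles)
scores = to_elo_scale(theta)
```

With three wins out of four, the fitted gap `theta["A"] - theta["B"]` converges to ln 3, which is roughly 191 points on the Elo-like scale. This is the sense in which Bradley-Terry scores read like Elo ratings even though they are estimated from all votes at once rather than updated game by game.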

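Transparency and openness. The original evaluation and ranking pipeline is open source in the FastChat repository^[20], and the platform periodically releases portions of the raw data for verification and research (for example, 140K conversations released in July 2025)^[19]^[14]. According to the FAQ and the warning on the home page, user prompts may be disclosed to model providers and partially published for research purposes, so sensitive data should not be submitted^[21]^[22].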

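Selection and sampling rules. The leaderboard lists publicly released models (open weights, public API, or public service). A stable score typically requires at least 1,000 votes; at least 20% of battles are held between public models only; a model's sampling probability rises with its score and with its uncertainty, while the re-weighted regression keeps the final scores unbiased^[19].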
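The need for re-weighting can be seen in a small sketch. The text above does not spell out the exact estimator, so this example assumes inverse-probability weighting, one standard choice: each battle is weighted by the reciprocal of the probability that its pair was sampled. The function `reweighted_win_rate`, the toy battles, and the sampling probabilities are all hypothetical, and a win rate is used in place of the full regression to keep the example short.

```python
def reweighted_win_rate(battles, pair_prob, model):
    """Inverse-probability-weighted win rate of `model`.

    battles: list of (model_a, model_b, winner); winner is a model name, or None for a tie.
    pair_prob: dict mapping frozenset({a, b}) to the probability that the pair was sampled.
    """
    won = total = 0.0
    for a, b, winner in battles:
        if model not in (a, b):
            continue
        weight = 1.0 / pair_prob[frozenset((a, b))]  # rarely sampled pairs count more
        total += weight
        if winner == model:
            won += weight
        elif winner is None:
            won += 0.5 * weight
    return won / total

# Toy data: A mostly meets the weaker model C (90 battles) and rarely the stronger B (10).
battles = ([("A", "B", "A")] * 4 + [("A", "B", "B")] * 6
           + [("A", "C", "A")] * 81 + [("A", "C", "C")] * 9)
pair_prob = {frozenset(("A", "B")): 0.1, frozenset(("A", "C")): 0.9}

naive = sum(w == "A" for _, _, w in battles) / len(battles)
corrected = reweighted_win_rate(battles, pair_prob, "A")
```

Here the naive win rate is 0.85 because most of A's battles are against the weaker opponent, while the re-weighted value is 0.65, the average of A's 0.4 win rate against B and 0.9 against C that uniform sampling would produce.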

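Automatic metrics and style control. To speed up evaluation and reduce the influence of "style" preferences, the platform uses auxiliary methods: MT-Bench (LLM-as-a-judge)^[23], Arena-Hard (automatically generated hard questions)^[9], and Style/Sentiment Control, which models and "corrects for" the effect of tone and sentiment on preference^[10]. For Arena-Hard-Auto, agreement with "ground-truth" human votes is reported to be very high under controlled conditions (up to about 98.6%)^[24].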
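One way to picture style control is as an extra covariate in the pairwise logistic regression: alongside the model strengths, the fit estimates a coefficient for a style feature, so that part of the preference is attributed to style rather than to the model. The sketch below assumes a single hypothetical feature `style_diff` (for example, a normalized length difference between the two answers) and uses made-up data; the function name `fit_with_style` and the feature itself are illustrative choices, not the platform's documented implementation.

```python
import math

def fit_with_style(battles, steps=3000, lr=1.0):
    """Bradley-Terry fit with one style covariate, by gradient ascent.

    battles: list of (model_a, model_b, style_diff, outcome), where style_diff is the
    style feature of A's answer minus B's, and outcome is 1.0 if A won, 0.0 if B won.
    The model is P(A wins) = sigmoid(theta_a - theta_b + beta * style_diff).
    Returns (theta, beta).
    """
    theta = {m: 0.0 for a, b, _, _ in battles for m in (a, b)}
    beta = 0.0
    n = len(battles)
    for _ in range(steps):
        g_theta = dict.fromkeys(theta, 0.0)
        g_beta = 0.0
        for a, b, x, y in battles:
            p = 1.0 / (1.0 + math.exp(-(theta[a] - theta[b] + beta * x)))
            g_theta[a] += y - p
            g_theta[b] -= y - p
            g_beta += (y - p) * x
        for m in theta:
            theta[m] += lr * g_theta[m] / n
        beta += lr * g_beta / n
    return theta, beta

# Synthetic data (made up for illustration): A usually gives the longer answer.
longer = [("A", "B", 1.0, 1.0)] * 24 + [("A", "B", 1.0, 0.0)] * 6     # A longer: wins 24 of 30
shorter = [("A", "B", -1.0, 1.0)] * 4 + [("A", "B", -1.0, 0.0)] * 6   # A shorter: wins 4 of 10
battles = longer + shorter

theta, beta = fit_with_style(battles)
controlled_gap = theta["A"] - theta["B"]

# Same votes with the style feature zeroed out, i.e. no style control.
plain, _ = fit_with_style([(a, b, 0.0, y) for a, b, _, y in battles])
plain_gap = plain["A"] - plain["B"]
```

On this toy data the uncontrolled gap between A and B is about 0.85 in log-odds, while the style-controlled gap shrinks to about 0.49, with a style coefficient near 0.90 absorbing the preference for longer answers.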

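Arenas and evaluation domains

The platform has grown into a family of "arenas" for different task types:

Text Arena: general-purpose dialogue and tasks; the main leaderboard^[18].
Vision Arena: multimodal models for "text → image / video / image analysis"^[25].
Text-to-Image and Image Edit: image generation and editing (including the nano-banana case)^[16]^[15].
Text-/Image-to-Video: video generation^[26].
WebDev Arena: building web applications from a description^[11].
RepoChat Arena: AI software-engineering tasks centered on code and repositories^[12].
Search Arena: models with integrated web search; first launched in April 2025 (legacy version), later moved to the main site, with an accompanying dataset and publications^[27]^[28]^[29].
BiomedArena.AI: domain-specific evaluation for biomedical tasks (in partnership with DataTecnica)^[30].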

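Applications and impact

Industry showcase. Major vendors (OpenAI, Anthropic, Google, and others) regularly test and showcase their models on LMArena, and trade media describe the platform as an important reference benchmark^[5]^[31]. An industry-track paper at NAACL 2025 calls Chatbot Arena's Elo scores the "gold industry-standard"^[32].
Pre-release testing. The policy allows anonymous previews of "unreleased" models; after release, the community is notified and the public evaluation results are published. A stable score requires at least about 1,000 votes^[19].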
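Notable incidents. In spring 2025, discussion of the anonymous model Llama-4 Maverick-03-26-Experimental (a controversy over how it compared with the publicly released version) drew wide media attention and led the platform to update its rules and communications^[33]^[34]. In August 2025, "nano-banana" was revealed to be Gemini 2.5 Flash Image and took the lead on the image generation and editing leaderboards^[15]^[16].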

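Limitations and criticism

Despite the platform's scale and popularity, its methodology has limitations:

Subjectivity and style effects. Voting preferences depend on the tone and style of a response; the team is introducing Style/Sentiment Control to decouple "style" from "content"^[10].
Unrepresentative audience. The core active users are tech enthusiasts and developers; to cover domain-specific scenarios, the platform has created specialized arenas (Search, WebDev, Biomed, and others)^[35].
Vulnerability to manipulation and bias. Studies from 2025 show that, without strict safeguards, rankings can be "rigged" with hundreds to thousands of votes. Collaboration between the researchers and LMArena led the platform to introduce protections (CAPTCHA, login, bot protection, anomaly detection) that raise the "cost of attack"^[36]^[37]^[38].
Methodological criticism. The paper The Leaderboard Illusion (April 2025) points to systemic and institutional factors that may distort the playing field; LMArena published a detailed response and maintains a public methodology changelog^[39]^[40]^[17].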

Links

References

Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
Li, T. et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv:2406.11939.
Ameli, S.; Zhuang, S.; Stoica, I.; Mahoney, M. W. (2024). A Statistical Framework for Ranking LLM-Based Chatbots. arXiv:2412.18407.
Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
Huang, J. Y.; Shen, Y.; Wei, D.; Broderick, T. (2025). Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings. arXiv:2508.11847.
Xu, Y.; Ruis, L.; Rocktäschel, T.; Kirk, R. (2025). Investigating Non-Transitivity in LLM-as-a-Judge. arXiv:2502.14074.
Li, H. et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579.
Zheng, L. et al. (2024). LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. arXiv:2309.11998.
Dubois, Y. et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475.
Singh, S. et al. (2025). The Leaderboard Illusion. arXiv:2504.20879.
Min, R.; Pang, T.; Du, C.; Liu, Q.; Cheng, M.; Lin, M. (2025). Improving Your Model Ranking on Chatbot Arena by Vote Rigging. arXiv:2501.17858.

Notes

[1] Chiang, W.-L. et al. "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference". arXiv:2403.04132, 2024.
[2] "Hello from LMArena: The Community Platform for Exploring Frontier AI". LMArena Blog, June 23, 2025.
[3] "Announcing a New Site for Chatbot Arena". LMSYS Blog, September 20, 2024.
[4] "LMArena Secures $100M in Seed Funding…". PR Newswire, May 21, 2025.
[5] Wiggers, K. "LM Arena… lands $100M". TechCrunch, May 21, 2025.
[6] "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings". LMSYS Blog, May 3, 2023.
[7] "Chatbot Arena Conversation Dataset Release". LMSYS Blog, July 20, 2023.
[8] Zheng, L. et al. "LMSYS-Chat-1M". arXiv:2309.11998, 2023.
[9] Li, T. et al. "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline". arXiv:2406.11939, 2024.
[10] "Does Sentiment Matter Too? Introducing Sentiment Control". LMArena Blog, April 22, 2025.
[11] "WebDev Arena: A Live LLM Leaderboard for Web App Development". LMArena Blog, March 10, 2025.
[12] "RepoChat Arena: A Live Benchmark for AI Software Engineers". LMArena Blog, April 9, 2025.
[13] "Celebrating Community Impact: 3M+ votes, 400+ models…". LMArena Blog, April 27, 2025.
[14] Song, Y. "A Deep Dive into Recent Arena Data". LMArena Blog, July 31, 2025.
[15] "Nano Banana (Gemini 2.5 Flash Image): Try it on LMArena". LMArena Blog, August 27, 2025.
[16] Text-to-Image Arena. LMArena, updated August 25, 2025.
[17] Leaderboard Changelog. LMArena Blog, August 2025 entries.
[18] Text Arena (English). LMArena.
[19] LMArena Leaderboard Policy. LMArena Blog, revised September 8, 2025.
[20] lm-sys/FastChat (GitHub).
[21] FAQ. LMArena.
[22] LMArena home page (disclaimer that data may be made public and shared with providers).
[23] Zheng, L. et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena". arXiv:2306.05685, 2023.
[24] Li, T. et al. "From Crowdsourced Data…" arXiv:2406.11939 (agreement table).
[25] Vision Arena. LMArena, updated September 2, 2025.
[26] Text-to-Video and Image-to-Video Leaderboards. LMArena, August 2025.
[27] "Introducing the Search Arena". LMArena Blog, April 14, 2025.
[28] "Search Arena & What We're Learning About Human Preference". LMArena Blog, July 23, 2025.
[29] Frick, E. et al. "Search Arena: Analyzing Search-Augmented LLMs". arXiv:2506.05334, 2025.
[30] "Introducing BiomedArena.AI". LMArena Blog, August 19, 2025.
[31] Google. "Gemma 3…", March 12, 2025 (cites LMArena results).
[32] Spangher, L. et al. "Chatbot Arena Estimate…". NAACL Industry, 2025.
[33] "Meta's experimental Llama 4 model briefly topped AI leaderboard…". The Register, April 7, 2025.
[34] LMArena's official clarification and posts on X about the incident (April 2025).
[35] "Search Arena & What We're Learning…". LMArena Blog, July 23, 2025.
[36] Min, R. et al. "Improving Your Model Ranking on Chatbot Arena by Vote Rigging". arXiv:2501.17858, 2025.
[37] Huang, Y. et al. "Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards". arXiv:2501.07493, 2025.
[38] "Hundreds of rigged votes can skew…". Fast Company, February 6, 2025.
[39] Singh, S. et al. "The Leaderboard Illusion". arXiv:2504.20879, 2025.
[40] "Our Response to 'The Leaderboard Illusion'". LMArena Blog, May 9, 2025.
