LMArena (Chatbot Arena)
LMArena (Large Model Arena, formerly known as Chatbot Arena) is an open web platform for crowdsourced, human-preference evaluation and comparison of large language and multimodal models, built around anonymous pairwise comparisons ("battles") and public leaderboards[1][2].
The platform grew out of the LMSYS research initiative (UC Berkeley, CMU, UC San Diego). In September 2024, it "graduated" to its own website, lmarena.ai[3], and in 2025 it was spun out as a company, raising a $100M seed round (from a16z, UC Investments, and others) in May 2025 to develop its open evaluation infrastructure[4][5].
History
The platform was launched in May 2023 under the name Chatbot Arena. In the spring of 2025, it was officially renamed LMArena (Large Model Arena) and established as an independent organization.
- May 3, 2023 — Chatbot Arena launches, the first leaderboard based on anonymous "battles"[6].
- 2023 — Dataset releases: 33K pairwise conversations (July) and LMSYS-Chat-1M (September, 1 million real-world conversations)[7][8].
- September 20, 2024 — "Graduation": moved to its own domain, lmarena.ai[3].
- 2024–2025 — Expansion of methodology and arenas (Arena-Hard, Style/Sentiment Control, WebDev/RepoChat, etc.)[9][10][11][12].
- April 27, 2025 — A total of 3+ million votes, 400+ public models, and 300+ private previews collected[13].
- May 21, 2025 — LMArena announces its incorporation and a $100M seed round[4][5].
- July 31, 2025 — Release of an open dataset of 140k recent conversations from the Text Arena[14].
- August 26–27, 2025 — Anonymous testing of Gemini 2.5 Flash Image under the codename "nano-banana"; the model subsequently topped the Text-to-Image and Image Edit leaderboards[15][16].
- August 28, 2025 — Microsoft MAI-1-preview is added to the text leaderboard (see Changelog)[17].
- Status: The Text Arena tab reports 4,075,191 votes (updated September 8, 2025)[18].
How Evaluation Works
A user enters a prompt and receives two responses from randomly selected anonymous models ("A" and "B"), then votes for the better one (or declares a tie, or that both are bad). Rankings are computed with the Bradley–Terry statistical model (logistic regression on pairwise preferences), which behaves much like the Elo rating system[1]. The platform publishes an Arena Score with confidence intervals and applies sample corrections (re-weighting) to keep scores unbiased despite non-uniform sampling of model pairs[19].
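The core computation can be sketched in a few lines. The snippet below is a minimal illustration on an invented battle log, not LMArena's production pipeline (that code is open-sourced in FastChat): each battle becomes a ±1 row of a logistic-regression design matrix, the fitted coefficients are Bradley–Terry strengths, and they are mapped onto an Elo-like 400-point scale anchored at 1000.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical battle log: (model_a, model_b, winner), winner in {"a", "b"};
# a fuller treatment would also handle tie and "both bad" votes.
battles = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-x", "b"),
    ("model-x", "model-z", "a"),
    ("model-z", "model-y", "a"),
    ("model-y", "model-z", "b"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# One row per battle: +1 in model A's column, -1 in model B's, label 1 if A won.
# Each fitted coefficient beta_m is a Bradley-Terry strength, so that
# P(A beats B) = sigmoid(beta_A - beta_B).
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for r, (a, b, winner) in enumerate(battles):
    X[r, idx[a]], X[r, idx[b]] = 1.0, -1.0
    y[r] = 1.0 if winner == "a" else 0.0

# sklearn's default L2 penalty keeps the fit stable on this tiny sample.
fit = LogisticRegression(fit_intercept=False).fit(X, y)

# Report on an Elo-like scale (400-point logistic, anchored at 1000).
scores = 400.0 * fit.coef_[0] / np.log(10) + 1000.0
for m in sorted(models, key=lambda m: -scores[idx[m]]):
    print(f"{m}: {scores[idx[m]]:.0f}")
```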
Transparency and Openness. The evaluation and ranking pipelines are open-sourced in the FastChat repository[20], and parts of the raw data are periodically published for verification and research (e.g., the 140K-conversation release in July 2025)[19][14]. According to the FAQ and the disclaimer on the main page, user prompts may be shared with model providers and partially published for research purposes, so sensitive data should not be submitted[21][22].
Selection and Sampling Rules. Leaderboards include publicly accessible models (open weights/public API/public service). Typically, ≥1000 votes are required to stabilize a rating; at least 20% of battles are between public models only; sampling probability increases with rating and uncertainty, and re-weighted regression ensures the final scores are unbiased[19].
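The de-biasing step can be illustrated the same way. In the sketch below, each battle carries an inverse-probability weight 1/p(pair); the pair-sampling probabilities are invented for the example (the real sampler is internal to the platform), and this is not the official implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (model_a, model_b, a_won); battles and sampling rates are invented.
battles = [("x", "y", 1), ("y", "x", 0), ("x", "z", 1), ("z", "y", 1)]
p_shown = {frozenset({"x", "y"}): 0.5,   # hypothetical pair-sampling rates
           frozenset({"x", "z"}): 0.3,
           frozenset({"y", "z"}): 0.2}

idx = {"x": 0, "y": 1, "z": 2}
X = np.zeros((len(battles), len(idx)))
y = np.zeros(len(battles))
w = np.zeros(len(battles))
for r, (a, b, a_won) in enumerate(battles):
    X[r, idx[a]], X[r, idx[b]] = 1.0, -1.0
    y[r] = float(a_won)
    w[r] = 1.0 / p_shown[frozenset({a, b})]  # de-bias non-uniform sampling

fit = LogisticRegression(fit_intercept=False).fit(X, y, sample_weight=w)
print(dict(zip(idx, 400.0 * fit.coef_[0] / np.log(10) + 1000.0)))
```

Over-sampled pairs are down-weighted and rare pairs up-weighted, so the weighted fit estimates what a uniform sample would have produced.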
Auto-Metrics and Style Control. To speed up evaluation and reduce the influence of "stylistic" preferences, auxiliary methods are used: MT-Bench (LLM-as-a-judge)[23], Arena-Hard (automatic construction of difficult questions)[9], and Style/Sentiment Control (modeling and correcting for the effect of tone and emotion on preferences)[10]. For Arena-Hard-Auto, very high correspondence with live human rankings has been reported (≈98.6% under controlled conditions)[24].
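Conceptually, style control extends the same pairwise regression with style covariates, so the per-model coefficients estimate preference net of style. The sketch below uses a single hypothetical covariate (normalized response-length difference) on invented data; the published method uses several normalized features (length, markdown density, etc.), and this is not the official code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (model_a, model_b, len_a, len_b, a_won) -- invented battles.
battles = [("x", "y", 900, 300, 1), ("y", "x", 700, 200, 1),
           ("x", "z", 400, 500, 1), ("z", "y", 600, 350, 1),
           ("y", "z", 300, 800, 0)]
idx = {"x": 0, "y": 1, "z": 2}

X = np.zeros((len(battles), len(idx) + 1))  # last column: style covariate
y = np.zeros(len(battles))
for r, (a, b, la, lb, a_won) in enumerate(battles):
    X[r, idx[a]], X[r, idx[b]] = 1.0, -1.0
    X[r, -1] = (la - lb) / (la + lb)  # normalized length difference
    y[r] = float(a_won)

fit = LogisticRegression(fit_intercept=False).fit(X, y)
strengths, length_bias = fit.coef_[0][:-1], fit.coef_[0][-1]
print("style-adjusted strengths:", dict(zip(idx, strengths)))
print("length-bias coefficient:", length_bias)  # >0: longer answers favored
```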
Arenas and Evaluation Domains
The platform has evolved into a set of "arenas" for different task types:
- Text Arena — General conversations/tasks, the main leaderboard[18].
- Vision Arena — Multimodal models evaluated on image understanding (prompts combining text and images)[25].
- Text-to-Image and Image Edit — Image generation and editing (including the nano-banana case)[16][15].
- Text-/Image-to-Video — Video generation[26].
- WebDev Arena — Building web applications from descriptions[11].
- RepoChat Arena — AI engineering tasks involving code/repositories[12].
- Search Arena — Models with web search integration; first launched in April 2025 on the legacy site, later moved to the main site, accompanied by a dataset and a publication[27][28][29].
- BiomedArena.AI — Domain-specific evaluation for biomedical tasks (in partnership with DataTecnica)[30].
Application and Impact
- Industry Showcase. Major vendors (OpenAI, Anthropic, Google, etc.) regularly test and showcase models on LMArena, and industry media describe the platform as an important benchmark[5][31]. A NAACL 2025 industry-track paper described Chatbot Arena's Elo score as an industry gold standard[32].
- Pre-release Testing. The policy allows for anonymous previews of "unreleased" models, with community notification and subsequent publication of public scores after release; a minimum of ≈1000 votes is required for stabilization[19].
- Notable Episodes. In the spring of 2025, the anonymously tested model Llama-4-Maverick-03-26-Experimental caused an incident (its leaderboard results diverged from those of the publicly released version), which attracted widespread media attention and prompted updates to rules and communications[33][34]. In August 2025, "nano-banana" was revealed to be Gemini 2.5 Flash Image and took top positions in the visual arenas[15][16].
Limitations and Criticism
Despite its scale and popularity, the approach has limitations:
- Subjectivity and Stylistic Effects. Voting preferences depend on the tone/manner of the response; the team is implementing Style/Sentiment Control to decouple "style" from "substance"[10].
- Unrepresentative Audience. The core active user base consists of tech enthusiasts/developers; specialized arenas (Search, WebDev, Biomed, etc.) are created for domain-specific scenarios[35].
- Vulnerability to Manipulation and Bias. Studies from 2025 show that, without strict protections, "vote rigging" strategies using hundreds to thousands of votes are feasible; collaboration between the researchers and LMArena led to protective measures (CAPTCHA, login requirements, bot protection, anomaly detection) that raise the cost of an attack[36][37][38].
- Methodological Criticism. The paper The Leaderboard Illusion (April 2025) points to systematic and institutional factors that can distort the competitive landscape; LMArena published a detailed response and maintains a public changelog of its methodology[39][40][17].
Links
- Official LMArena Website
- LMArena Blog/Policies and Updates
- LMSYS Research Group Website (the project's original incubator)
Further Reading
- Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
- Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
- Li, T. et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv:2406.11939.
- Ameli, S.; Zhuang, S.; Stoica, I.; Mahoney, M. W. (2024). A Statistical Framework for Ranking LLM-Based Chatbots. arXiv:2412.18407.
- Boubdir, M. et al. (2023). Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. arXiv:2311.17295.
- Huang, J. Y.; Shen, Y.; Wei, D.; Broderick, T. (2025). Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings. arXiv:2508.11847.
- Xu, Y.; Ruis, L.; Rocktäschel, T.; Kirk, R. (2025). Investigating Non-Transitivity in LLM-as-a-Judge. arXiv:2502.14074.
- Li, H. et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579.
- Zheng, L. et al. (2023). LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. arXiv:2309.11998.
- Dubois, Y. et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475.
- Singh, S. et al. (2025). The Leaderboard Illusion. arXiv:2504.20879.
- Min, R.; Pang, T.; Du, C.; Liu, Q.; Cheng, M.; Lin, M. (2025). Improving Your Model Ranking on Chatbot Arena by Vote Rigging. arXiv:2501.17858.
Notes
- ↑ 1.0 1.1 Chiang, W.-L. et al. "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference". arXiv:2403.04132, 2024.
- ↑ "Hello from LMArena: The Community Platform for Exploring Frontier AI". LMArena Blog, June 23, 2024. [1]
- ↑ 3.0 3.1 "Announcing a New Site for Chatbot Arena". LMSYS Blog, September 20, 2024. [2]
- ↑ 4.0 4.1 "LMArena Secures $100M in Seed Funding…". PR Newswire, May 21, 2025. [3]
- ↑ 5.0 5.1 5.2 Wiggers, K. "LM Arena… lands $100M". TechCrunch, May 21, 2025. [4]
- ↑ "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings". LMSYS Blog, May 3, 2023. [5]
- ↑ "Chatbot Arena Conversation Dataset Release". LMSYS Blog, July 20, 2023. [6]
- ↑ Zheng, L. et al. "LMSYS‑Chat‑1M". arXiv:2309.11998, 2023. [7]
- ↑ 9.0 9.1 Li, T. et al. "From Crowdsourced Data to High‑Quality Benchmarks: Arena‑Hard and BenchBuilder Pipeline". arXiv:2406.11939, 2024. [8]
- ↑ 10.0 10.1 10.2 "Does Sentiment Matter Too? Introducing Sentiment Control". LMArena Blog, April 22, 2025. [9]
- ↑ 11.0 11.1 "WebDev Arena: A Live LLM Leaderboard for Web App Development". LMArena Blog, March 10, 2025. [10]
- ↑ 12.0 12.1 "RepoChat Arena: A Live Benchmark for AI Software Engineers". LMArena Blog, April 9, 2025. [11]
- ↑ "Celebrating Community Impact: 3M+ votes, 400+ models…". LMArena Blog, April 27, 2025. [12]
- ↑ 14.0 14.1 Y. Song. "A Deep Dive into Recent Arena Data". LMArena Blog, July 31, 2025. [13]
- ↑ 15.0 15.1 15.2 "Nano Banana (Gemini 2.5 Flash Image): Try it on LMArena". LMArena Blog, August 27, 2025. [14]
- ↑ 16.0 16.1 16.2 Text‑to‑Image Arena. LMArena, updated August 25, 2025. [15]
- ↑ 17.0 17.1 Leaderboard Changelog. LMArena Blog, August 2025 entries. [16]
- ↑ 18.0 18.1 Text Arena (English). LMArena. [17]
- ↑ 19.0 19.1 19.2 19.3 LMArena Leaderboard Policy. LMArena Blog, ed. September 8, 2025. [18]
- ↑ lm-sys/FastChat (GitHub). [19]
- ↑ FAQ. LMArena. [20]
- ↑ LMArena main page (disclaimer on potential data publication and sharing with providers). [21]
- ↑ Zheng, L. et al. "Judging LLM‑as‑a‑Judge with MT‑Bench and Chatbot Arena". arXiv:2306.05685, 2023. [22]
- ↑ Li, T. et al. "From Crowdsourced Data…" arXiv:2406.11939 (agreement tables). [23]
- ↑ Vision Arena. LMArena, updated September 2, 2025. [24]
- ↑ Text-to-Video and Image-to-Video Leaderboards. LMArena, August 2025. [25] [26]
- ↑ "Introducing the Search Arena". LMArena Blog, April 14, 2024. [27]
- ↑ "Search Arena & What We’re Learning About Human Preference". LMArena Blog, July 23, 2024. [28]
- ↑ Frick, E. et al. "Search Arena: Analyzing Search‑Augmented LLMs". arXiv:2506.05334, 2025. [29]
- ↑ "Introducing BiomedArena.AI". LMArena Blog, August 19, 2024. [30]
- ↑ Google. "Gemma 3…", March 12, 2025 (link to LMArena results). [31]
- ↑ Spangher, L. et al. "Chatbot Arena Estimate…". NAACL Industry, 2025. [32]
- ↑ "Meta’s experimental Llama 4 model briefly topped AI leaderboard…". The Register, April 7, 2024. [33]
- ↑ Official LMArena clarifications/posts on X regarding the incident (April 2024). [34]
- ↑ "Search Arena & What We’re Learning…". LMArena Blog, July 23, 2024. [35]
- ↑ Min, R. et al. "Improving Your Model Ranking on Chatbot Arena by Vote Rigging". arXiv:2501.17858, 2025. [36]
- ↑ Huang, Y. et al. "Exploring and Mitigating Adversarial Manipulation of Voting‑Based Leaderboards". arXiv:2501.07493, 2025. [37]
- ↑ "Hundreds of rigged votes can skew…". Fast Company, February 6, 2024. [38]
- ↑ Singh, S. et al. "The Leaderboard Illusion". arXiv:2504.20879, 2025. [39]
- ↑ "Our Response to ‘The Leaderboard Illusion’". LMArena Blog, May 9, 2025. [40]