Crowdsourced AI benchmarks have serious flaws, some experts say
Experts question the integrity of crowdsourced AI benchmarks.

Leading AI labs such as OpenAI, Google, and Meta increasingly rely on crowdsourced benchmarking platforms to assess their models' capabilities, and experts are questioning the practice on both ethical and academic grounds. These platforms recruit users to evaluate models, and labs often publicize the results as evidence of meaningful progress. But Emily Bender, a linguistics professor at the University of Washington, argues that platforms like Chatbot Arena fall short of the standards a credible benchmark requires: they have not shown that users' votes for one model's output over another's actually correspond to any well-defined construct the benchmark claims to measure.
Asmelash Teka Hadgu, co-founder of the AI firm Lesan, criticizes labs for using benchmarks like Chatbot Arena to promote exaggerated claims, pointing to an incident in which Meta's Maverick model was tuned to appear stronger on the leaderboard than the model users actually received. He argues that benchmarks should be dynamic rather than static data sets controlled by a few entities, distributed across independent institutions, and tailored to specific industry needs, with evaluations carried out by practicing professionals in those fields. He also stresses that evaluators deserve proper recognition and compensation, so the field avoids repeating the exploitative patterns seen in data labeling.
Kristine Gloria, formerly of the Aspen Institute's Emergent and Intelligent Technologies Initiative, sees value in crowdsourced evaluation but cautions against treating it as the sole basis for AI benchmarking. Drawing a parallel to citizen science initiatives, she supports bringing diverse perspectives into model evaluation and improvement. Matt Frederikson, CEO of Gray Swan AI, notes that volunteers engage with his company's platform willingly, but acknowledges that crowdsourced benchmarks are no substitute for paid, private assessments and stresses that developers must communicate results clearly.
Alex Atallah, CEO of OpenRouter, and Wei-Lin Chiang, an AI doctoral student at UC Berkeley and co-founder of LMArena, which runs Chatbot Arena, agree that open testing alone is not sufficient. Acknowledging discrepancies like the Maverick case, Chiang says the team has refined Chatbot Arena's policies to ensure fair and consistent evaluations, and he maintains that the platform remains a free space for the community to give feedback on AI models.
The call for more rigorous, well-defined benchmarks resonates across the industry, pressing for standards that balance scientific integrity with ethical considerations. Benchmarks that evolve with the technology, paired with fair recognition and compensation for the people doing the evaluating, reflect how the field needs to mature alongside stronger ethical oversight.
Sources: Emily Bender, Asmelash Teka Hadgu, Kristine Gloria, Matt Frederikson, Alex Atallah, Wei-Lin Chiang.