Did xAI lie about Grok 3’s benchmarks?
A dispute has broken out between xAI and OpenAI over the accuracy of the benchmark results reported for Grok 3.

This week, an OpenAI employee accused xAI, Elon Musk's AI company, of publishing misleading benchmark results for Grok 3. Igor Babuschkin of xAI defended the company's approach, pointing to past issues with OpenAI's own charts.
The dispute centers on xAI's graph, which left out o3-mini-high's score under the 'consensus@64' (cons@64) method. That method gives a model 64 attempts at each problem and takes the most frequent answer as its final answer, which tends to raise benchmark scores. Critics argue the omission skewed the perception of Grok 3's capabilities relative to OpenAI's o3-mini-high model.
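To make the scoring difference concrete, here is a minimal sketch of how a single-attempt score and a consensus score can diverge on the same model outputs. The function names (`score_at_1`, `score_cons`) and the sample data are hypothetical, five attempts stand in for 64, and this is not the evaluation code used by xAI or OpenAI.

```python
from collections import Counter

def consensus_answer(samples):
    """Return the most frequent answer among the sampled attempts."""
    return Counter(samples).most_common(1)[0][0]

def score_at_1(attempts, truths):
    """Fraction of problems where the first attempt alone is correct."""
    return sum(a[0] == t for a, t in zip(attempts, truths)) / len(truths)

def score_cons(attempts, truths):
    """Fraction of problems where the majority-vote answer is correct."""
    return sum(consensus_answer(a) == t for a, t in zip(attempts, truths)) / len(truths)

# Made-up data for illustration only: three problems, five attempts each.
attempts = [
    ["42", "42", "17", "42", "42"],  # first attempt right, majority right
    ["9", "8", "8", "8", "9"],       # first attempt wrong, majority right
    ["5", "3", "3", "5", "3"],       # first attempt wrong, majority wrong
]
truths = ["42", "8", "7"]

single = score_at_1(attempts, truths)
consensus = score_cons(attempts, truths)
print(f"single attempt: {single:.2f}, consensus: {consensus:.2f}")
```

In this toy data the consensus score is higher than the single-attempt score, which is why comparing one model's consensus score against another model's single-attempt score can mislead.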
Teortaxes from DeepSeek offered a more balanced graph, noting that the real issue may be the computational and monetary cost each model incurs to reach its top score. Nathan Lambert highlighted unresolved questions about the efficiency of AI models and about how benchmarks often fail to fully reveal a model's limitations and strengths.