Did xAI lie about Grok 3’s benchmarks?

Debate over benchmark accuracy arose between xAI and OpenAI regarding Grok 3's performance.

xAI is accused of misrepresenting Grok 3's benchmark performance by omitting a critical scoring method. OpenAI contended that excluding cons@64 made Grok 3 seem more advanced than OpenAI's models. While xAI claims consistency in its reporting, the real monetary and computational cost of achieving benchmark scores is often not disclosed.

This week, an OpenAI employee accused xAI, Elon Musk's AI company, of publishing misleading benchmark results for Grok 3. xAI's Igor Babushkin defended the company's approach, countering that OpenAI has published similarly misleading charts in the past.

The contention centers on the omission of the 'consensus@64' (cons@64) scores from xAI's graph. Cons@64 gives a model 64 attempts at each problem and takes its most common answer as the final one, which tends to boost benchmark scores considerably. Critics argue that leaving those scores out skewed perceptions of how Grok 3 compares with OpenAI's o3-mini-high model.
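To make the dispute concrete, here is a rough sketch of how a consensus@64-style metric works. The function name and data layout are illustrative, not taken from xAI's or OpenAI's actual evaluation code:

```python
from collections import Counter

def consensus_score(attempts_per_problem, correct_answers):
    """Majority-vote scoring: the model's final answer for each problem
    is its most common answer across all attempts (e.g. 64 of them)."""
    correct = 0
    for attempts, truth in zip(attempts_per_problem, correct_answers):
        majority_answer, _ = Counter(attempts).most_common(1)[0]
        if majority_answer == truth:
            correct += 1
    return correct / len(correct_answers)

# Toy example: one problem, 64 attempts. The model answers "42"
# 40 times and "41" 24 times, so the majority vote is correct
# even though many individual attempts were wrong.
attempts = [["42"] * 40 + ["41"] * 24]
print(consensus_score(attempts, ["42"]))  # 1.0
```

This is why omitting one model's cons@64 score while showing another's single-attempt score can make an otherwise close comparison look lopsided: the majority vote smooths over individual failures.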

Teortaxes from DeepSeek offered a more balanced graph, arguing that the real issue may be each model's computational and monetary cost of achieving its top score. Nathan Lambert highlighted the unresolved questions this raises about model efficiency, and how benchmarks often fail to reveal models' limitations and strengths comprehensively.