OpenAI's o3 AI model scores lower on a benchmark than the company initially indicated
OpenAI's o3 AI model underperforms in third-party tests, raising transparency concerns.

When OpenAI launched its o3 AI model, it claimed the model could answer just over 25% of problems on FrontierMath, a challenging math benchmark, far outpacing competing models that scored below 2%. Recent independent testing by Epoch AI, however, found that o3 scored only around 10%, well below OpenAI's announced figure. The gap appears to stem from the fact that the version OpenAI benchmarked internally ran with more computing power than the version later released to the public. Epoch AI also noted that it evaluated an updated release of FrontierMath, so differences in both compute and the problem set used may explain the discrepancy.
The ARC Prize Foundation corroborated Epoch's findings, confirming that the public o3, tuned for chat and product use, is a different model from the pre-release versions it had tested. Costs varied sharply across o3's compute tiers, with the highest tier running about $34.4k per task, and larger compute tiers tend to score better on benchmarks. OpenAI's Wenda Zhou addressed the gap, explaining that the public model is optimized for real-world applications and speed rather than benchmark performance, which can produce such disparities.
Despite the benchmark discrepancy, OpenAI plans to release stronger variants such as o3-pro, which it expects to surpass its predecessors. Meanwhile, its o3-mini-high and o4-mini models reportedly already outperform o3 on FrontierMath. The episode underscores a recurring theme in the AI industry: benchmark results are easy to misinterpret and should be treated with caution, especially when they come from the vendor itself.
Benchmark controversies are not unique to OpenAI. The industry has repeatedly seen mismatches between published scores and the models actually made available to users, eroding stakeholder trust. Accusations of misleading benchmark presentations, such as those involving xAI's Grok 3 and Meta's models, have made observers wary of inflated claims deployed to capture market attention.
These episodes reflect a broader industry challenge: transparency around AI model performance is essential to maintaining credibility. As scrutiny intensifies, particularly where academic partnerships and financial backing are involved, the AI community must balance commercial ambitions against ethical practice, ensuring that the results it publicizes match what is actually available to consumers.
Sources: OpenAI, Epoch AI, ARC Prize Foundation, Wenda Zhou, Mike Knoop