OpenAI's new reasoning models tend to hallucinate more often
OpenAI's latest AI models hallucinate more, posing challenges.

OpenAI's recent introduction of the o3 and o4-mini models marks an advance in AI reasoning capabilities, yet it also brings an unexpected increase in hallucination rates. Despite stronger performance on coding and mathematical tasks, these models fabricate claims more often than their predecessors. According to OpenAI's internal evaluations, o3 hallucinates on 33% of questions in the PersonQA benchmark, while o4-mini reaches an even higher 48%. These figures are significantly higher than those of older models such as o1 and o3-mini, which hallucinate on 16% and 14.8% of questions respectively.
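For readers unfamiliar with how such figures are produced, here is a minimal sketch of how a hallucination rate might be tallied on a question-answering benchmark like PersonQA; the `contains_fabrication` grader is a hypothetical stand-in, since OpenAI's actual scoring pipeline is not public:

```python
from typing import Callable

def hallucination_rate(
    answers: list[str],
    references: list[str],
    contains_fabrication: Callable[[str, str], bool],
) -> float:
    """Return the fraction of answers flagged as containing fabricated claims.

    `contains_fabrication` is a hypothetical grader (a human rater or an
    automated fact-checker) that compares a model's answer against the
    reference facts for that question.
    """
    flagged = sum(
        1
        for answer, reference in zip(answers, references)
        if contains_fabrication(answer, reference)
    )
    return flagged / len(answers)

# Example: 33 flagged answers out of 100 questions gives 0.33,
# matching the 33% figure OpenAI reports for o3 on PersonQA.
```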
The rise in hallucinations has left OpenAI puzzled; its technical report notes that more research is needed to fully understand the underlying causes. Sarah Schwettmann of Transluce suggested that the reinforcement learning methods used to train these models could be inflating hallucination rates. Transluce has also documented instances in which o3 claimed to have taken actions it could not have, such as running code on hardware it has no access to, compounding reliability concerns.
These inaccuracies carry real practical costs: Schwettmann noted they may make o3 less useful than it would otherwise be, and businesses that prioritize accuracy, such as law firms, cannot afford errors in documentation. Even so, professor Kian Katanforoosh reported that o3 performs strongly at coding, although it sometimes hallucinated broken website links.
While hallucinations can help models produce creative or novel ideas, they pose serious problems for sectors that demand precision. OpenAI is exploring remedies such as giving models web search capabilities to improve accuracy. Equipping GPT-4o with web search, for instance, raises its accuracy to 90% on the SimpleQA benchmark, suggesting a possible path to more reliable reasoning models.
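To illustrate the idea behind search-augmented answering, here is a minimal sketch under stated assumptions: `search_web` and `ask_model` are hypothetical placeholders for a search backend and a model call, not OpenAI's actual API.

```python
def answer_with_search(question, search_web, ask_model, top_k=3):
    """Ground a model's answer in retrieved snippets before it responds.

    `search_web(query)` and `ask_model(prompt)` are hypothetical stand-ins;
    the real integration inside ChatGPT is not public.
    """
    snippets = search_web(question)[:top_k]  # retrieve a few relevant passages
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    # Answering from retrieved sources is intended to reduce fabrication.
    return ask_model(prompt)
```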
The heightened incidence of hallucinations comes amid a broader industry shift toward reasoning models, which promise better performance without ever-greater compute and data demands during training. As long as hallucinations remain an obstacle, they will be a critical area of ongoing research. OpenAI spokesperson Niko Felix reiterated the company's commitment to reducing these inaccuracies, affirming that work to improve the models' accuracy and reliability continues.
Sources: TechCrunch, OpenAI, Transluce