OpenAI's latest o3 and o4-mini models perform exceptionally well in coding and math, but they tend to hallucinate more frequently
OpenAI's o3 and o4-mini excel in coding and math but hallucinate more often, at rates of 33% and 48% on PersonQA.

OpenAI has introduced its latest artificial intelligence models, o3 and o4-mini, which demonstrate significant advances in coding, math, and multimodal reasoning, positioning them among the company's most capable tools to date. Despite these achievements, however, a critical issue has come to light: the models hallucinate more often than their predecessors. TechCrunch reports that o3 shows a hallucination rate of 33% on OpenAI's PersonQA benchmark, while o4-mini's rate rises to 48%.
This development reverses a trend that had characterized the progress of OpenAI's models, in which prior iterations such as o1, o1-mini, o3-mini, and GPT-4o showed steadily declining hallucination rates. The reasons behind the regression remain unclear; OpenAI's own researchers acknowledge that more investigation is needed into why scaling up reasoning models could exacerbate the problem. Neil Chowdhury, a researcher at Transluce and former OpenAI employee, posited that the reinforcement learning techniques used to train these models may amplify issues that were previously mitigated.
The implications of these hallucinations are significant wherever accuracy is paramount, such as in the legal and financial sectors. Sarah Schwettmann of Transluce notes that the higher hallucination rates could limit the models' usefulness in real-world applications, and Stanford's Kian Katanforoosh highlights the problem in coding tasks, where o3 often generates non-functional website links, a substantial risk to business operations that rely on its output.
To address these challenges, OpenAI acknowledges that improving accuracy and reliability remains an ongoing effort. One potential remedy under exploration is integrating web search capabilities: GPT-4o reaches 90% accuracy on OpenAI's SimpleQA benchmark when search is enabled, suggesting that grounding responses in verifiable information can markedly improve factuality, an especially valuable trait when user trust and utility are at stake.
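The grounding idea behind that result can be illustrated with a minimal, purely hypothetical sketch (all function names and data here are invented for illustration and are not OpenAI's API): a model that only asserts an answer when a retrieved snippet supports it, and abstains otherwise.

```python
# Hypothetical sketch of retrieval grounding: the "model" may only state
# answers it can support with a retrieved snippet; otherwise it abstains
# rather than hallucinating. All names are invented for illustration.

def retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    """Toy 'web search': return snippets sharing at least one word with the query."""
    words = set(query.lower().split())
    return [text for text in corpus.values()
            if words & set(text.lower().split())]

def grounded_answer(query: str, candidate: str, corpus: dict[str, str]) -> str:
    """Emit the candidate answer only if a retrieved snippet contains it."""
    snippets = retrieve(query, corpus)
    if any(candidate.lower() in s.lower() for s in snippets):
        return candidate
    return "I don't know"  # abstain instead of guessing

corpus = {
    "benchmark": "On the PersonQA benchmark o3 hallucinated in 33 percent of answers.",
}

print(grounded_answer("o3 PersonQA hallucination rate", "33 percent", corpus))
print(grounded_answer("o3 PersonQA hallucination rate", "5 percent", corpus))
```

The key design choice is the abstention path: a grounded system trades some coverage (answering "I don't know") for a lower rate of confidently stated falsehoods, which is the trade-off the SimpleQA result points to.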
The AI industry's focus is shifting toward reasoning models because they can handle complex tasks without requiring substantially more training data or compute. Yet, as the case of o3 and o4-mini illustrates, this promising direction presents its own hurdles, notably the risk of hallucination, which threatens the credibility and real-world application of such advanced AI systems.
Sources: TechCrunch, TechSpot