AI search engines fail accuracy test, study finds 60% error rate
A study of AI search tools finds widespread inaccuracy, with Grok-3 at a 94% error rate and ChatGPT accurate only 28% of the time.

A recent study by the Tow Center for Digital Journalism highlights the inaccuracy of current AI search engines, revealing a collective error rate of more than 60%. The researchers tested eight tools: ChatGPT Search, Perplexity, Perplexity Pro, Gemini, DeepSeek Search, Grok-2 Search, Grok-3 Search, and Copilot. Each was assessed against 200 news articles drawn from 20 publishers, with responses scored on whether they correctly cited the original article, the news organization, and the URL. The findings reveal pervasive inaccuracies and cast serious doubt on the reliability of today's AI search engines.
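To make that methodology concrete, here is a minimal sketch of how such a citation check might be scored programmatically. The data structures, field names, and three-way grading below are illustrative assumptions for this article, not the Tow Center's actual evaluation harness, which may have used different criteria and finer-grained categories.

```python
# Hypothetical scoring sketch: for each AI response, check whether it
# correctly cites the source article's headline, publisher, and URL.
# All names and the grading scheme are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Citation:
    headline: str
    publisher: str
    url: str


def score_response(expected: Citation, returned: Citation) -> str:
    """Classify a response as 'correct', 'partially correct', or 'incorrect'."""
    matches = [
        expected.headline.strip().lower() == returned.headline.strip().lower(),
        expected.publisher.strip().lower() == returned.publisher.strip().lower(),
        expected.url.rstrip("/") == returned.url.rstrip("/"),
    ]
    if all(matches):
        return "correct"
    if any(matches):
        return "partially correct"
    return "incorrect"


def error_rate(labels: list[str]) -> float:
    """Share of responses that were not fully correct, the study's headline metric."""
    return sum(label != "correct" for label in labels) / len(labels)
```

Under a scheme like this, a tool's error rate counts every response that is anything short of fully correct, which is how a collective figure above 60% can coexist with many answers that are only partially wrong.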
Performance varied widely, and some models fared especially badly. Grok-3 Search answered 94% of queries incorrectly. Microsoft's Copilot, though not as error-prone as Grok-3, struggled in its own way: it declined 104 of its 200 queries (52%) and was completely correct in just 16 instances (8%). Across the board, the tools tended to deliver inaccurate information with unfounded confidence rather than acknowledge uncertainty.
Anecdotal observations mirror the quantitative findings. Ted Gioia, writing at The Honest Broker, noted ChatGPT's tendency to assert falsehoods with authority, joking that, judging by its responses, he could let it handle his responsibilities while he went on vacation. In the same vein, the study echoes earlier criticism that large language models often double down on incorrect assertions, misleading users who take their confident answers at face value.
Despite the bleak findings, some tech writers continue to praise these tools. Lance Ulanoff of TechRadar expressed satisfaction with ChatGPT Search, describing it as fast, aware, and accurate. His experience points to the subjective nature of perceived AI reliability: individual impressions can diverge sharply from measured accuracy, and positive experiences may persist depending on context and use case.
The scrutiny also extends to transparency and pricing. Despite their high error rates, premium tiers such as Perplexity Pro and Grok-3 charge substantial fees, ranging from $20 to $200 per month. The gap between service quality and cost has raised eyebrows and underscores the need for transparency and accountability from providers. As AI search matures, these concerns should guide future iterations toward more consistent and reliable output.
Sources: Tow Center for Digital Journalism, TechSpot, The Honest Broker, TechRadar, Columbia Journalism Review.