AI fails logic test: studies reveal illusion of reasoning
AI models fail to reason effectively, with performance dropping to 4-24% on complex tasks.

Two studies have critically examined the illusion of reasoning in AI models, exposing how poorly they handle complex logical tasks. Apple's research tested several leading AI models on the Tower of Hanoi puzzle and found that performance dropped sharply as complexity increased. The study showed that current systems often flounder on tasks requiring step-by-step logic because they rely on pattern prediction rather than genuine comprehension of rules and constraints. Despite confidently generating responses, these systems frequently contradict themselves and make blatant errors.
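For context, the Tower of Hanoi is a classic recursion exercise: solving n disks requires 2^n - 1 individual moves, so a correct step-by-step solution grows exponentially with puzzle size. The following minimal Python sketch shows the standard recursive solution and that growth; it illustrates the puzzle itself, not Apple's evaluation code, which the article does not describe.

```python
# Standard recursive Tower of Hanoi: move n disks from `src` to `dst`
# using `aux` as the spare peg. Solving n disks takes 2**n - 1 moves,
# so a fully worked-out solution grows exponentially with n.
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))            # base case: move the single disk
    else:
        hanoi(n - 1, src, aux, dst, moves)  # park n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk
        hanoi(n - 1, aux, dst, src, moves)  # stack the n-1 disks back on top
    return moves

for n in (3, 7, 10):
    print(n, "disks ->", len(hanoi(n)), "moves")  # 7, 127, 1023
```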
Apple's findings align with a study by ETH Zurich and INSAIT, in which AI models were evaluated on problems from the 2025 USA Mathematical Olympiad. None of the nearly 200 AI-generated solutions was completely correct. Google's Gemini 2.5 Pro scored 24% of the total available points, not by solving a quarter of the challenges but through partial solutions that earned partial credit.
The studies also revealed unusual errors in which models invented constraints based on quirks in their training data, such as boxing final answers even when contextually irrelevant. Gary Marcus, a longtime critic of AI technologies, called the findings "pretty devastating" and described it as troubling that today's models fail on problems that early AI pioneers solved decades ago.
Sean Goedecke, an AI specialist, argued that the problem is not merely a failure of reasoning but an adaptive response to high complexity: once a puzzle exceeds a model's reasoning capacity, it falls back on shortcut strategies that do not always succeed, shifting from iterative, step-by-step reasoning to searching for a generalized solution.
Despite advances in fine-tuning models for reasoning and in chain-of-thought prompting, the studies suggest these models hit cognitive walls as complexity increases. The findings imply that AI's mimicked logic arises from fluency rather than insight, underscoring the need for combined approaches in AI development, such as pairing large language models with logical verification mechanisms. Until then, the prospect of AI simulating human-level intelligence remains out of reach.
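As a purely hypothetical illustration of what such a logical verification mechanism could look like (not a method described in either study), a rule checker could accept a model-proposed Tower of Hanoi move list only if every move obeys the puzzle's constraints:

```python
# Hypothetical verifier: check that a proposed sequence of Tower of Hanoi moves
# is legal and actually solves the puzzle. A larger disk may never be placed
# on a smaller one, and all disks must end up on the target peg.
def verify_hanoi(n, moves, pegs=("A", "B", "C")):
    state = {p: [] for p in pegs}
    state[pegs[0]] = list(range(n, 0, -1))   # disks n..1, largest at the bottom
    for src, dst in moves:
        if not state[src]:
            return False                      # illegal: moving from an empty peg
        disk = state[src][-1]
        if state[dst] and state[dst][-1] < disk:
            return False                      # illegal: larger disk on a smaller one
        state[dst].append(state[src].pop())
    return state[pegs[2]] == list(range(n, 0, -1))  # solved only if target peg is complete

# The 7-move solution for 3 disks passes; a rule-breaking sequence fails.
good = [("A","C"),("A","B"),("C","B"),("A","C"),("B","A"),("B","C"),("A","C")]
print(verify_hanoi(3, good))                   # True
print(verify_hanoi(3, [("A","B"),("A","B")]))  # False: disk 2 placed on disk 1
```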
Sources: TechSpot, Apple, ETH Zurich