A new, challenging AGI test confounds most AI models
ARC-AGI-2 puzzles AI models, challenging their adaptability and efficiency.

The Arc Prize Foundation, a nonprofit co-founded by the AI researcher François Chollet, announced in a blog post that it has created ARC-AGI-2, a new test designed to evaluate the general intelligence of leading AI models. The test asks models to solve puzzle-like problems that require identifying visual patterns in grids of different-colored squares and producing the correct answer grid. Unlike its predecessor, ARC-AGI-1, which models could attack with extensive computing power ("brute force"), ARC-AGI-2 is designed to reward interpretable, efficient problem-solving strategies rather than sheer compute.
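For readers unfamiliar with the format, the sketch below illustrates what an ARC-style task looks like, assuming the public JSON grid representation used by the original ARC benchmark (grids as lists of lists of color indices 0-9). The "mirror each row" rule here is an invented toy transformation, far simpler than any actual ARC-AGI-2 puzzle:

```python
# Minimal sketch of an ARC-style task, assuming the grid format of the
# original public ARC benchmark: grids are lists of lists of integers
# 0-9, each integer denoting a color. The transformation rule below
# (mirror each row) is a toy example invented for illustration.

toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0]], "output": [[0, 0, 5]]},
    ],
}

def solve(grid):
    # A real solver must infer the rule from the few training pairs
    # alone; here the rule is hard-coded for demonstration.
    return [list(reversed(row)) for row in grid]

for pair in toy_task["train"] + toy_task["test"]:
    assert solve(pair["input"]) == pair["output"]
print("Toy rule fits all pairs.")
```

The point of the benchmark is that each task uses a novel rule, so a solver cannot memorize transformations in advance; it has to generalize from a handful of examples.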
The performance of current AI models on ARC-AGI-2 has been underwhelming: reasoning models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3%, and powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%. Human participants, by contrast, set the baseline with an average score of 60%. The gap highlights how poorly current AI systems adapt to novel problems not covered in their training data.
Chollet has said that ARC-AGI-2 is a better gauge of a model's actual intelligence than ARC-AGI-1, whose results were skewed by models' reliance on raw computational power. He noted that ARC-AGI-2 introduces efficiency as a key metric, forcing models to interpret and process information on the fly rather than lean on rote memorization.
The rollout of ARC-AGI-2 arrives amid calls within the technology industry for fresh benchmarks that measure AI progress on creativity and other capabilities existing tests fail to capture. Thomas Wolf, co-founder of Hugging Face, voiced this sentiment to TechCrunch, emphasizing the need for tests that reflect progress toward artificial general intelligence (AGI). Alongside the new benchmark, the Arc Prize Foundation has launched a contest, Arc Prize 2025, challenging developers to reach 85% accuracy on ARC-AGI-2 while spending no more than $0.42 per task.
Notably, OpenAI's advanced reasoning model o3, which previously outperformed all other AI models on ARC-AGI-1, managed only a 4% score on ARC-AGI-2 despite spending $200 worth of computing power per task. The discrepancy illustrates both how much harder the new test is and why measuring the efficiency of problem-solving matters. The shift ARC-AGI-2 represents is part of a broader push toward more sustainable, efficient, and adaptable AI systems.
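To put the contest's efficiency bar in perspective, here is a back-of-the-envelope comparison using only the per-task figures reported above (a simple arithmetic sketch, not an official Arc Prize calculation):

```python
# Back-of-the-envelope comparison using the per-task costs cited above.
o3_cost_per_task = 200.00   # reported compute spend for o3 on ARC-AGI-2
contest_cost_cap = 0.42     # Arc Prize 2025 spending ceiling per task

ratio = o3_cost_per_task / contest_cost_cap
print(f"A winning entry must be ~{ratio:.0f}x cheaper per task than o3's run")
# Prints ~476x: a winning entry must cut per-task cost by roughly three
# orders of magnitude while scoring 85% rather than o3's 4%.
```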
Sources: TechCrunch, Arc Prize Foundation blog, François Chollet (via X).