These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models
Researchers use NPR's Sunday Puzzle to evaluate AI reasoning models, revealing surprising results and challenges.

A team of researchers from various institutions developed an AI benchmark using NPR's Sunday Puzzle to test reasoning models like OpenAI's o1. The benchmark evaluates models on their ability to solve puzzles without esoteric knowledge, revealing strengths and peculiarities in AI reasoning. Some models, like DeepSeek's R1, occasionally provide incorrect answers and demonstrate frustration-like behavior. Researchers plan to expand testing to improve AI models and make benchmarks accessible for broader evaluation.
In a recent study, researchers from institutions including Wellesley College and the University of Texas at Austin used NPR's Sunday Puzzle to create an AI benchmark. This benchmark, consisting of around 600 puzzles, aims to test reasoning models' problem-solving abilities without relying on specialized or rote knowledge.
Models like OpenAI's o1 and DeepSeek's R1 were tested, with o1 scoring 59% and R1 scoring 35%. Remarkably, these models sometimes knowingly gave incorrect answers, with R1 in particular exhibiting frustration-like behavior when it failed to solve a puzzle.