Apple Study Uncovers AI Reasoning Limits
A groundbreaking study by Apple researchers has revealed deep flaws in today's most advanced AI systems. Despite their sophisticated output, many models run up against hard AI reasoning limits, failing catastrophically as problem complexity increases.
Three Regimes of Reasoning Performance
Apple's study introduces a "three-regime" framework to measure how reasoning models respond to varying levels of task complexity. At low complexity, non-reasoning models surprisingly outperform reasoning models, which tend to overthink simple problems. These reasoning models often continue evaluating alternatives even after identifying the correct answer.
In the medium complexity range, reasoning models excel. Their structured chain-of-thought logic enables stronger performance than standard LLMs. But at high complexity, all models, regardless of type, collapse in performance, with accuracy dropping to near zero.
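A rough way to picture the framework is to compare, at each complexity level, a reasoning model's accuracy against a standard model's. The sketch below does exactly that; the accuracy numbers and the collapse threshold are invented purely for illustration, not figures from the study.

# Sketch: labelling the three regimes described in the study.
# The accuracy numbers below are invented for illustration only.

complexity = [1, 2, 3, 4, 5, 6, 7, 8]          # e.g. number of puzzle steps
standard_acc = [0.95, 0.90, 0.70, 0.45, 0.25, 0.10, 0.02, 0.00]
reasoning_acc = [0.88, 0.85, 0.82, 0.75, 0.60, 0.35, 0.05, 0.00]

def regime(std, rsn, collapse_threshold=0.10):
    """Classify a complexity level into one of the three regimes."""
    if std <= collapse_threshold and rsn <= collapse_threshold:
        return "high: both model types collapse"
    if rsn > std:
        return "medium: reasoning model wins"
    return "low: standard model wins (overthinking penalty)"

for c, s, r in zip(complexity, standard_acc, reasoning_acc):
    print(f"complexity {c}: standard={s:.2f} reasoning={r:.2f} -> {regime(s, r)}")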
Learn more about this phenomenon in New Scientist’s breakdown.
The Giving Up Phenomenon
Perhaps the most startling finding is what Apple calls the “giving up” effect. When facing more difficult tasks, models initially expand their reasoning effort by using more tokens. But right before hitting a complexity barrier, they sharply reduce their thinking tokens—despite having unused computational resources.
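The effect is visible in inference logs: track how many reasoning tokens the model spends at each complexity level and find the point where effort peaks and then shrinks. A minimal sketch, using hypothetical token counts rather than measurements from the study:

# Sketch: locating the "giving up" point in reasoning-token usage.
# Token counts are hypothetical; real values would come from model traces.

token_usage = {1: 800, 2: 1500, 3: 2600, 4: 4200, 5: 5100, 6: 3200, 7: 1900}

def giving_up_point(usage):
    """Return the complexity level after which thinking effort starts to drop."""
    levels = sorted(usage)
    for prev, nxt in zip(levels, levels[1:]):
        if usage[nxt] < usage[prev]:
            return prev  # effort peaked here, then declined
    return None

peak = giving_up_point(token_usage)
print(f"Reasoning effort peaks at complexity {peak}, then shrinks despite budget left.")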
This behavior highlights a fundamental flaw in how these models function. They don’t “think” in a human sense. Instead, they depend on pattern recognition, which falters when input deviates slightly from familiar formats.
The study showed that even small prompt changes degraded performance by up to 65%, according to Simply Mac.
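That brittleness is straightforward to probe: take a problem the model already answers correctly, apply superficial edits that leave the underlying logic untouched, and re-measure accuracy. The sketch below outlines one such harness; ask_model is a hypothetical stub standing in for whatever inference API is actually used, and the example problem is invented.

# Sketch: probing sensitivity to superficial prompt changes.
# ask_model() is a hypothetical stub; replace it with a real inference call.

def ask_model(prompt: str) -> str:
    """Stand-in for a real model API; returns a canned answer so the sketch runs."""
    return "7"

base = "Lena has 12 apples and gives 5 to Tom. How many does she have left?"
variants = [
    base.replace("Lena", "Priya").replace("Tom", "Omar"),   # rename entities
    base + " Two of the apples are green.",                 # add an irrelevant detail
    base.replace("apples", "pears"),                        # swap the object noun
]

expected = "7"
correct = sum(ask_model(v).strip() == expected for v in variants)
print(f"Accuracy on perturbed variants: {correct}/{len(variants)}")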
Controlled Testing Environments
To avoid contamination from widely used datasets like GSM8K or MATH, researchers designed clean test scenarios. These included puzzle-like challenges such as Tower of Hanoi, River Crossing, and Blocks World—each crafted to isolate reasoning complexity without bias.
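Tower of Hanoi illustrates why these puzzles work as a complexity dial: the rules are trivial, but an optimal solution for n disks takes 2^n - 1 moves, so each extra disk roughly doubles the amount of multi-step reasoning a model must sustain. A short sketch of the puzzle and its growth (the solver itself is standard, not code from the study):

# Sketch: Tower of Hanoi as a complexity dial.
# With n disks the shortest solution has 2**n - 1 moves, so difficulty
# grows exponentially while the rules stay trivially simple.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for n disks as (source, destination) pairs."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)
    moves.append((src, dst))
    hanoi(n - 1, aux, src, dst, moves)
    return moves

for n in range(3, 11):
    print(f"{n} disks -> {len(hanoi(n))} moves (2^{n} - 1 = {2**n - 1})")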
The results showed that even “thinking” models like Claude 3.7 Sonnet, Gemini Thinking, and OpenAI’s o1/o3 were unable to maintain accuracy once problem complexity crossed a certain threshold. They performed shallow reasoning and failed to scale logic effectively across multiple steps.
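Evaluating that kind of failure is mechanical: replay each move a model proposes inside a rules-checking simulator and note where the first illegal move appears, so shallow or inconsistent reasoning shows up as a concrete step number. The sketch below does this for Tower of Hanoi, reusing the (source, destination) move format above; it is in the spirit of the study's puzzle-based evaluation rather than its exact harness, and the proposed sequence is invented.

# Sketch: replaying a model's proposed Tower of Hanoi moves in a simulator
# and reporting the first rule violation. The sample sequence is made up.

def first_error(n_disks, moves):
    """Return (step, verdict) for the first violation, or (None, ...) if solved."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for step, (src, dst) in enumerate(moves, start=1):
        if not pegs[src]:
            return step, "moved from an empty peg"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return step, "placed a larger disk on a smaller one"
        pegs[dst].append(pegs[src].pop())
    if pegs["C"] != list(range(n_disks, 0, -1)):
        return len(moves), "ran out of moves before solving the puzzle"
    return None, "solved correctly"

proposed = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"), ("B", "C")]  # invented
step, verdict = first_error(3, proposed)
print(f"step {step}: {verdict}" if step else verdict)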
Implications for AGI Development
These findings cast doubt on the notion that today’s AI models are on a clear path toward Artificial General Intelligence (AGI). Far from demonstrating true cognitive abilities, they appear to be highly advanced pattern matchers.
The Council on Foreign Relations raises similar concerns about AI reasoning failures in this report.
With Apple set to showcase the next wave of its Apple Intelligence platform at WWDC 2025, the timing of this study couldn't be more critical. It not only informs public understanding but also signals a need to rethink how we define and measure progress in AI.