Pfizer Challenges Apple Study on AI Reasoning Limitations: Unpacking the Complexity Behind Language Model Performance

A new commentary from researchers at Pfizer challenges the fundamental conclusions of the «Illusion of Thinking» study, co-authored by scientists from Apple.

In the Apple paper, the sudden drop in performance is taken as evidence of a fundamental limit to machine reasoning. Other studies have reported similar drops but do not frame them as a hard limitation.

The Pfizer team disagrees with Apple’s interpretation. They argue that the decline in performance is due not to cognitive barriers but to artificial testing conditions. When models are constrained to operate solely in a text-based environment—without tools like programming interfaces—complex tasks become significantly harder than necessary. What looks like a reasoning problem is actually a matter of execution.

In the original study, models such as Claude 3.7 Sonnet-Thinking and DeepSeek-R1 were evaluated on text-based puzzles like the «Tower of Hanoi» and «River Crossing.» As the puzzles grew more complex, the models’ accuracy dropped sharply, a phenomenon the study calls a «breakdown of reasoning.»

The Pfizer team points to unrealistic testing constraints: the models were not allowed to use external tools and had to track everything in plain text. This setup did not expose reasoning errors so much as it made it nearly impossible for the models to execute long, exact sequences of problem-solving steps.

As an example, Pfizer researchers examined the o4-mini model, which labeled a solvable «River Crossing» puzzle as unsolvable, likely because it could not recall previous steps. This memory limitation is a well-documented issue with modern language models, also noted in Apple’s study.

Pfizer describes this as «learned helplessness,» where «if an LRM cannot flawlessly execute a long sequence of actions, it may mistakenly conclude that the task is unachievable.»
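To illustrate how cheaply that judgment could be checked with code execution, here is a minimal breadth-first search over the classic missionaries-and-cannibals formulation of «River Crossing» (an assumption made for illustration; the exact instances and boat capacities in the study may differ). It confirms solvability, and returns a plan, in milliseconds:

```python
from collections import deque

def solve_river_crossing(m=3, c=3, boat=2):
    """BFS over (missionaries_left, cannibals_left, boat_on_left) states.
    Returns the shortest list of crossings, or None if the instance is unsolvable."""
    def valid(ml, cl):
        mr, cr = m - ml, c - cl
        if not (0 <= ml <= m and 0 <= cl <= c):
            return False
        # Cannibals may never outnumber missionaries on a bank that has missionaries.
        if ml and cl > ml:
            return False
        if mr and cr > mr:
            return False
        return True

    start, goal = (m, c, True), (0, 0, False)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (ml, cl, left), path = queue.popleft()
        if (ml, cl, left) == goal:
            return path  # sequence of (missionaries, cannibals) moved on each trip
        direction = -1 if left else 1
        for dm in range(boat + 1):
            for dc in range(boat + 1 - dm):
                if dm + dc == 0:          # the boat needs at least one rower
                    continue
                nml, ncl = ml + direction * dm, cl + direction * dc
                if valid(nml, ncl):
                    state = (nml, ncl, not left)
                    if state not in seen:
                        seen.add(state)
                        queue.append((state, path + [(dm, dc)]))
    return None  # search space exhausted: no solution exists

print(len(solve_river_crossing()))  # 11 crossings for the classic 3-and-3 instance
```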

Moreover, Apple’s study did not account for «cumulative error.» In tasks with thousands of steps, the likelihood of flawless execution diminishes with each step. Even if a model is 99.99% accurate at each stage, the probability of solving a complex «Tower of Hanoi» puzzle without error falls below 45%. Thus, the observed performance drop may simply reflect statistical reality rather than cognitive limitations.
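A quick back-of-the-envelope sketch makes that arithmetic concrete (assuming, as a simplification, that per-move errors are independent; the 0.9999 figure is the per-move accuracy from the example above):

```python
# Cumulative-error arithmetic for the Tower of Hanoi.
# The minimum number of moves for n disks is 2**n - 1.
per_move_accuracy = 0.9999

for disks in (10, 13, 15, 20):
    moves = 2**disks - 1
    p_flawless = per_move_accuracy**moves
    print(f"{disks:2d} disks -> {moves:>9,} moves, P(no error) = {p_flawless:.2%}")

# At 13 disks (8,191 moves) the chance of a flawless transcript already falls
# below 45%, even though the per-move error rate is only 0.01%.
```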

The Pfizer team retested GPT-4o and o4-mini, this time with access to a Python tool. Both models solved the simpler puzzles with ease, but their approaches diverged as task complexity increased.

GPT-4o employed Python to implement a logical yet incorrect strategy and failed to recognize the error. In contrast, o4-mini detected its initial mistake, analyzed it, and switched to the correct approach, ultimately leading to a successful resolution.
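The contrast is easy to picture: with code execution, the full move sequence can be generated programmatically rather than written out token by token. Below is a sketch of the kind of helper a tool-equipped model might produce (the function is illustrative, not taken from the study):

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Recursively generate the optimal Tower of Hanoi move list for n disks."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((source, target))
    else:
        hanoi(n - 1, source, spare, target, moves)   # park n-1 disks on the spare peg
        moves.append((source, target))               # move the largest disk
        hanoi(n - 1, spare, target, source, moves)   # re-stack n-1 disks on top of it
    return moves

solution = hanoi(10)
print(len(solution))   # 1023 moves (2**10 - 1), produced without manual bookkeeping
print(solution[:3])    # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```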

Researchers link this behavior to classical concepts in cognitive science. GPT-4o operates akin to Daniel Kahneman’s «System 1»—fast and intuitive, but likely to adhere to poor plans. Conversely, o4-mini exemplifies «System 2» thinking: slower, analytical, and capable of revisiting its strategy after recognizing a mistake. Such metacognitive adjustment is considered typical of conscious problem-solving.

The Pfizer team asserts that future LRM tests should examine models both with and without tools. Tool-less tests highlight the limitations of language interfaces, while tests with tools demonstrate what models can achieve as agents. They also advocate for the design of assessments that evaluate metacognitive abilities such as error detection and strategic adjustment.

These results bear significance for safety as well. AI models that blindly follow faulty plans without correction could pose risks, while models capable of reassessing their strategies are likely to be more reliable.

The original «Illusion of Thinking» study by Shojaee et al. (2025) sparked widespread discussion about the true capabilities of large language models. Pfizer’s analysis builds on those findings but argues the picture is more nuanced than a simple claim that machines cannot reason.

[Source](https://the-decoder.com/researchers-push-back-on-apple-study-lrms-can-handle-complex-tasks-with-the-right-tools/)