The Great AI Reasoning Show: Why Your 'Thinking' Chatbot Might Just Be Really Good at Pretending

2025-06-08

Or: How Large Reasoning Models Learned to Look Smart While Basically Playing Very Expensive Guess-and-Check

Note: This article is based on the research paper “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar from Apple (2025).

Picture this: You ask your AI assistant to solve a complex math problem, and instead of immediately spitting out an answer like the old days, it pauses dramatically and shows you its “thinking.” You watch, mesmerized, as thousands of words unfold before your eyes. The AI reconsiders, backtracks, has “aha moments,” and eventually arrives at the correct answer. Surely, you think, this machine is actually reasoning—maybe even better than you do.

Well, I hate to break it to you, but you might have just witnessed the most sophisticated magic trick in the history of computing.

Welcome to the Age of Large Reasoning Models

The latest generation of AI models—called Large Reasoning Models, or LRMs for short—has learned a new party trick. Instead of immediately answering your questions, these models show you their “work.” Models like OpenAI’s o1, DeepSeek-R1, and Claude’s thinking variants generate thousands of tokens of internal deliberation before producing their final response. It’s like having a really verbose student who writes down every single thought while solving a problem, complete with self-corrections and moments of clarity.

These models represent what many consider a breakthrough in artificial intelligence. After all, showing your work is what we teach humans to do, right? The ability to reason step-by-step, catch your own mistakes, and think through complex problems is supposedly what separates intelligence from mere pattern matching.

But here’s where things get interesting—and a little bit ridiculous.

How These “Thinking” Machines Actually Work

To understand what’s really happening inside these models, imagine you’re training a very sophisticated parrot. This parrot has read practically everything humans have ever written about reasoning, mathematics, and problem-solving. Now, instead of teaching it specific tricks, you create a simple game: every time the parrot says something that leads to a correct answer, you give it a treat. Every time it doesn’t, you withhold the treat.

This is essentially what happens with Large Reasoning Models, except instead of treats, we use mathematical rewards, and instead of a parrot, we have a neural network with hundreds of billions of parameters.

The training process works like this: The model is given thousands of problems and learns that when it generates lots of text that looks like reasoning—complete with phrases like “wait, let me reconsider this” and “actually, I think I made an error”—it’s more likely to stumble upon the correct answer. Over time, it becomes exceptionally good at generating this reasoning-flavored text.

The magic happens through something called reinforcement learning, which is basically a fancy way of saying “learn by trial and error with rewards.” The model doesn’t start knowing how to reason. Instead, it discovers that certain patterns of text generation are more likely to lead to rewards. If generating more text increases the chances of getting the right answer, well, the model learns to generate more text. If self-correction helps avoid obvious mistakes, the model learns to self-correct. If having “aha moments” makes humans think it’s smart, the model learns to have aha moments.
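
To make that loop concrete, here is a deliberately silly toy version of it in Python, nothing like a real training setup: a “policy” over three canned text-generation styles, each with a made-up probability of landing on the right answer, updated by nothing more than “reward whatever worked.” Real systems apply policy-gradient methods to token probabilities in a huge neural network, but the shape of the feedback is the same.

```python
import random

# Toy illustration, not a real LRM: a "policy" over three canned generation
# styles, each with a different (made-up) chance of producing the right answer.
# The update is a bare-bones "reward what worked" nudge; nothing in the loop
# encodes what reasoning is, only which style got treats more often.
STYLES = {
    "answer immediately": 0.30,          # assumed P(correct) when using this style
    "write long reasoning": 0.55,
    "reason, then self-correct": 0.70,
}

weights = {style: 1.0 for style in STYLES}   # initial, uninformed preferences
learning_rate = 0.1

def sample_style(weights):
    """Pick a style with probability proportional to its current weight."""
    r = random.uniform(0, sum(weights.values()))
    for style, w in weights.items():
        r -= w
        if r <= 0:
            return style
    return style   # floating-point fallback

for _ in range(5000):
    style = sample_style(weights)
    reward = 1.0 if random.random() < STYLES[style] else 0.0
    weights[style] += learning_rate * reward   # reinforce whatever got rewarded

total = sum(weights.values())
for style, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{style:28s} preference ~ {w / total:.2f}")
```

Run it and the “reason, then self-correct” style typically ends up with the biggest share of the preference, even though nothing in the loop ever defined what reasoning is.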

The really clever part is that nobody explicitly programmed these behaviors. The model developed them because they were useful for getting rewards. This is what researchers excitedly call “emergent behavior”—sophisticated strategies that arise naturally from simple rules.

But here’s the crucial point that often gets overlooked: these models are fundamentally shaped by algorithms imposed from the outside. The neural network itself has no intrinsic motivation to reason or think—it’s simply responding to external optimization pressures. Stochastic Gradient Descent (SGD) and reinforcement learning algorithms externally guide the model toward behaviors that maximize reward. The model doesn’t “want” to learn or improve in any meaningful sense; it’s being navigated through parameter space by algorithmic forces entirely separate from the model itself.

Think of it like a marble rolling down a carefully constructed landscape. The marble doesn’t choose its path—the landscape’s topology determines where it goes. Similarly, these models don’t develop reasoning capabilities through any internal drive or understanding. They’re being pushed and pulled by external optimization algorithms that reward certain text generation patterns over others.
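
If you want the marble picture in code, plain gradient descent on a one-variable toy loss is enough to show it. This is a generic textbook example, not anything from the paper or from how LRMs are actually trained; it just shows the steering coming entirely from the slope.

```python
# Plain gradient descent on a toy loss L(w) = (w - 3)^2.
# The update rule w <- w - lr * dL/dw is the landscape doing the steering:
# w ends up near the minimum because the slope pushes it there, not because
# anything in this loop "wants" to get there.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = -10.0   # arbitrary starting point: the marble
lr = 0.1    # step size

for _ in range(50):
    w -= lr * grad(w)

print(f"final w = {w:.4f}, loss = {loss(w):.6f}")   # w converges to ~3.0
```

Scale the same mindless downhill walk up to hundreds of billions of parameters and a loss defined over text, and you have the forces shaping an LRM.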

This creates what might be the deepest philosophical challenge of all: if the learning process itself is driven entirely by external forces rather than any internal motivation or understanding, can we really say the model has learned to “reason” in any meaningful sense? Or has it simply been sculpted by external algorithms into a shape that produces reasoning-like outputs?

But here’s the thing about emergence: just because something looks impressive doesn’t mean it’s what you think it is.

The Brute Force Behind the Curtain

When you watch one of these models “thinking,” what you’re really seeing is brute force exploration dressed up in the language of reasoning. The model has learned that if it throws enough computational spaghetti at the wall, something will eventually stick. It generates multiple approaches, tries different angles, and keeps going until it hits upon something that works.

Think of it like a really persistent student who doesn’t quite understand the material but has figured out that if they write enough words and try enough different approaches, they’ll eventually stumble across the right answer. Except this student writes incredibly fast and has perfect recall of every math textbook ever written.

The model hasn’t learned to reason in the way humans do. Instead, it has learned to navigate through the space of possible text sequences in a way that maximizes its chances of producing text that ends with a correct answer. It’s incredibly sophisticated pattern matching with a search algorithm on top.
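
Here’s a caricature of that search process: a guess-and-check loop that keeps proposing candidates until one passes a test. The little equation, the random proposer, and the checker are all stand-ins I made up for illustration; the point is that nothing in the loop needs to understand the problem, it only has to generate candidates and recognize a hit.

```python
import random

# Guess-and-check dressed up as "problem solving": keep proposing candidate
# answers until one passes the check. The "problem" is finding an x with
# x^2 + x == 156, and the "proposal" step is literally random guessing.
def propose_solution():
    return random.randint(-20, 20)

def looks_correct(x):
    return x * x + x == 156      # true for x = 12 (and x = -13)

def solve(max_attempts=10_000):
    for attempt in range(1, max_attempts + 1):
        candidate = propose_solution()
        if looks_correct(candidate):
            return candidate, attempt
    return None, max_attempts

answer, attempts = solve()
print(f"answer = {answer}, found after {attempts} guesses")
```

Swap the random proposer for a model trained on most of the internet and the crude equality test for learned hunches about what a correct answer looks like, and the same loop starts to look a lot less silly and a lot more like “thinking.”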

This isn’t necessarily a bad thing. Brute force can be remarkably effective, especially when you have virtually unlimited computational power and access to all of human knowledge. But it does raise some interesting questions about what we mean when we say an AI is “reasoning.”

The Three Fatal Flaws

Recent research has identified three fundamental problems with these Large Reasoning Models that reveal the limits of this approach. These flaws are particularly embarrassing because they show up in surprisingly simple situations.

Flaw #1: The Complexity Collapse

Here’s where things get really interesting. Researchers discovered that these models handle simple problems, come into their own on moderately complex problems, but then completely fall apart when things get truly challenging. It’s not a gradual decline—it’s a cliff.

Imagine a student who can handle basic algebra, excels at intermediate problems, but then becomes completely unable to solve anything once you add just a few more variables. That’s exactly what happens with these models. They hit a complexity threshold and their performance doesn’t just decrease—it collapses to zero.

Even more bizarrely, as problems get harder and approach this complexity cliff, the models actually start “thinking” less, not more. You’d expect that more difficult problems would require more deliberation, but these models do the opposite. It’s as if our hypothetical student started giving shorter and shorter answers as the problems got harder, eventually just shrugging and walking away.

This suggests that these models aren’t really scaling their reasoning power with problem complexity the way you’d expect from genuine reasoning systems. Instead, they seem to have learned a bag of tricks that work up to a certain point, and beyond that point, they simply don’t know what to do.
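
To see how you even measure a cliff like this, here is a minimal move checker for Tower of Hanoi, one of the puzzle environments used in the paper, written in the spirit of the puzzle simulators that grade model outputs move by move (my own sketch, not the authors’ code). Complexity is dialed up simply by adding disks, since the shortest solution needs 2^n - 1 moves, and a solution only counts if every single move is legal.

```python
# A move-by-move checker for Tower of Hanoi: a solution is accepted only if
# every move is legal and all disks end up on the target peg. Complexity is
# dialed up by adding disks; the optimal solution needs 2**n - 1 moves.
def check_hanoi_solution(n_disks, moves, source=0, target=2):
    pegs = [[], [], []]
    pegs[source] = list(range(n_disks, 0, -1))   # top of the stack = smallest disk
    for i, (frm, to) in enumerate(moves):
        if not pegs[frm]:
            return False, f"move {i}: peg {frm} is empty"
        disk = pegs[frm][-1]
        if pegs[to] and pegs[to][-1] < disk:
            return False, f"move {i}: disk {disk} placed on smaller disk {pegs[to][-1]}"
        pegs[to].append(pegs[frm].pop())
    if len(pegs[target]) == n_disks:
        return True, f"solved in {len(moves)} moves (optimum is {2**n_disks - 1})"
    return False, "legal moves, but the disks did not all reach the target peg"

# The optimal 3-disk solution (7 moves) passes; a 10-disk instance would need
# 1023 flawless moves in a row.
moves_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(check_hanoi_solution(3, moves_3))
```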

Flaw #2: The Algorithm Execution Failure

This flaw is perhaps the most revealing of all. Researchers tried something that should have been a softball for any reasoning system: they gave the models explicit, step-by-step algorithms for solving problems. All the models had to do was follow the instructions.

In human terms, this would be like giving someone a recipe and asking them to bake a cake. You’re not asking them to invent cooking—just follow the directions.

The models failed spectacularly.

Even when provided with complete, correct algorithms, these supposed reasoning machines couldn’t reliably execute the logical steps. They would fail at roughly the same points where they failed when trying to solve the problems from scratch. This is particularly damning because following an algorithm should require much less “reasoning” than deriving a solution independently.
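
For a sense of what “being handed the recipe” looks like, here is the standard recursive procedure for Tower of Hanoi, the kind of complete, explicit algorithm involved in this experiment; this is the textbook version, not the paper’s exact prompt. Executing it takes no insight at all, just faithfully unwinding the recursion and writing down the moves.

```python
# The standard recursive Tower of Hanoi procedure: a complete, explicit recipe.
# Executing it requires no insight, only faithfully unwinding the recursion
# and recording each move.
def hanoi(n, source, target, spare, moves):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # park the n-1 smaller disks
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # stack the smaller disks back on top

moves = []
hanoi(3, 0, 2, 1, moves)
print(moves)   # 7 moves: [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
```

(Its output, reassuringly, passes the move checker sketched earlier.)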

This suggests that these models aren’t actually manipulating logical concepts or following chains of reasoning. Instead, they’re generating text that resembles reasoning but lacks the underlying logical structure that would allow them to systematically work through a problem.

It’s like discovering that your apparently brilliant student can write beautiful essays about mathematics but can’t actually do arithmetic when you give them a calculator and explicit instructions.

Flaw #3: The Inconsistency Problem

The third flaw reveals perhaps the most human-like limitation of these models: they’re incredibly inconsistent in ways that don’t make logical sense.

Researchers found that the same model might correctly execute over 100 sequential logical steps in one type of problem but fail after just 4 steps in a different but equally complex problem. This isn’t about one problem being inherently harder than the other—it’s about the model having learned different patterns for different types of problems.

The explanation appears to be embarrassingly simple: training data familiarity. The model performs well on problems similar to ones it has seen many times during training and poorly on problems that were rare in its training data. This suggests that what looks like reasoning is actually sophisticated memorization and pattern matching.

It’s as if you discovered that your brilliant student could solve incredibly complex physics problems but couldn’t figure out how to make change for a dollar because they’d never practiced that specific type of problem before.

The Performance Theater of Artificial Intelligence

What we’re witnessing with Large Reasoning Models is essentially performance theater. These models have become incredibly sophisticated at generating text that looks like reasoning, sounds like reasoning, and often even produces the same results as reasoning, but lacks the underlying logical structure that defines genuine reasoning.

This isn’t necessarily a criticism of the technology. The results can be genuinely useful, and the engineering achievement is remarkable. But it does suggest that we should be more careful about the claims we make regarding these systems’ capabilities.

When a model generates thousands of words of “thinking” and arrives at a correct answer, what has really happened is that it has successfully navigated through a vast space of possible text sequences using patterns it learned during training. It has become exceptionally good at generating reasoning-flavored text that tends to lead to correct conclusions.

The “aha moments,” the self-corrections, and the careful deliberations are all learned behaviors that the model discovered were useful for achieving its training objectives. They’re not genuine moments of insight but rather strategic text generation patterns that increase the probability of success.

What This Means for the Future

Understanding these limitations doesn’t mean Large Reasoning Models are useless—quite the contrary. They represent a significant advance in AI capabilities and can be genuinely helpful for many tasks. But recognizing what they actually are, rather than what they appear to be, is crucial for understanding their proper applications and limitations.

These models excel at pattern matching across vast amounts of information, exploring solution spaces through text generation, and mimicking the structure of human reasoning. They’re powerful tools for augmenting human intelligence, particularly in domains where their training data is rich and the problems fall within their learned patterns.

However, they’re not actually reasoning in the way humans do, and they can’t be relied upon to handle novel situations that fall outside their training distribution. They’re sophisticated text generators that have learned useful strategies for producing helpful outputs, not thinking machines that can genuinely understand and manipulate logical concepts.

Perhaps most importantly, these findings suggest that the path to artificial general intelligence might be more complex than simply scaling up current approaches. While these models represent impressive engineering achievements, the three flaws described above suggest that we may need fundamentally different approaches to build artificial minds that genuinely reason.

In the meantime, we can appreciate these models for what they are: remarkably sophisticated pattern matchers that have learned to put on an impressive show. Just don’t be too surprised if the magician occasionally reveals how the trick is done.

The real question isn’t whether these models are genuinely reasoning—it’s whether that even matters as long as they’re useful. But that’s a question for philosophers and ethicists to ponder while the rest of us figure out how to make the most of these fascinating, flawed, and surprisingly effective artificial minds.

P.S.

This text was written mainly by Claude—the same type of “reasoning” AI we’ve been discussing. It helped me distill the essence of the research paper and put the text into polished English, filling in the gaps after an hour of back-and-forth brainstorming about the topic.

The irony isn’t lost on me that I used an AI to critique AI reasoning capabilities. But perhaps this collaboration itself illustrates the point: Claude excelled at pattern matching across vast amounts of information, organizing complex ideas into readable prose, and mimicking the structure of analytical writing. What it provided wasn’t genuine reasoning about AI limitations, but rather sophisticated text generation that helped me articulate my own understanding of the research.

The real thinking—the interpretation of what these findings mean, the connections to broader questions about intelligence, and the critical analysis of the implications—came from our human conversation. Claude was an excellent writing partner, but the reasoning about reasoning? That was distinctly human.

Which might just be the most important lesson of all: these AI systems are powerful tools for augmenting human intelligence, not replacing human thought. At least not yet.