
Is Apple the Only One Getting AI Right!?

You’re about to see what Apple gets right while everyone else is busy being disappointed.

Another year, another WWDC disappointment.

The tech media is having déjà vu.

Just like in 2024, journalists left Apple's developer conference underwhelmed, fixated on flashy UI updates while missing the deeper story. This time, the media is obsessing over the "Liquid Glass" redesign and the lack of wow factor in the AI announcements.

Many of my designer connections criticized Apple for prioritizing "vibes" over functionality. One of them wrote:

iOS 26 is visually stunning. And functionally worse than ever... failed in real-world use.

I feel for Steve Jobs.

If he were still alive, the only redesign by this point would be the iPhone UI team's org chart.

Then, of course, the markets soon reacted. One analyst put it this way:

the WWDC keynote is "a dud," especially when compared with other recent software developer conferences, such as Alphabet's (GOOGL) Google I/O. — Investors.com

Apparently, the majority of analysts have the same takeaway.

Apple's stock dropped 2.5% after the conference, down 20% year-to-date. Investors seem convinced Apple is falling behind in the AI race.

Against all this disappointment, I am going to explain why:

Apple is one of the few tech giants getting AI right.

TL;DR

  • Apple's and Salesforce's latest studies showed that even the most advanced models fail basic logic puzzles and can't follow simple instructions. For the time being, AI is just sophisticated pattern matching: no thinking, no reasoning.

  • AI can only do tasks accurately up to a certain complexity. The performance drops to zero if you go beyond this point.

  • Running AI directly on devices is much less costly for developers (no token fees) and eliminates most privacy risks. It's also where AI can reliably handle small tasks, leveraging what it does best: pattern recognition and content generation.

  • Apple built an uncopyable competitive advantage by accepting AI's real limitations, while most others remain trapped by their cloud infrastructure and per-token billing models.

This is for you, who don't trust the hype but still want to understand what's real. My promise to you: no hype, no BS, just evidence.

Shall we?


Agentic AI = A Trillion Dollar Marketing Exercise

The tech world and media are obsessed with agentic AI.

In theory, these are supposed to be systems that can think, plan, and act all by themselves, without anyone telling them what to do.

You hear big claims.

In 2025, Sam Altman, OpenAI's CEO, said AI agents will 'join the workforce' this year; Jensen Huang, Nvidia's CEO, calls them a 'multi-trillion-dollar opportunity.' Every new tech startup trying to raise money from investors promises 'autonomous AI employees' that will completely change how businesses work.

It could very well be true. Maybe they have been hiding the AI capability so well that the true innovation just hasn't been released yet, or… this is all a sweet dream.

When something sounds too good to be true, it usually is.

As I've discussed, today's agents are more like fancy chatbots that follow a script. They're not truly independent thinkers. Some context:

But the problem goes even deeper than just over-promising. The whole idea behind it is flawed. Those tech companies claim their Agentic AI can:

  • Figure out really complicated problems with many steps.

  • Learn and adapt when they face new situations.

  • Work as digital employees who don't need much help or supervision.

Let me walk you through the evidence Apple recently published on why this may be wishful thinking.

Flawed Benchmarks → The Illusion Of Thinking

Apple researchers just released a bombshell study: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.

It’s a mouthful, I know, but the finding is beautifully simple and profoundly important.

Before we start, just to make sure you know:

Large Language Models (LLMs) are AI systems trained on massive amounts of text… which most of you already knew. They're great at generating writing, summaries, and stories.

Large Reasoning Models (LRMs) are LLMs with an added feature. They "think" step-by-step before answering, writing out their reasoning process. This Chain-of-Thought (CoT) approach is supposed to improve their logic and problem-solving.
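To make that distinction concrete, here is a minimal sketch in Python (my own illustration, not anything from the paper) of the only real difference between the two: a reasoning model is prompted, or trained, to write out intermediate steps before the final answer, and those extra "thinking" tokens are exactly where the cost goes.

```python
# Minimal sketch: the same question, prompted LLM-style vs. LRM/CoT-style.
# Illustrative only; plug these strings into whatever model API you use.

QUESTION = "A train leaves at 3 pm travelling 60 km/h. How far has it gone by 5 pm?"

# Standard LLM-style prompt: ask for the answer directly.
direct_prompt = f"{QUESTION}\nAnswer with a single number."

# Chain-of-Thought (LRM-style) prompt: ask the model to show its reasoning
# first. The extra "thinking" tokens are where both the accuracy gains and
# the much larger token bills come from.
cot_prompt = (
    f"{QUESTION}\n"
    "Think step by step. Write out your reasoning, then give the final "
    "answer on a new line starting with 'Answer:'."
)

print(direct_prompt)
print("---")
print(cot_prompt)
```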

This study aims to answer:

  • Can these models solve problems?

  • If yes, how do they approach them?

  • And more importantly, where do their limits truly lie?

And indeed, these LRMs showed some improvements on various benchmarks, particularly in areas like math and coding.

Many now believe we’re on the cusp of truly general artificial intelligence, machines that can reason and solve problems much like humans do.

But!

How do you tell if these models are truly “thinking” in a human sense, or are they just incredibly sophisticated pattern-matching machines? And how well do they perform when faced with problems that push their limits?

While other big tech companies like OpenAI, Google, and Microsoft are spending tons of money chasing after super-smart AI that can do everything…

Apple is one of the few commercial companies that have finally asked a simple question that many researchers have been asking for years (e.g., Gary Marcus or Melanie Mitchell): Can these reasoning AI models actually reason?

You may ask… Why does this matter?

Think about when you were studying. You wanted to improve your score, but first you needed to know where you went wrong, correct? Equally, if we're going to improve AI, we need to understand its true nature.

Turns out… these AI models don't just fail to truly reason; they're really good at making it look like they are.

The Great Benchmark Deception

A dirty little (open) secret in the AI world: most of the AI benchmarks are intrinsically flawed and can be manipulated.

Here’s how the benchmarks are used:

  1. The researchers give them problems from big, important tests like MATH-500 for math, AIME for advanced math, HumanEval for coding, and MMLU for general knowledge. (No point remembering these, so don't worry.)

  2. A super-advanced AI like GPT-4 or Claude 4 Opus scores 90% on these tests.

  3. Then the Altmans and Amodeis of these tech companies cheer and tell everyone how smart their AI is.

  4. News headlines brag about the marginal improvement, and the value of these companies goes way up.

But here's the obvious problem that most seem to ignore (or are blind to): these tests, and their answers (or something close enough), are already all over the internet.

The Contamination Crisis

You were a student once.

Your teacher gives you an exam. (Yes, I know most of you don't like to study. Me neither; I spent most of my student years on Dota or World of Warcraft.) Before you even start, you find out that all the questions, and even the correct answers, have been posted online for years.

AND assume you have the ability to remember everything you have seen.

Now, you've scored 10/10 on that test. But is this because you are really good at memorizing, or do you actually understand the theory? (Yes, there's a difference between memorizing and understanding.)

This study found that AI models did worse on a newer test, AIME25 (from 2025), than on an older one, AIME24 (from 2024). What’s more interesting is that human test-takers found AIME25 easier than AIME24 (the opposite of AI).

The likely explanation: AIME24's questions and answers had an extra year to circulate online and end up in the AI's training data.

The Three Stages of How AI Collapses

Instead of using potentially contaminated math or coding tests, Apple created controlled puzzle environments. The tests are designed to be an honest system, so that:

  1. These puzzles aren't floating around in the huge datasets that AIs are trained on. So, the AI can't just memorize the answers.

  2. You can easily make them harder or easier. For example, just adding one more disk to the Tower of Hanoi makes it measurably more difficult.

  3. Either the AI solves the puzzle correctly, or it doesn't. There's no in-between.

  4. And most importantly, to solve these, the AI actually has to use logical reasoning, not just remember patterns.

Apple tested AI using four puzzles: Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World, where the complexity scales with the number of elements, i.e., disks, checkers, people, or blocks.

Figure 3: Illustration of the four puzzle environments.
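A quick illustration of why "one more disk" is such a clean difficulty dial (my own sketch, not code from the paper): the optimal Tower of Hanoi solution takes 2^n - 1 moves, so every extra disk doubles the amount of work a model has to get exactly right.

```python
def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the optimal move sequence for an n-disk Tower of Hanoi."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # park n-1 disks on the spare peg
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # stack the n-1 disks back on top
    )

for n in range(3, 11):
    moves = hanoi_moves(n)
    assert len(moves) == 2**n - 1   # minimal solution length: 2^n - 1
    print(f"{n} disks -> {len(moves)} moves")
```

Three disks take 7 moves; ten disks already take 1,023. A model that merely pattern-matches the short solutions it has seen runs out of road very quickly.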

They then found three clear patterns as the puzzles got more and more complex.

Stage 1: Simple Problems (The Overthinking Trap)

For easy puzzles, something strange happened… the regular AI models actually did better than the fancy reasoning models.

Why!?

It seems like the reasoning models fell into something called the overthinking phenomenon. The thinking models wrote thousands of words, exploring paths that weren't needed, second-guessing correct answers, and making things complicated when they should be simple.

Here's an example:

For a 3-disk Tower of Hanoi puzzle, which you can solve in just seven moves, Claude-3.7's thinking mode uses 10x more tokens than the regular version, while achieving the same accuracy.

It's like driving a rocket ship to the corner store.

The technology works, but it's totally unnecessary and makes things harder than they need to be.

Stage 2: Medium Complexity (The Sweet Spot?)

In medium-complexity problems, reasoning models beat regular AI… In a weird way.

All reasoning models tend to explore wrong answers first, then eventually find the right one. It's the only place where they look impressive.

However, these thinking models use 10-50x more computing power (15,000-20,000 tokens vs. 1,000-5,000). Imagine paying $500 instead of $50 for a hamburger that tastes 10% better.

When most AI companies demo their agentic models on stage, those models are performing medium-complexity tasks. Because such tasks are…

  • Complex enough to impress audiences

  • Simple enough to avoid total failure

  • Able to hide both the wasteful overthinking AND the complete breakdown

Even in their sweet spot, reasoning models only delay the inevitable collapse. They hit the same wall as regular AI, just slightly later.

The medium complexity range isn't really a breakthrough.

Rather, a narrow window where inefficiency looks like intelligence.

Stage 3: High Complexity (Where AI Gives Up)

As the problems get more and more complex, these 'reasoning' models did something strangely human: they stopped trying.

The research demonstrates that all frontier reasoning models, o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet (thinking), exhibit the same fundamental pattern: give up once the problems are too complex.

The models’ accuracy drastically declines as problem complexity increases until reaching complete collapse with zero accuracy. And the accuracy remains zero regardless of further increases in problem difficulty.

In addition, they found that AI models actually think less when problems get harder.

At first, they use more computing power as problems get more complex. But right before they fail entirely, they surprisingly give up and reduce their effort, which explains why their accuracy drops so sharply.

Even though they had plenty of computing power left, they often used less than 20% of what they were allowed.

For no good reason, the thinking models stopped generating more 'reasoning' words. It was like the models somehow 'knew' they couldn't solve the problem and decided to save their energy.

As the researchers put it:

despite operating well below their generation length limits with ample inference budget available, these models fail … as problems become more complex. This behavior suggests a fundamental scaling limitation in the thinking capabilities of current reasoning models…

In plain English: the models hit a scaling wall in their thinking ability as problems grow more complex.

Maybe this is why we often feel the models are dumbed down.

This breakdown occurs across multiple dimensions simultaneously:

  1. Computational Resource Misallocation. Models show a two-phase failure: first, working harder for worse results (more tokens, declining accuracy), then giving up entirely (both effort and performance crash). It's like a student cramming all night, first writing more while understanding less, then completely burning out.

  2. Examining the models' work shows they literally cannot produce viable attempts beyond a certain complexity. Imagine a chef who can't even start when the recipe has too many steps.

  3. The AI models couldn't follow simple instructions. Researchers gave them the exact steps to solve puzzles, like giving someone a recipe, but they still failed when the puzzles got complex. This proves AI's problem is that it can't reliably follow logical steps.

  4. This collapse pattern is NOT limited to specific model architectures or training approaches. All frontier reasoning models exhibit fundamentally similar collapse behaviors despite different training methodologies, architectural choices, computational resources, and so on.

This suggests that current approaches to reasoning in LLMs are encountering fundamental barriers.

The fact that models literally reduce their reasoning effort when facing harder problems is a breakdown in the very mechanisms that are supposed to enable sophisticated reasoning.

Simple tasks, they overthink; medium tasks, they look smart; and complex tasks, they fail completely.

These AI reasoning models, the ones that tech CEOs claim will change how we work, take over jobs, and create trillions of dollars in value…

The experiments told you a completely different story:

AI reasoning models aren't truly reasoning.

They're just very good at finding patterns. And when those patterns aren't in the huge amount of data they were trained on, they're completely helpless.

If these AIs can't reliably follow clear instructions, how can we possibly trust them with things like surgery, managing our money, self-driving cars, etc.?

We can't.

We both know you'll be back next week for more insights. Stop pretending this is casual 👇


Why Is Apple's "Boring" Strategy Actually Smart?

Now, with all the results from this study in mind…

Apple's 2025 AI approach suddenly looks like the only sensible move, rather than an overly careful or timid one.

Rather than chasing after the impossible dreams of a super AI, which fall apart in many simple use cases, Apple asked some very different, very practical questions:

  1. What can AI actually do reliably and effectively in the next few years?

  2. What value can we provide to users?

  3. Is there a realistic, sustainable business at the intersection of points 1 and 2?

Only Focus On What AI Does Well.

If AI models are going to fail on complex reasoning tasks anyway, why not focus on what they can do well? Yes, there are things AI models are great at:

  • Pattern recognition, like finding specific people or objects in your photos, or analyzing text.

  • Understanding context, like summarizing your notifications (partially…with occasional mistakes).

  • Tasks with a limited scope, like helping you write something or answering simple questions.

I understand. You find these boring after using ChatGPT for three years.

But unlike those big promises of super-smart AI, these features are built on a much better understanding of the limitations. And for commercial purposes, they are far more reliable than the latest frontier models.

On-Device Solution: The Moat That Others Can't Cross

Most people missed the following while complaining about Apple's unimpressive updates: Apple changed the entire cost structure of using AI.

Think about a "traditional" commercial AI model (like those behind ChatGPT, Perplexity, and GitHub Copilot):

  • These models need huge data centers to run.

  • You pay based on how much you use them (per token).

  • They create privacy nightmares because your data goes to their servers and training set.

  • Even under the same rules and regulations, compliance is much more complex once data leaves the device.

Now, look at Apple's 2025 proposal:

  • There's no extra computer system needed (aside from your phone).

  • You don't pay per usage.

  • There are relatively small ongoing costs for Apple to run it.

  • Your privacy stays intact because your data never leaves your device, a far simpler case under GDPR, CCPA, and similar regulations.

While investors worry that Apple is behind in AI, developers are quietly discovering something truly amazing. This completely changes how developers think about adding AI to their apps.

When the cost of using AI becomes near zero, it opens up a whole new world of possibilities for apps. Features that would be too expensive to build with cloud-based AI suddenly make perfect sense.
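Here is a back-of-the-envelope sketch of that economics shift, using made-up but plausible numbers (the per-token price and usage figures below are my assumptions, not anyone's published pricing):

```python
# Back-of-the-envelope: cloud per-token billing vs. on-device inference.
# All numbers below are illustrative assumptions, not published pricing.

PRICE_PER_1K_TOKENS = 0.01      # hypothetical blended cloud price, USD
TOKENS_PER_REQUEST = 2_000      # a summary or a short generation
REQUESTS_PER_USER_PER_DAY = 20
USERS = 1_000_000

daily_cloud_cost = (
    USERS * REQUESTS_PER_USER_PER_DAY
    * TOKENS_PER_REQUEST / 1_000 * PRICE_PER_1K_TOKENS
)
print(f"Cloud inference: ~${daily_cloud_cost:,.0f} per day")
print(f"                 ~${daily_cloud_cost * 365:,.0f} per year")

# On-device inference: the marginal cost per request is effectively zero
# for the developer; the user's phone supplies the compute.
print("On-device:       ~$0 marginal cost per request")
```

At these assumed rates, a "free" AI feature in a popular app would cost its developer well over a hundred million dollars a year in cloud inference; on-device, the marginal cost rounds to zero.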

Meta is a different case: its true advantage is that it cares much less when AI bullshits you. If anything, hallucination is precisely what it needs to further profit from its already dominant advertising empire. I've covered that analysis here:

Equally, Apple finally (a few years later than other players) realized its true unfair advantage should come from making the most of GenAI's capabilities on top of what it already does best: the Apple ecosystem.

  • We all know that once you're in Apple's ecosystem, the switching costs are enormous.

  • Think about what 34 million developers can do with on-device AI to serve the billions of users in the App Store.

  • Apple profits from iPhone sales and the App Store monopoly (for now), not AI usage. So giving away AI for free (again, on-device, so nearly zero marginal cost) will further drive hardware sales.

  • On-device processing could be a privacy marketing line, yes, but it's also a genuine differentiator: privacy as a feature.

The moat is structural (and again… same for Meta).

Apple can afford to give away AI because it controls where and how you use it.


The Rest of the AI Industry is in Denial…

To be honest, Apple's findings aren't new.

Many researchers have pointed out these limitations before. So why does the rest of the industry continue to pretend they don't exist? Why do they keep pushing the same narrative?

The answer might be simple, and a little sad:

Too much money depends on the illusion of progress.

There's a huge financial incentive to keep the hype going, even if the underlying technology isn't living up to the promises.

When businesses and everyday users realize their 'AI agents' can't handle complicated tasks, when the promised financial returns don't actually happen, when the big productivity gains don't show up …

What happens to the whole AI craze?

The $75 Billion Panic

The drop in Apple's stock value shows that the market completely misunderstood where AI is really going.

Investors see Apple's careful approach as falling behind because they were expecting big, flashy announcements, promises of super-smart AI, and demos that blow them away.

But Apple's research, and its quiet, steady approach, might just be the smartest move in the long run. Here's what it tells us about Apple's AI strategy:

They're (finally) building for reality, not just for hype.

Know someone who plans to short Apple and needs a reality check about AI? Be a friend. Share this.


Five Keys In Your Pocket For The Path Forward.

Important!

Before you close the tab…

Apple's approach gives us a clear plan for building AI that actually works. Here are some lessons I learned, which I believe apply to any company eager to implement AI in its business.

  1. Stop pretending AI can do things it can't! (Good for your business.) Acknowledge the limitations. Build products around what AI is actually good at, not what you hope it will be good at someday.

  2. AI is great at finding and generating from well-known patterns but terrible at complex thinking, so build apps that rely on patterns, not ones that try to act like human executives. Work with the limitation: using smaller, focused AI models efficiently is much better than using huge, general models that don't work reliably.

  3. Promising (and delivering) privacy and data security is a huge advantage in a world that increasingly cares about personal data.

  4. Instead of burning money on chasing the latest model that can't last, find ways to deliver value that also make sense financially for a sustainable business model.

  5. Design solutions that let you swap models quickly; don't get tied to one version, so you can benefit from improvements when they're relevant to your problem (see the sketch after this list).
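To make point 5 concrete, here is a minimal sketch of the idea in Python (the class and method names are mine, purely for illustration): hide the model behind a narrow interface you own, so swapping providers, or moving from a cloud model to an on-device one, is a one-line change instead of a rewrite.

```python
from typing import Protocol


class TextModel(Protocol):
    """The narrow interface your app actually depends on."""

    def summarize(self, text: str) -> str: ...


class CloudModel:
    """Wraps whatever hosted API you use today behind the same interface."""

    def summarize(self, text: str) -> str:
        raise NotImplementedError("Call your current cloud provider here.")


class OnDeviceModel:
    """Wraps a local, on-device model behind the same interface."""

    def summarize(self, text: str) -> str:
        raise NotImplementedError("Call your local runtime here.")


def build_digest(notifications: list[str], model: TextModel) -> str:
    """App code depends only on the TextModel interface, never on a vendor."""
    return model.summarize("\n".join(notifications))


# Swapping models is a one-line change at the call site:
# digest = build_digest(notes, CloudModel())
# digest = build_digest(notes, OnDeviceModel())
```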

I hope you enjoy this unique viewpoint ;) Speak soon.
