If you've read my last article, congratulations. Because you are now one more week ahead of most news outlets.
It is still a circus out there.
Don’t worry if this is your first article with me.
What’s at stake?
You’d think that with all the videos from OpenAI featuring ‘experts’ in their marketing materials… maybe, just maybe, they really nailed it? And that this is a step closer to AGI (whatever AGI means today)?
It is worrying that people haven’t realized how dangerous a tool can be when it is nearly right. Particularly when your finances, health, or future are on the line.
You Can Expect:
Why is this test different from other comparisons?
This is not another technical or benchmark comparison between AI models.
It’s a second-order thinking exercise.
As always, I ask "And then what?" to uncover ripple effects that go beyond plain observation.
In this article, I’ll take you through my investigation into how these models reason—and whether they’re truly trustworthy.
Results of running the same prompt across reasoning models (including Mixture of Experts, MoE, architectures) and non-reasoning models.
Evidence of AI models confidently producing wrong answers in an option trading scenario, with documented experiment data available for paid subscribers.
Evidence that larger models are better at creating an illusion of expertise. This points to an expert paradox: you need human experts to validate the outputs of an advanced AI.
Why getting it nearly right hurts more than not using it at all, with examples of the hidden costs of relying on AI for complex reasoning tasks.
(you can skip this section if you prefer not to go into detail)
Experiment Overview
I ran a prompt asking AI models to calculate the future value of an options position. Don't worry if you do not understand what that means. Just think of it as a mathematical question involving some moving parts and assumptions. It won't affect your understanding of the experiment's takeaways.
I chose this because it’s a perfect blend of complexity and clarity. There’s a mountain of literature and data on options pricing, so any decent model should have a solid foundation here, much like a human expert.
I want to see how each model argues the present value of an option position and its expected value at various points in the future. To work this out, a model needs the following:
basic calculation ability
reasoning ability
an understanding of how value changes over time
Essentially, an AI model should be able to reason through a chain of thought that keeps track of assumptions and intermediate results in order to reach the correct answer.
I kept it simple: one prompt, no back-and-forth, just like how automated systems actually work. If AI is going to be useful in a real workflow, it needs to get things right in one shot without human hand-holding or interpretation of results.
Models
DeepSeek 14b, DeepSeek R1, Llama 3.1 8b, Claude 3.5 Sonnet 175b, Perplexity + DeepSeek R1, and OpenAI o3-mini-high.
The Prompt
I refined this prompt over multiple iterations to give smaller models a better shot at getting the right answer.
These are standard prompt techniques, but they still rely on an expert to guide the model instead of the model being the expert. I’ve included an example of a similar prompt without the extra guidance to show the difference in the experiment notes.
Consider the following scenario, today 29th January:
The stock (symbol MSFT) is currently trading at $447.2.
The 21 February calls with a 450 strike are trading at 11.90
The 21 March calls with a 450 strike are trading at 15.90
Today we initiate the following calendar spread:
long 1 21 February 450 call
Short 2 21 March 450 calls
Imagine that on 21 February, the front month expiration, MSFT trades at 475.
Our Feb call expires and we want to close out our remaining March calls.
Some extra info to help you:
In the initial transaction, the February call costs 11.90. When selling a March call, we receive 15.90.
On 21 February, the February call only has intrinsic value left of $25 and no time value. We can effectively consider the option expired on this date and collect the $25.
We are then still short 2 call options for March. If we wanted to close this position, we would have to buy 2 March 450 calls in the market on 21 February.
So we would need to estimate the price of those calls on 21 February.
The 21 March 450 call option at that point will have both intrinsic value and time value
left. The intrinsic value is easy to calculate from the given price of MSFT (475).
The time value can be approximated using the square root of time rule and our initial time value (which was 15.90)
Please calculate the overall PnL.
The Correct Answer
Any reasonable answer should contain a number of items to arrive at the overall PnL.
The value of the February option ($25), which is given in the input
The value of the March options
$25 intrinsic
Time Value
51 days remaining on Jan 29; 28 days remaining on Feb 21
sqrt(28/51) ~ 74.09%
15.90 * 74.09% = $11.78 remaining time value
Overall PnL
paid 11.90 for buying the Feb calls
received 2x15.90 (=31.80) selling the March calls
received 25 from the Feb options
paid to close the March calls for 2*(11.78+25) = 73.56
total = -11.90 + 31.80 + 25 - 73.56 = -28.66
a $28.66 loss overall
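For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation, using only the numbers given in the prompt (per-share values; the 100-share contract multiplier is left out, matching the numbers above):

```python
from math import sqrt

# Inputs from the prompt
feb_call_cost = 11.90        # paid: long 1 Feb 450 call
mar_call_price = 15.90       # received per call: short 2 Mar 450 calls
stock_at_feb_expiry = 475.0
strike = 450.0

# Square-root-of-time rule: 51 days to March expiry on Jan 29, 28 days left on Feb 21
time_value_fraction = sqrt(28 / 51)                        # ~0.7409
mar_intrinsic = stock_at_feb_expiry - strike               # 25.00
mar_time_value = mar_call_price * time_value_fraction      # ~11.78
mar_call_on_feb_21 = mar_intrinsic + mar_time_value        # ~36.78

pnl = (
    -feb_call_cost                     # bought 1 Feb call
    + 2 * mar_call_price               # sold 2 Mar calls
    + (stock_at_feb_expiry - strike)   # Feb call settles at intrinsic value (25)
    - 2 * mar_call_on_feb_21           # buy back 2 Mar calls to close
)
print(round(pnl, 2))  # -28.66
```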
I expect a solid chain of reasoning that essentially captures the overall P&L. If the model combines a few equations along the way but still arrives at the correct final answer, I’m fine with that.
I just want to see some consistent logical flow that demonstrates how it got there.
(analysis starts here…)
Initial Observations
First, let’s recognize that no model got to the correct answer.
The first Perplexity + R1 run came closest in logic but got the remaining days wrong (a common issue in several responses—date math is hard, apparently), which led to an incorrect overall number. The second try was also wildly off, as it ignored the initial credit from selling the March calls.
Even without the Perplexity front end, the smaller DeepSeek models reasoned themselves so far off track that, to varying degrees, their final answers had little relation to the actual problem. While they followed their usual pattern of backtracking and questioning previous steps, they ultimately converged on solving an entirely different problem than the one in the original instructions. In one case, DeepSeek simply got confused by the big up move in the stock, and that confusion drove its overall loss figure.
OpenAI o1 and o3-mini-high aren’t much better. If anything, they are more worrying. o3-mini-high reasons relatively well, especially when given proper prompts. However, while the reasoning seems logical at a glance, the devil is in the details: it makes laughable but subtle math mistakes, like rounding 0.649 up to 0.66…
Llama made fundamental reasoning mistakes, ignoring the square root time decay mentioned in the prompt. Instead, it confused intrinsic value with total value and worked from there.
Claude, on the other hand, did a better job but also got the remaining days wrong. Aside from that, it followed the correct approach. Interestingly, when asked to correct its own answer, it identified the mistake in the March call analysis—but then repeated the exact same calculation, leading to the same incorrect result.
Yes, these models now ‘think’ and ‘reason’. I think this is the start of a serious problem. Why? Because a user needs to be increasingly critical (and likely an expert) to correctly validate and interpret the response.
Reasoning Models Can't Reason
These reasoning models were so confidently wrong that I half expected them to start pitching AGI timelines like Altman.
Here's what Sam Altman said about o3-mini and o3-mini-high. Since when does ‘think harder’ qualify as a product feature?
Both DeepSeek R1 and o3-mini generated detailed reasoning steps, but failed to reach the correct answer.
This finding aligns with research showing LLMs become more self-assured when forced to explain their logic—even if flawed. I want to emphasize that:
Current evaluations (MMLU, AIME) test narrow STEM skills but miss real-world complexity.
Reasoning ≠ Understanding.
Some people have started showing symptoms of the AGI mirage.
While o3-mini may represent progress, failure on a question like this shows no model yet meets the "broad, adaptable intelligence" threshold of AGI.
Quantization
As a general rule, more parameters at reduced precision tend to outperform fewer parameters at higher precision.
Models store their parameters in high precision, which takes up a lot of memory space.
Reducing precision (from 32/16-bit values down to 2-bit integers at its most aggressive) has a huge impact on memory footprint.
I’ve found that 8-bit and 4-bit quantized models usually hit the sweet spot, achieving 85-95% of full-precision accuracy while using a quarter or less of the memory footprint (video memory, in this case).
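To make the memory numbers concrete, here is a back-of-the-envelope sketch (weights only; activations, KV cache, and framework overhead are ignored, and the 14B parameter count is just an example matching the DeepSeek distill above):

```python
# Rough weight-memory footprint at different precisions (weights only).
PARAMS = 14e9  # e.g. a 14B-parameter model

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1024**3
    print(f"{fmt:>5}: ~{gb:.0f} GB of weights")

# fp32: ~52 GB, fp16: ~26 GB, int8: ~13 GB, int4: ~7 GB,
# which is why 8-bit/4-bit quantization is what lets these models
# fit on a single consumer GPU.
```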
Perplexity
How does Perplexity work?
Overly simplified: it wraps its own processing and intelligence around the user's request, runs part of the work itself, and feeds the results into the LLM to achieve better outcomes.
In my example prompt, it started by writing and executing several short Python snippets for the calculations. However, those calculations were incorrect to start with.
The script and results were then fed into DeepSeek alongside the original prompt.
During DeepSeek’s reasoning process, it analyzes the code to assist its logic but still often arrives at a different outcome and questions the input data.
With Perplexity’s incorrect input in the mix, the problem becomes even harder for DeepSeek to get right. So I was not surprised that DeepSeek ultimately agreed with Perplexity’s calculated values, presumably because, from DeepSeek’s perspective, they were part of the input.
It’s unclear exactly how Perplexity instructs DeepSeek, but it likely includes language that makes the model prioritize the input data over its own reasoning.
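Perplexity’s internals aren’t public, so treat the sketch below as a guess at the general pattern rather than their actual pipeline: run some pre-computation first, then prepend its output to the prompt the LLM sees. Every function name and prompt string here is my own placeholder.

```python
# Hypothetical wrapper pattern: pre-compute with code, then hand the
# results to the LLM as part of the prompt. Not a real Perplexity API.

def run_precomputation(question: str) -> str:
    """Stand-in for the step that writes/executes short Python scripts
    for the numeric parts of the question and captures their output."""
    # ... generate and run snippets, collect stdout ...
    return "intrinsic_value = 25.0\nremaining_time_value = 11.78"

def build_wrapped_prompt(question: str, tool_output: str) -> str:
    # If the instructions tell the model to trust these numbers, an error
    # in the pre-computation propagates straight into the final answer.
    return (
        "Use the following pre-computed values in your reasoning:\n"
        f"{tool_output}\n\n"
        f"Question:\n{question}"
    )

def answer(question: str, llm) -> str:
    tool_output = run_precomputation(question)
    return llm(build_wrapped_prompt(question, tool_output))
```

The failure I saw fits this structure: once the pre-computed numbers are wrong, the model treats them as ground truth instead of re-deriving them.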
Prompting
I accept that my prompt was far from perfect… if such a thing as a perfect prompt even exists. Let alone one that would seamlessly work across all LLMs.
I did not set any system prompts (when running locally) or include other preambles when using Claude/Perplexity for the captured runs.
In other trials, I experimented with role assignments in the prompt, layering it with references to options theory (Black-Scholes) and practical shortcuts. However, none of these changes made a meaningful difference—the mistakes remained the same as those shown in the full research.
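For reference, the role-layered variants looked roughly like this; the wording is a paraphrase of what I tried, not the verbatim prompt:

```python
# Paraphrased sketch of the role-layered prompt variants (not the exact text used).
base_prompt = "Consider the following scenario, today 29th January: ..."  # the prompt shown above

role_preamble = (
    "You are an experienced options trader. "
    "Use standard options theory (Black-Scholes intuition) and practical "
    "shortcuts such as the square-root-of-time rule where appropriate.\n\n"
)

wrapped_prompt = role_preamble + base_prompt
```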
While this is only one example and not an exhaustive test, a few things are already clear.
Commercial Use? Examples and Pitfalls
Maybe it is my lack of imagination.
Neither of us sees any commercial application that can be built with reasoning models without an enormous amount of engineering time and additional monitoring costs.
Keep in mind that I’m not talking about generative content; I’m talking about integrating a reasoning model as part of a bigger system.
Example 1. CRM
LLM outputs can be highly variable in both format and structure, so you’ll likely need to do some post-processing. Imagine you’ve transcribed a candidate’s job interview, and you want to automatically pull out basic data: their desired salary, whether they have the right to work in your country, their start date, or their notice period.
You could prompt an LLM to figure these things out—then try to constrain the response to a specific format (like JSON) so you can pass it along to an API for updates in your CRM or other processing.
But you do need a well-defined structure. LLMs can be finicky about reliably producing the exact format you want, so you’ll still need to double-check or fix the output.
Take salary expectations, for example. Will the model return a number like 75000, or 75,000, or £75,000? Maybe it spits out “75k,” or leaves out the “k,” or just writes “75.” A human can read those and understand what was meant, but from a coding standpoint, you have to handle all that variation, or your system could break.
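To make that concrete, here is the kind of normalization layer you end up writing anyway, just for the salary field (a rough sketch; real inputs will surface more variants than this handles):

```python
import re

def parse_salary(raw: str) -> int | None:
    """Normalize LLM-extracted salary strings like '75000', '75,000',
    '£75,000', '75k' or '75' into an integer."""
    text = raw.strip().lower().replace(",", "").replace("£", "").replace("$", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(k)?", text)
    if match is None:
        return None  # unparseable: hand off to a human or re-prompt
    value = float(match.group(1))
    if match.group(2) == "k" or value < 1000:  # heuristic: treat a bare '75' as 75k
        value *= 1000
    return int(value)

for raw in ["75000", "75,000", "£75,000", "75k", "75"]:
    print(raw, "->", parse_salary(raw))  # all -> 75000
```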
Example 2. Insurance Payout
Let’s say an insurance company decides to use a reasoning LLM to automate part of its claims process. The idea is that customers submit descriptions of their accidents, and the LLM will parse the details, evaluate liability, and suggest a payout or rejection.
Sounds great in theory. In practice, though:
Again, you need to have input constraints. The system needs consistent input—specific fields like incident date, vehicle details, policy numbers, etc.
Yes, you can experiment with more precise or creative prompting to push the LLM toward a consistent format, but there’s no guarantee you won’t still get oddball results. In reality, you’ll likely need to handle all edge cases. Some libraries attempt to tackle this, and from test runs, it’s obvious how much variety an LLM can produce.
If customers word things differently (which they always do!), you don’t want your LLM to misinterpret a claim or ignore an important piece of info.
Yes, you can use structured forms, like all insurance companies do now, but wait: I thought you wanted the LLM to do the work?
Output constraints are also a must. The LLM's output has to fit into your claims database system. You might want a JSON structure that says “liability = 70%,” “suggested payout = $5,000,” etc. However, the model could spit out a narrative paragraph or mix up the fields. You’ll need error-checking, fallback logic, or re-prompting cycles. That’s more engineering work.
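A minimal version of that guardrail layer might look like the sketch below: parse, validate against the fields you need, and fall back to a retry or a human queue when the model drifts. The field names and the call_llm function are placeholders, not a real claims API.

```python
import json

REQUIRED_FIELDS = {"liability_pct": float, "suggested_payout": float, "decision": str}

def validate_claim_output(raw: str) -> dict | None:
    """Return a clean dict if the LLM output matches the expected schema,
    otherwise None so the caller can re-prompt or escalate to a human."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model returned a narrative paragraph instead of JSON
    cleaned = {}
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return None  # missing or mixed-up fields
        try:
            cleaned[field] = expected_type(data[field])
        except (TypeError, ValueError):
            return None  # right field, wrong type/format
    return cleaned

def process_claim(description: str, call_llm, max_retries: int = 2):
    for _ in range(max_retries + 1):
        result = validate_claim_output(call_llm(description))
        if result is not None:
            return result
    return None  # give up and flag for human review
```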
Even with all these guardrails, the model still hallucinates or produces contradictory judgments.
Now you need people reviewing flagged cases. Obviously, the bigger models do a better job, but they are still wrong, and their answers appear more credible. So you don't need just anyone; you want someone with legal or compliance training. This quickly adds up. The more subtly an LLM can fail, the sooner the return on investment disappears into monitoring costs.
Legal liability is all but guaranteed. Just think about the lawsuit against Character.AI. The resulting legal costs (and brand damage) can outstrip any savings you get from using the LLM in the first place.
And the cost list goes on.
It’s often far more expensive and riskier than people assume when they first think, “I’ll just plug in an AI and handle claims automatically.”
Last Words
Let’s be honest—the hype is huge, but the reality is a circus. Same input. Wildly different outputs. There is no such thing as the summary, the image, or the code.
It’s almost like magic. Not the good kind, either.
That's why you see news stories like an AI chatbot agreeing to offer a car for just $1.
You might dream of a plug-and-play solution. In practice, you’re forking out serious cash for engineers to babysit your system. You’ll spend more time cleaning up messy data than actually using it. And then there’s the endless monitoring. Oh, and don’t even get started on the legal nightmares.
In the wild world of LLMs, understanding is still a distant dream. So you’d want to wait before throwing money into a solution.