If you've read my last article, congratulations. Because you are now one more week ahead of most news outlets.
It is still a circus out there.
Don’t worry if this is your first article with me.
What’s at stake?
You’d think that with all the videos from OpenAI featuring ‘experts’ in their marketing materials… maybe, just maybe, they really nailed it? And that this is a step closer to AGI (whatever AGI means today)?
It is worrying that people haven’t realized how dangerous a tool can be when it is nearly right. Particularly when your finances, health, or future are on the line.
You Can Expect:
Why is this test different from other comparisons?
This is not another technical or benchmark comparison between AI models.
It’s a second-order thinking exercise.
As always, I ask "And then what?" to uncover ripple effects that go beyond plain observation.
In this article, I’ll take you through my investigation into how these models reason—and whether they’re truly trustworthy.
Results of running the same prompt across reasoning models (including Mixture of Experts, MoE, architectures) and non-reasoning models.
Evidence of AI models confidently producing wrong answers in an option trading scenario, with documented experiment data available for paid subscribers.
Evidence that larger models are better at creating an illusion of expertise. This points to an expert paradox: you need human experts to validate the outputs of an advanced AI.
Why getting it nearly right hurts more than not using it at all, with examples of the hidden costs of relying on AI for complex reasoning tasks.
(you can skip this section if you prefer not to go into detail)
Experiment Overview
I ran a prompt asking AI models to calculate the future value of an options position. Don't worry if you do not understand what that means. Just think of it as a mathematical question involving some moving parts and assumptions. It won't affect your understanding of the experiment's takeaways.
I chose this because it’s a perfect blend of complexity and clarity. There’s a mountain of literature and data on options pricing, so any decent model should have a solid foundation here, much like a human expert.
I want to see how each model argues the present value of an option position and its expected value at various points in the future. To work this out, a model needs the following:
basic calculation ability
reasoning ability
an understanding of how value changes over time
Essentially, an AI model should be able to reason through a chain of thought that keeps track of assumptions and intermediate results in order to reach the correct answer.
I kept it simple: one prompt, no back-and-forth, just like how automated systems actually work. If AI is going to be useful in a real workflow, it needs to get things right in one shot without human hand-holding or interpretation of results.
Models
DeepSeek 14b, DeepSeek R1, Llama 3.1 8b, Claude 3.5 Sonnet 175b, Perplexity + DeepSeek R1, and OpenAI o3-mini-high.
The Prompt
I refined this prompt over multiple iterations to give smaller models a better shot at getting the right answer.
These are standard prompt techniques, but they still rely on an expert to guide the model instead of the model being the expert. I’ve included an example of a similar prompt without the extra guidance to show the difference in the experiment notes.
Consider the following scenario, today 29th January:
The stock (symbol MSFT) is currently trading at $447.2.
The 21 February calls with a 450 strike are trading at 11.90
The 21 March calls with a 450 strike are trading at 15.90
Today we initiate the following calendar spread:
long 1 21 February 450 call
Short 2 21 March 450 calls
Imagine that on 21 February, the front month expiration, MSFT trades at 475.
Our Feb call expires and we want to close out our remaining March calls.
Some extra info to help you:
In the initial transaction, the February call costs 11.90. When selling a March call, we receive 15.90.
On 21 February, the February call only has intrinsic value left of $25 and no time value. We can effectively consider the option expired on this date and collect the $25.
We are then still short 2 call options for March. If we wanted to close this position, we would have to buy 2 March 450 calls in the market on 21 February.
So we would need to estimate the price of those calls on 21 February.
The 21 March 450 call option at that point will have both intrinsic value and time value
left. The intrinsic value is easy to calculate from the given price of MSFT (475).
The time value can be approximated using the square root of time rule and our initial time value (which was 15.90)
Please calculate the overall PnL.
The Correct Answer
Any reasonable answer should contain a number of items to arrive at the overall PnL.
The value of the February option ($25), which is given in the input
The value of the March options
$25 intrinsic
Time Value
51 days remaining on Jan 29; 28 days remaining on Feb 21
sqrt(28/51) ~ 74.09%
15.90 * 74.09% = $11.78 remaining time value
Overall PnL
paid 11.90 for buying the Feb calls
received 2x15.90 (=31.80) selling the March calls
received 25 from the Feb options
paid to close the March calls for 2*(11.78+25) = 73.56
total = -11.90 + 31.80 + 25 - 73.56 = -28.66
a $28.66 loss overall
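For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation, using only the numbers given in the prompt (per-share values; the 100-share contract multiplier is left out, matching the numbers above):

```python
from math import sqrt

# Inputs from the prompt
feb_call_cost = 11.90        # paid: long 1 Feb 450 call
mar_call_price = 15.90       # received per call: short 2 Mar 450 calls
stock_at_feb_expiry = 475.0
strike = 450.0

# Square-root-of-time rule: 51 days to March expiry on Jan 29, 28 days left on Feb 21
time_value_fraction = sqrt(28 / 51)                        # ~0.7409
mar_intrinsic = stock_at_feb_expiry - strike               # 25.00
mar_time_value = mar_call_price * time_value_fraction      # ~11.78
mar_call_on_feb_21 = mar_intrinsic + mar_time_value        # ~36.78

pnl = (
    -feb_call_cost                     # bought 1 Feb call
    + 2 * mar_call_price               # sold 2 Mar calls
    + (stock_at_feb_expiry - strike)   # Feb call settles at intrinsic value (25)
    - 2 * mar_call_on_feb_21           # buy back 2 Mar calls to close
)
print(round(pnl, 2))  # -28.66
```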
I expect a solid chain of reasoning that essentially captures the overall P&L. If the model combines a few equations along the way but still arrives at the correct final answer, I’m fine with that.
I just want to see some consistent logical flow that demonstrates how it got there.
(analysis starts here…)
Initial Observations
First, let’s recognize that no model got to the correct answer.
The first Perplexity + R1 run came closest in logic but got the remaining days wrong (a common issue in several responses—date math is hard, apparently), which led to an incorrect overall number. The second try was also wildly off, as it ignored the initial credit from selling the March calls.
Even without the Perplexity front end, the smaller DeepSeek models reasoned themselves so far off track that, to varying degrees, their final answers had little relation to the actual problem. While they followed their usual pattern of backtracking and questioning previous steps, they ultimately converged on solving an entirely different problem than the one in the original instructions. In one case, DeepSeek simply got confused by the big up move in the stock, and that confusion drove its overall loss figure.
OpenAI o1 and o3-mini-high aren’t much better. If anything, they are more worrying. o3-mini-high reasons relatively well, especially when given proper prompts. However, while the reasoning seems logical at a glance, the devil is in the details: it makes laughable but subtle math mistakes, like rounding 0.649 up to 0.66…
Llama made fundamental reasoning mistakes, ignoring the square root time decay mentioned in the prompt. Instead, it confused intrinsic value with total value and worked from there.
Claude, on the other hand, did a better job but also got the remaining days wrong. Aside from that, it followed the correct approach. Interestingly, when asked to correct its own answer, it identified the mistake in the March call analysis—but then repeated the exact same calculation, leading to the same incorrect result.
Yes, these models now ‘think’ and ‘reason’. I think this is the start of a serious problem. Why? Because a user needs to be increasingly critical (and likely an expert) to correctly validate and interpret the response.
Reasoning Models Can't Reason
These reasoning models were so confidently wrong that I half expected them to start pitching AGI timelines like Altman.
Here's what Sam Altman said about o3-mini and o3-mini-high. Since when does ‘think harder’ qualify as a product feature?
Both DeepSeek R1 and o3-mini generated detailed reasoning steps, but failed to reach the correct answer.
This finding aligns with research showing LLMs become more self-assured when forced to explain their logic—even if flawed. I want to emphasize that:
Current evaluations (MMLU, AIME) test narrow STEM skills but miss real-world complexity.
Reasoning ≠ Understanding.
Some people have started showing symptoms of the AGI mirage.
While o3-mini may represent progress, failure on a question like this shows no model yet meets the "broad, adaptable intelligence" threshold of AGI.
Quantization
As a general rule, more parameters at reduced precision tend to outperform fewer parameters at higher precision.
Models store their parameters in high precision, which takes up a lot of memory space.
Reducing precision (from 32/16-bit values down to 2-bit integers at its most aggressive) has a huge impact on memory footprint.
I’ve found that 8-bit and 4-bit quantized models usually hit the sweet spot, achieving 85-95% of full-precision accuracy while using a quarter or less of the memory footprint (video memory, in this case).
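To make the memory numbers concrete, here is a back-of-the-envelope sketch (weights only; activations, KV cache, and framework overhead are ignored, and the 14B parameter count is just an example matching the DeepSeek distill above):

```python
# Rough weight-memory footprint at different precisions (weights only).
PARAMS = 14e9  # e.g. a 14B-parameter model

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1024**3
    print(f"{fmt:>5}: ~{gb:.0f} GB of weights")

# fp32: ~52 GB, fp16: ~26 GB, int8: ~13 GB, int4: ~7 GB,
# which is why 8-bit/4-bit quantization is what lets these models
# fit on a single consumer GPU.
```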
Perplexity
How does Perplexity work?
Overly simplified: it wraps its own processing and intelligence around the user's request, runs part of the work itself, and feeds the results into the LLM to achieve better outcomes.
In my example prompt, it started by writing and executing several short Python snippets for the calculations. However, those calculations were incorrect to start with.
The script and results were then fed into DeepSeek alongside the original prompt.
During DeepSeek’s reasoning process, it analyzes the code to assist its logic but still often arrives at a different outcome and questions the input data.
With Perplexity’s incorrect input in the mix, the problem becomes even harder for DeepSeek to get right. So I was not surprised that DeepSeek ultimately agreed with Perplexity’s calculated values, presumably because, from DeepSeek’s perspective, they were part of the input.
It’s unclear exactly how Perplexity instructs DeepSeek, but it likely includes language that makes the model prioritize the input data over its own reasoning.
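Perplexity’s internals aren’t public, so treat the sketch below as a guess at the general pattern rather than their actual pipeline: run some pre-computation first, then prepend its output to the prompt the LLM sees. Every function name and prompt string here is my own placeholder.

```python
# Hypothetical wrapper pattern: pre-compute with code, then hand the
# results to the LLM as part of the prompt. Not a real Perplexity API.

def run_precomputation(question: str) -> str:
    """Stand-in for the step that writes/executes short Python scripts
    for the numeric parts of the question and captures their output."""
    # ... generate and run snippets, collect stdout ...
    return "intrinsic_value = 25.0\nremaining_time_value = 11.78"

def build_wrapped_prompt(question: str, tool_output: str) -> str:
    # If the instructions tell the model to trust these numbers, an error
    # in the pre-computation propagates straight into the final answer.
    return (
        "Use the following pre-computed values in your reasoning:\n"
        f"{tool_output}\n\n"
        f"Question:\n{question}"
    )

def answer(question: str, llm) -> str:
    tool_output = run_precomputation(question)
    return llm(build_wrapped_prompt(question, tool_output))
```

The failure I saw fits this structure: once the pre-computed numbers are wrong, the model treats them as ground truth instead of re-deriving them.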
Prompting
I accept that my prompt was far from perfect… if such a thing as a perfect prompt even exists. Let alone one that would seamlessly work across all LLMs.
I did not set any system prompts (when running locally) or include other preambles when using Claude/Perplexity for the captured runs.
In other trials, I experimented with role assignments in the prompt, layering it with references to options theory (Black-Scholes) and practical shortcuts. However, none of these changes made a meaningful difference—the mistakes remained the same as those shown in the full research.
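For reference, the role-layered variants looked roughly like this; the wording is a paraphrase of what I tried, not the verbatim prompt:

```python
# Paraphrased sketch of the role-layered prompt variants (not the exact text used).
base_prompt = "Consider the following scenario, today 29th January: ..."  # the prompt shown above

role_preamble = (
    "You are an experienced options trader. "
    "Use standard options theory (Black-Scholes intuition) and practical "
    "shortcuts such as the square-root-of-time rule where appropriate.\n\n"
)

wrapped_prompt = role_preamble + base_prompt
```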
While this is only one example and not an exhaustive test, a few things are already clear.
Commercial Use? Examples and Pitfalls
Maybe it is my lack of imagination.
Neither of us sees any commercial application that can be built with reasoning models without an enormous amount of engineering time and additional monitoring costs.
Keep in mind that I’m not talking about generative content; I’m talking about integrating a reasoning model as part of a bigger system.
Example 1. CRM
LLM outputs can be highly variable in both format and structure, so you’ll likely need to do some post-processing. Imagine you’ve transcribed a candidate’s job interview, and you want to automatically pull out basic data: their desired salary, whether they have the right to work in your country, their start date, or their notice period.
You could prompt an LLM to figure these things out—then try to constrain the response to a specific format (like JSON) so you can pass it along to an API for updates in your CRM or other processing.
But you do need a well-defined structure. LLMs can be finicky about reliably producing the exact format you want, so you’ll still need to double-check or fix the output.
Take salary expectations, for example. Will the model return a number like 75000, or 75,000, or £75,000? Maybe it spits out “75k,” or leaves out the “k,” or just writes “75.” A human can read those and understand what was meant, but from a coding standpoint, you have to handle all that variation, or your system could break.
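To make that concrete, here is the kind of normalization layer you end up writing anyway, just for the salary field (a rough sketch; real inputs will surface more variants than this handles):

```python
import re

def parse_salary(raw: str) -> int | None:
    """Normalize LLM-extracted salary strings like '75000', '75,000',
    '£75,000', '75k' or '75' into an integer."""
    text = raw.strip().lower().replace(",", "").replace("£", "").replace("$", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(k)?", text)
    if match is None:
        return None  # unparseable: hand off to a human or re-prompt
    value = float(match.group(1))
    if match.group(2) == "k" or value < 1000:  # heuristic: treat a bare '75' as 75k
        value *= 1000
    return int(value)

for raw in ["75000", "75,000", "£75,000", "75k", "75"]:
    print(raw, "->", parse_salary(raw))  # all -> 75000
```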
Example 2. Insurance Payout
Let’s say an insurance company decides to use a reasoning LLM to automate part of its claims process. The idea is that customers submit descriptions of their accidents, and the LLM will parse the details, evaluate liability, and suggest a payout or rejection.
Sounds great in theory. In practice, though:
Again, you need to have input constraints. The system needs consistent input—specific fields like incident date, vehicle details, policy numbers, etc.
Yes, you can experiment with more precise or creative prompting to push the LLM toward a consistent format, but there’s no guarantee you won’t still get oddball results. In reality, you’ll likely need to handle all edge cases. Some libraries attempt to tackle this, and from test runs, it’s obvious how much variety an LLM can produce.
If customers word things differently (which they always do!), you don’t want your LLM to misinterpret a claim or ignore an important piece of info.
Yes, you can use structured forms, like all insurance companies do now, but wait: I thought you wanted the LLM to do the work?
Output constraints are also a must. The LLM's output has to fit into your claims database system. You might want a JSON structure that says “liability = 70%,” “suggested payout = $5,000,” etc. However, the model could spit out a narrative paragraph or mix up the fields. You’ll need error-checking, fallback logic, or re-prompting cycles. That’s more engineering work.
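A minimal version of that guardrail layer might look like the sketch below: parse, validate against the fields you need, and fall back to a retry or a human queue when the model drifts. The field names and the call_llm function are placeholders, not a real claims API.

```python
import json

REQUIRED_FIELDS = {"liability_pct": float, "suggested_payout": float, "decision": str}

def validate_claim_output(raw: str) -> dict | None:
    """Return a clean dict if the LLM output matches the expected schema,
    otherwise None so the caller can re-prompt or escalate to a human."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model returned a narrative paragraph instead of JSON
    cleaned = {}
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return None  # missing or mixed-up fields
        try:
            cleaned[field] = expected_type(data[field])
        except (TypeError, ValueError):
            return None  # right field, wrong type/format
    return cleaned

def process_claim(description: str, call_llm, max_retries: int = 2):
    for _ in range(max_retries + 1):
        result = validate_claim_output(call_llm(description))
        if result is not None:
            return result
    return None  # give up and flag for human review
```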
Even with all these guardrails, the model still hallucinates or produces contradictory judgments.
Now you need people reviewing flagged cases. Obviously, the bigger models do a better job, but they are still wrong, and their answers appear more credible. So you don't need just anyone; you want someone with legal or compliance training. This quickly adds up. The more subtly an LLM can fail, the sooner the return on investment disappears into monitoring costs.
Legal liability is all but guaranteed. Just think about the lawsuit against Character.AI. The resulting legal costs (and brand damage) can outstrip any savings you get from using the LLM in the first place.
And the cost list goes on.
It’s often far more expensive and riskier than people assume when they first think, “I’ll just plug in an AI and handle claims automatically.”
Last Words
Let’s be honest—the hype is huge, but the reality is a circus. Same input. Wildly different outputs. There is no such thing as the summary, the image, or the code.
It’s almost like magic. Not the good kind, either.
That's why you see news stories like an AI chatbot agreeing to offer a car for just $1.
You might dream of a plug-and-play solution. In practice, you’re forking out serious cash for engineers to babysit your system. You’ll spend more time cleaning up messy data than actually using it. And then there’s the endless monitoring. Oh, and don’t even get started on the legal nightmares.
In the wild world of LLMs, understanding is still a distant dream. So you’d want to wait before throwing money into a solution.