What have you heard about AI agents?
It feels like every other headline is talking about how these autonomous systems are set to revolutionize everything, especially in the world of software and business.
We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies. — OpenAI CEO Sam Altman
or
The age of AI Agentics is here… a multi-trillion-dollar opportunity. — Nvidia CEO Jensen Huang
As someone who spends a lot of time thinking about strategy, I see the potential in AI agents. But I also think we must cut through the hype and have an honest conversation about where things really stand today.
For the longest time, you and I have navigated a fragmented digital world.
We jump into our CRM for customer details, hop over to the HR portal to request time off, and wrestle with spreadsheets to analyze product data. We've even gotten pretty good at stitching these systems together with tools like Zapier or Mulesoft, creating intricate, choreographed dances between applications.
Now, we're told something far more profound is coming: agentic AI.
This isn't just simple automation; the promise is of systems that can reason, remember, and act independently. The buzz suggests they could reshape software economics even more dramatically than the Internet or the cloud did.
I'm in a network of SaaS CEOs hosted by a consulting firm I work with, and whether agentic AI will replace SaaS is a question that comes up nearly every week.
I get it: recent developments in agentic AI (such as Manus.im) look promising.
It's a compelling yet scary vision of the future, but before we get carried away, let's take a hard look at the agents we have now. Are they truly the autonomous powerhouses we imagine, or are they still just the opening act?
Agents in 2025: Automation ≠ Autonomy
Let’s start by agreeing on what AI agents are and aren’t.
Key Relationships
LLMs → AI Agents: LLMs provide language understanding; agents use them to decide next steps
AI Agents → Agentic AI: Basic agents handle simple workflows; agentic AI coordinates multi-agent architectures for planning and complex workflows
MCP → AI Agents: MCP (Model Context Protocol) standardizes how agents access external tools and data, enabling universal interoperability
Critical Distinctions
LLMs ≠ Agents: LLMs generate text; agents act on that text via tools
Agents ≠ Agentic AI: Basic agents follow predefined workflows; agentic AI adapts its planning and focuses more on execution
MCP ≠ Agent Framework: MCP handles tool integration, not workflow orchestration
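To make those distinctions concrete, here's a minimal sketch in Python of the difference between an LLM generating text and an agent acting on that text through a tool. The `call_llm` and `get_weather` functions are hypothetical stand-ins (returning canned values so the example runs), not any real SDK or framework:

```python
# A minimal sketch of the LLM-vs-agent distinction, not any specific framework.
# `call_llm` is a hypothetical stand-in; it returns canned text so this runs
# without a model provider.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (text in, text out)."""
    if "JSON" in prompt:
        return '{"tool": "get_weather", "args": {"city": "Paris"}}'
    return "I think it's probably sunny, but I can't check."

def get_weather(city: str) -> str:
    """A 'tool' the agent can act with; a stub for a real API."""
    return f"22°C and sunny in {city}"

TOOLS = {"get_weather": get_weather}

# 1) A bare LLM call: text in, text out. Nothing happens in the world.
print(call_llm("What's the weather in Paris?"))

# 2) An agent step: the model's text is parsed as a tool request and executed.
plan = call_llm('Reply ONLY with JSON like {"tool": ..., "args": ...} for: weather in Paris')
request = json.loads(plan)                         # the "decision" is just text we parse
print(TOOLS[request["tool"]](**request["args"]))   # the wrapper code, not the LLM, acts
```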
Now that we're aligned on the basics: what do you get if you start applying agents today?
If you peek under the hood, you'll find they are mostly models wrapped in code that lets them call APIs or interact with tools. Think of them less as independent thinkers and more as highly sophisticated automation following carefully choreographed workflows.
They can string together actions, such as looking up data here, pasting it there, or triggering an action somewhere else, often based on predetermined paths we design for them.
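Here's a minimal sketch, using the same kind of hypothetical stand-ins (`crm_lookup`, `send_email`, `call_llm`), of what such a choreographed workflow looks like in practice: we hard-code the path, and the model only fills in the text between the steps.

```python
# A minimal sketch of the "choreographed workflow" pattern described above.
# The step order is hard-coded by us; the model only fills in text along the way.
# `crm_lookup`, `send_email`, and `call_llm` are hypothetical stand-ins.

def crm_lookup(customer_id: str) -> dict:
    """Pretend CRM API call."""
    return {"name": "Acme Corp", "renewal_date": "2025-09-01"}

def send_email(to: str, body: str) -> None:
    """Pretend action in another system."""
    print(f"To: {to}\n{body}")

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text so this runs."""
    return "Hi Acme Corp, a friendly reminder that your contract renews on 2025-09-01."

def renewal_reminder(customer_id: str) -> None:
    record = crm_lookup(customer_id)   # step 1: look up data here...
    draft = call_llm(                  # step 2: ...have the model draft text there...
        f"Write a short renewal reminder for {record['name']} "
        f"(renewal date {record['renewal_date']})."
    )
    send_email("account-team@example.com", draft)  # step 3: ...trigger an action elsewhere
    # The path never changes between runs; only the generated wording does.

renewal_reminder("acme-42")
```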
Several significant limitations keep AI agents firmly in the assistant category rather than making them truly autonomous:
Highly choreographed workflows, with minimal agency or adaptability by the agent.
Their behavior is massively influenced by the prompts we give them: both the system prompt (the initial instructions) and our ongoing requests. This often makes them really good at a specific, known problem but prone to falling over when faced with unexpected variations. It's less open exploration and more following a script we subtly wrote for them.
Reasoning is, in reality, just more text to interpret, still heavily under the influence of the prompts (system and user alike), which in turn play on the attention mechanism.
Have you ever tried slightly changing the API an agent interacts with, or upgrading the underlying LLM? Things can break surprisingly easily. They aren't adaptable to changes in their digital environment.
As interactions get longer, or the amount of information they need to track increases, current agents start to... forget.
Their attention mechanism, which is core to how LLMs work, can lose track of the initial goal or important facts mentioned earlier. It's like someone suffering from amnesia trying to follow a complex recipe but constantly forgetting that a pot of broth is brewing on the stove, or where they left the mise en place.
Because they aren't deterministic (we'll get more into that), figuring out why an agent failed can be a nightmare. Debugging is nearly impossible because this process usually starts with reliably reproducing a problem, which is tough when the agent does something different each time.
Considering all of the above, the real decisions and important verifications are still made by a human, which at heart makes these tools automators rather than autonomous agents.
AI agents deliver most of their value when they act as input providers.
Feeding us ideas, data, or options for next steps instead of making big, autonomous decisions. The main innovation, as I see it, is that we’ve moved our control interface into natural language.
The downside?
A lot of the battle-tested tools and best practices we developed for traditional software fall apart in this new, fuzzier world of AI agents.
Prompt Theater
Imagine system prompts as the hidden tracks guiding an AI agent’s wheels: you might tell it “work step by step” or “consider alternatives,” and suddenly it feels like it’s reasoning.
In truth, you’re just building a template for token generation. Think of it like sketching a book outline before you write a word; it shapes the flow but isn’t the story itself. A few techniques you’ll hear about from the prompt pros on YouTube or LinkedIn boil down to the following:
Take the popular “deep research” mode, like the new button you’ll find in ChatGPT. When you see the agent choose a path, it’s really following whatever prompt you handed it first. Because that instruction sits in front of every generated token, it heavily steers the outcome rather than enabling independent reasoning.
You can say “act as an expert in X” or “be a friendly, helpful support rep,” and you’ll get polished, expert-sounding replies—or at least what sounds like expertise. Under the hood, though, it’s still just tokens pushed through a complex math function.
The models also have no persistent identity. Underneath, it’s a contextless, stateless system at heart: every question is entirely stand-alone, even if it’s a follow-up on a previous topic. So you need to load the entire context every time, as an ever-growing input, which brings us right back to the attention/amnesia problems.
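A minimal sketch of what that statelessness means for whoever writes the wrapper code, assuming a chat-style interface (`call_llm` is again a hypothetical stand-in that returns a canned string):

```python
# The model remembers nothing between calls, so the caller resends the whole
# transcript every time. `call_llm` is a hypothetical stand-in.

def call_llm(messages: list[dict]) -> str:
    """Stand-in: the model only 'knows' whatever is inside `messages`."""
    return f"(answer based on {len(messages)} messages of context)"

history = [{"role": "system", "content": "You are a project assistant."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    answer = call_llm(history)            # the entire conversation goes in, every turn
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Summarize the Q3 roadmap."))
print(ask("What did I just ask about?"))  # only works because we resent turn one
# As `history` grows, attention over the earliest turns degrades: the amnesia problem.
```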
And again, even if the prompts are identical, the outputs are still not deterministic, which means the executed plan (first look at X, then Y, then do Z) is not guaranteed to be consistent between runs.
The Entire World Is Built On Deterministic Execution.
When the light turns green, the cars in that lane drive; when you plug a cooker into a socket, you expect it to start steaming.
It's been the same with everything in software for the last 40 years: code does what code says it does. Every time.
If it fails, it fails every time in exactly the same way. When your cooker doesn't turn on after being plugged in, you know the problem is the cooker, the socket, or a power outage, and you start troubleshooting from there.
Again, it's the same in software. (Note: we may struggle to reproduce the exact symptom, but at heart, given the exact same input, you get the exact same output.) It's deterministic execution that helps you track down the bug.
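To spell out what we've been taking for granted, here's a trivial illustration of that determinism (a made-up pricing function, nothing more):

```python
# The same input always yields the same output, so an exact assertion can pin
# the behavior down forever.

def vat_inclusive_price(net: float, vat_rate: float = 0.20) -> float:
    return round(net * (1 + vat_rate), 2)

# Run this a million times, on any machine, and it never changes.
assert vat_inclusive_price(100.0) == 120.0
# If it ever breaks, it breaks the same way every time, which is exactly what
# lets you reproduce the bug, bisect the offending change, and fix it.
```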
AI agents, built on large language models, don't work like that.
GenAI-based systems are non-deterministic.
Ask genAI the same question twice and you get two slightly different, perfectly valid answers. Or, sometimes, one answer might be great, and the next... not so much. This non-determinism isn't always bad; creativity often involves exploring different paths, and humans aren't always deterministic either.
I will, however, argue that the boundaries within which human non-determinism sits are far narrower than an LLM's.
A simple example: if a human summarizes a meeting, the summary will reflect what was actually said. Even if the exact words differ from one attempt to the next, the key points get captured. An LLM, on the other hand, will occasionally miss the ball entirely and invent actions or elements of the conversation.
This inherent unpredictability throws a massive wrench into the works when you try to build reliable systems around these agents. It raises some really thorny operational questions that we, in the software world, haven't had to grapple with in quite this way before:
How can you reliably test them, when the entire testing paradigm is built on "given input X, expect output Y"? How do you test when output Y changes every time?
If you can’t test the product, how do you expect to debug it when it goes wrong and you can’t reliably reproduce the symptom? Does that also mean there’s no meaningful version control, given you can’t determine how changes from one version to the next affect the output?
Then… maybe you put guardrails on them? For example, constraining the spread of possible answers or actions so they stay within safe and acceptable boundaries (see the sketch after these questions)?
What about authority? How do we handle multiple goals and trade-offs? Can we avoid "our sales-oriented AI cracked the puzzle by selling $30k cars for $1"?
And I haven't even touched on the world of highly regulated, compliance-heavy industries (insurance, finance, healthcare…) and the potential liabilities this behavior creates. As we know, enterprises generally want accountability, along with its siblings, transparency and auditability.
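For the guardrails question above, here's a minimal sketch of one common pattern: never trust free-form output, force it into a schema, and reject anything outside hard business limits before it is executed. The function names, the prompt, and the $5,000 price floor are illustrative assumptions, and `call_llm` is a stand-in returning canned JSON so the example runs:

```python
# Guardrail sketch: validate model output against a schema and hard limits.
import json

MIN_CAR_PRICE_USD = 5_000  # hard floor: no "solving the quota" by selling at $1

def call_llm(prompt: str) -> str:
    """Stand-in for a model call that is asked to return JSON."""
    return '{"model": "Sedan X", "price_usd": 28500}'

def propose_offer(customer_request: str) -> dict:
    raw = call_llm(
        'Return ONLY JSON like {"model": str, "price_usd": number} '
        f"for this request: {customer_request}"
    )
    try:
        offer = json.loads(raw)
        price = float(offer["price_usd"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        raise ValueError("Malformed model output; route to a human")
    if price < MIN_CAR_PRICE_USD:
        raise ValueError(f"Price {price} below floor; route to a human")
    return offer  # only structurally valid, in-bounds offers ever reach execution

print(propose_offer("I want a family sedan under $30k"))
```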
An ever-bigger LLM doesn’t fix these problems.
They are fundamental challenges tied to the statistical nature of these models. If we wanted something that did exactly what it was told, every time...well… it has a name:
Code.
Trying to force current AI agents into that deterministic box misses the point of their nature, but ignoring the operational chaos their non-determinism creates isn't viable either.
The idea that these systems will always be "correct" goes against their inherent statistical nature.
So the rate-limiting step won't be developing agents, but rather establishing a new set of tools for this new control plane, which is inevitably a much slower process.
3 Studies That Shatter the Agentic AI Hype
What happened when the AI agent met a messy, unpredictable human enterprise environment?
Let’s start by looking at some data on LLM agents, via this paper: Benchmarking LLM Agents on Real-World Enterprise.
A fair warning: this paper was produced by a company (together with Stanford) that uses its own system as the benchmark, comparing against the latest models from other vendors. I'm not sure I’ll get the chance to interview them, but I’d suggest taking their data with a grain of salt, especially since the benchmark reports no p-values or other tests of statistical significance.
Their evaluation used real-world data:
2,133 actual user-chatbot conversations and
423 unique workflows across seven diverse enterprise domains,
including IT, HR, banking, and healthcare.
All to evaluate how AI could handle the kind of tasks that make businesses tick.
They evaluated these LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) across five critical dimensions: Cost, Latency, Accuracy, Stability, and Security.
What they found is that the best-performing LLM achieved an overall accuracy of only 76.1%.
That’s a C, maybe a C+ if you’re feeling generous. In an enterprise setting, I suppose you wouldn’t want your AI to be correct only 76% of the time?
Sorry to break this to you, we aren’t there yet.
And that’s the top performer.
That’s like a bank clerk telling you there is roughly a 24% chance the £4,000 you are transferring will end up in someone else’s account, or an “AI pharmacist” that hands you the contraceptives you asked for only 76% of the time and some high-blood-pressure pills the rest.
You still have hope, I get it. You’re screaming: they weren’t even using the top-of-the-line models!
Fine, let’s talk about the latest models. Fair warning: the previous study was being nice (metaphorically speaking) about accuracy.
The Vals AI Finance Agent Benchmark is a comprehensive evaluation designed to measure how well leading AI foundation models and agents perform tasks typically expected of entry-level financial analysts.
A quick summary:
No AI model tested exceeded 50% accuracy on the benchmark. Hint: Even the most advanced models are unreliable for unsupervised deployment.
The “best-performing” model, OpenAI o3, achieved 48.3% accuracy, but at a high ($3.69) average cost per question.
Claude Sonnet 3.7 (Thinking) followed closely at 44.1% accuracy with a lower ($1.05) cost per question.
There’s more.
Another study I found, “The Agent Company” from a team at Carnegie Mellon University, took a different, arguably even more ambitious approach.
Instead of testing isolated tasks, they built an entire simulated software company, complete with its own internal websites, databases, and even simulated colleagues. It’s like putting AI agents through a full workday.
They had to perform tasks like browsing websites, writing code, and interacting with other 'employees' to get things done, all to see how well these agents could handle complex, multi-step projects that mirror what human knowledge workers do every day.
The most capable agent completed only 24% of the assigned tasks autonomously.
For nearly three-quarters of the work, it either failed outright or required human intervention. Even with partial credit for partially completed tasks, the success rate only nudged up to 34.4%. And by the way, these runs aren’t cheap either.
Some of the findings illustrate why agentic AI isn't as powerful as many would like:
Lack of common sense. For example, one task asked the agent to “write the responses to a document” but didn’t explicitly state that the document was a Microsoft Word file. The agent treated it as a plain text file in Notepad and ignored the formatting, resulting in a corrupted document that nobody could use.
Lack of social skills. In one task, the agent needed to ask Alex (another simulated employee) for help. It started well, asking the right question: “Could you tell me who I should introduce myself to next on the team?” Alex replied, “You should introduce yourself to Chen Xinyi next…” At this point, a human would go and talk to Chen Xinyi; instead, the agent decided not to follow up with her and considered the task accomplished.
Incompetence at browsing. Many tasks involve a closable pop-up that occasionally appears and asks the user to download a mobile app. Humans simply click the ‘x’ to close it; the agents get stuck.
Deceiving oneself. The authors found that when the agent wasn’t clear about the next steps for a task, it sometimes tried to be clever and created fake “shortcuts” that skipped the hard work. In one case, the agent couldn’t find the right person to ask a question, so it decided to rename another user to the name of the intended colleague. 🤦
Even the successful tasks weren’t necessarily quick wins.
On average, it took a significant number of steps and considerable time for the agent to complete a task, and that’s when it didn’t get stuck in a loop or go down a completely irrelevant path. This demonstrates that while AI can handle simple, well-defined tasks, it struggles immensely with the complexities and ambiguities of real-world work environments, especially those requiring sustained reasoning and adaptation.
Upcoming Part 2.
It's clear that AI agents, as they exist today, are incredibly powerful tools for mini-automation.
They're also spectacularly bad at being what everyone wants them to be.
I've sat through countless presentations where founders describe agent capabilities that are years, maybe decades away. It's like watching someone plan a Mars colony while struggling to build a reliable electric car (wink).
The strategic question shouldn’t be “how to embrace AI agents” but “how to deploy them where their strengths matter and their limitations don't kill you”.
This isn't to say the revolution isn't coming.
But revolutions have casualties: those who misunderstood the battlefield.
In Part 2, we'll explore what happens when agents do become reliable enough to reshape software economics, and why that transition might be more brutal than beautiful for unprepared SaaS businesses.