
AI Agents Problems No One Talks About.

Memory like goldfish, goals like broken GPS, and feedback loops that would make your worst coworker look brilliant.

Despite the bold proclamations from tech CEOs about AI agents "joining the workforce" and creating "multi-trillion-dollar opportunities," today's AI agents fall short of true autonomy.

If you missed part 1 of this series, catch up here:

What we actually have are sophisticated chatbots wrapped in fancy code. They follow scripts, not instincts. Feed them the same task twice and you might get completely different answers. Good luck tracing the logic when you try to debug.

The numbers tell the real story (Via this paper: Top of the Class: Benchmarking LLM Agents on Real-World Enterprise), and they're sobering:

  • Best models hit 76% accuracy on enterprise tasks (would you accept a 1-in-4 failure rate?)

  • Under 50% on financial analysis work (imagine explaining that to your CFO)

  • Only 24% task completion in realistic workplace simulations

I've been watching this space closely, and here's my take: AI agents are phenomenal brainstorming partners but terrible decision-makers.

Think of them like that brilliant but wildly unreliable colleague who generates amazing ideas at random in meetings. You'd never trust them to close the deal alone. They're input providers; you should own the output (and the outcome).

Instead of rushing to hand over the keys to AI agents, brilliant leaders like you ask the harder question: Where can these tools solve existing problems without creating chaos?

The revolution may be coming in the near future. But right now, those betting everything on autonomous AI agents are playing with fire.


The “Bright” Future of SaaS and AI Agents

Suppose we see agentic AI as a reasoning framework that can navigate ambiguity while keeping a coherent goal in mind.

In that case, here is what we can expect in the future:

  • It can break down high-level goals into executable sub-goals without human guidance (or hand-holding via system prompts), while maintaining goal consistency

  • It understands unspoken requirements and context: it is aware of the circumstances under which a request is made (and even who makes the request and what that implies)

  • It has not just raw memory storage, but the ability to evolve a conceptual understanding of domain models and stakeholders

  • It knows how to operate inside boundaries, when to proceed autonomously, and when to seek clarification

  • It can reason across multiple contexts, articulating (and making) decisions based on tradeoffs between technical, operational, business, user (and even ethical) dimensions

In practice, you'll likely have many of these trained on particular domain knowledge, potentially even down to a full agent model trained on your proprietary knowledge that has never been in the public domain.

I'll come back later to what needs to be true for those agents to exist.

Weekly contract: I dive deep, you click once. I'm starting to think I'm bad at negotiating.


From A Customer Perspective

Let's assume everything I’ve listed above becomes real, and consider what that means for the current crop of SaaS businesses.

We've all been told that a good SaaS business is one that uniquely solves a problem for a user. And the economic efficiencies sit in the software model, where marginal costs for new users are low.

The value exchange is your user renting a solution from you for a fraction of what it would cost to build and maintain an equivalent system in-house.

And it all gets hosted centrally in a multi-tenant setup, with a vendor-controlled update/improvement cycle, accessed over a public internet connection, taking the operational hassle of running live systems away.

SaaS applications generally fall into a few categories when you first start.

And as you expand, you might end up spanning a few.

So, just to help frame things a bit, these are the typical problems solved by SaaS tools, from the perspective of their customers.

I use at least 10 SaaS tools to run this newsletter: Google, Slack, Dropbox, Canva, 1Password, just to name a few.

That means I am paying for a lot of SaaS tools to hold fragmented views of my data.

I'm paying for my right to manage that data (through seat licenses), and/or for the amount of data they hold.

Here’s the thing: no matter how much AI each of them adds to their solution, they're not making my life much better.

Why?

Sure, Databricks' AI dashboard builder lets me ask for a pie chart rather than drag the widget around and configure it. Big whoop. Or Gemini now generates a document summarizing what we talked about and tries to figure out the action points.

Did your customers really need these additions, with their loose-to-nonexistent connection to the actual problem?

None of these hit the fundamental problem that Agents pose for them.

I (and likely your clients, too) want the agent to know everything, all at once.

I want the agent to be able to understand what George meant when he asked it to "prepare me for today's meeting".

Yes, it had a lot to do. In fact, it had to understand that:

  1. George works in Sales, from an HR system.

  2. Today's meeting, from a calendar event.

  3. The meeting is a Teams call, so it can make sure the meeting recorder joins.

  4. The meeting is with Jane from CocaCola, a customer whose data lives in a CRM.

  5. And that he needs the first sales call script & matching slides, from document storage.

Oh, and when Sally asks, "Prepare me for today's meeting," it's actually three engineers chatting about a problem they discussed on Slack a few days back.
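To make that concrete, here is a minimal sketch of that resolution step in Python. The hr, calendar, crm, and docs clients (and every method on them) are hypothetical stand-ins rather than real APIs; the point is simply how much cross-system context one innocuous request implies.

```python
from dataclasses import dataclass


@dataclass
class MeetingBrief:
    requester_role: str
    meeting: dict
    account: dict | None
    documents: list


def prepare_meeting_brief(user, hr, calendar, crm, docs) -> MeetingBrief:
    """Resolve "prepare me for today's meeting" across several systems."""
    role = hr.get_role(user)                     # e.g. "Sales" vs "Engineering"
    meeting = calendar.next_meeting_today(user)  # time, platform, attendees
    account, documents = None, []

    if role == "Sales":
        # An external attendee implies a customer call: pull the CRM record
        # and the matching sales collateral from document storage.
        external = [a for a in meeting["attendees"] if not a.endswith("@ourco.com")]
        if external:
            account = crm.lookup_account(external[0])
            documents = docs.find(tags=["first-sales-call", account["name"]])
        if meeting.get("platform") == "Teams":
            calendar.invite_recorder(meeting["id"])  # make sure the recorder joins
    else:
        # Sally's case: an internal sync, where the useful context is the
        # Slack thread the engineers were in a few days back.
        documents = docs.find(tags=["slack-thread", meeting["title"]])

    return MeetingBrief(role, meeting, account, documents)
```

The hard part isn't any single call; it's that the agent has to know which systems are even relevant, for which person, in which role.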

Of course, that's a stereotypical "Siri on steroids."

But you could apply that same lens to business processes. How often does a business process, from start to finish, live entirely in one SaaS tool?


When All SaaS Collapses

Let me entertain you a little and spell out the collapse of SaaS.

It'll happen slowly at first, and then suddenly.

While that UI might be a pleasant place for humans to get something done and be a differentiator, an agent doesn't really care that much.

Needing to click around because I can never remember where exactly that information lives in a system, or which system it even is? The agent's got it.

Need a full deep dive on a key customer activity? Have fun copy and pasting data from Zendesk, Salesforce, QuickBooks, product usage dashboards, call notes (if they exist at all...).

This might look like OpenAI's Operator using screen recognition to complete an action or pull some data. Never mind that you've lost the human interaction.

A breeze for an agent; it'll be on your desk in 10 minutes.

I'm sure you get the point by now.

The crux is that these future Agents hurt most SaaS players at their core.

The UX Sunset?

That's a world where SaaS no longer owns its UI/UX. What is worse, your beautifully researched and designed UI/UX becomes a burden compared to an API or MCP (or any interface optimized for data exchange between software systems rather than between software and humans).

A world where fixed automated processes and pre-defined dashboards are irrelevant, as agents can dynamically push and pull data and summarize.

All of this will lead to standardization.

The winners will initially be those who interact best with agents.

Rather than the current best-of-breed approach of selecting the best vendor in a category, we'll look for consolidated solutions that reduce friction for the agents. Then, Agents will push for open formats on data to further reduce friction and erroneous pathways in their execution.

With agents in place and sensible data formats, switching costs will drop, reducing SaaS lock-in severely.

Because ultimately, the flip side for most companies will be using genAI to create a number of in-house "mini-SaaS" tools and focusing on making sure their agents have a clean, well-maintained dataset to work with.

This will, incidentally, also make your clients the true owners of their data again, so the clients are freed from SaaS vendor opinions on what that data should look like.

Whether companies are ready for the operational fallout of running all this tech in-house and building crucial business processes on top of them is another matter, but it likely won't stop some of them from trying.

Not to mention… why would anyone pay a SaaS vendor for 100-seat licenses for the entire company if a single license can manage the vast majority of data input and retrieval? Of course, you will reconsider the whole business model, but this won’t be the most important problem by then.

IF all of this came true, then what I just described WILL undoubtedly hurt the pricing power of SaaS vendors.


How Far Are We From The Collapse Of SaaS?

Enough painting the apocalypse scene.

If you ask, "Are we there yet?"

I’ll tell you: not yet.

Seven critical pieces are missing to enable the agents of the future.

Problem 1: Agents Have Memory Like a Goldfish

It's like the movie The Notebook (2004): trying to have a relationship with a partner who forgets everything that happened more than three conversations ago. Every interaction starts fresh, with no memory of your shared history, inside jokes, or lessons learned from past mistakes.

That's exactly how today's AI agents operate.

At heart, an agent powered by a language model can only focus on what's in the context window. And while current models allow putting close to 200,000 words in that window (about an 800-page novel), there is fundamentally a limit.

An LLM doesn't need to prioritize, because it processes the entire context window with equal attention all the time. Context windows grow with every new model. However, growing windows bring another set of problems: models have a harder time figuring out what's important in all that input.

To get over this, we need processes that effectively synthesize information: figuring out what is and isn't important, resolving contradictions between old and new information as we collapse it, and even deciding what is worth remembering to begin with.
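A minimal sketch of what such a consolidation step could look like, assuming a generic summarize callable that wraps whatever model you use (the threshold and the prompt are invented for illustration):

```python
def consolidate_memory(turns, summarize, max_turns=20):
    """Collapse older conversation turns into one synthesized memory entry.

    `summarize` stands in for a call to whatever model you use; the prompt
    asks it to keep decisions and resolve contradictions, because that
    synthesis step (not raw storage) is the hard part.
    """
    if len(turns) <= max_turns:
        return turns

    old, recent = turns[:-max_turns], turns[-max_turns:]
    memory = summarize(
        "Condense these exchanges into durable facts and decisions. "
        "Where newer statements contradict older ones, keep the newer ones:\n"
        + "\n".join(old)
    )
    return ["[MEMORY] " + memory] + recent
```

Everything interesting hides inside summarize: deciding what counts as a durable fact, and what is safe to forget, is exactly the open problem.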

Problem 2: A Genuine Experience?

And once we have all that raw consolidated information, we need a way to make sure that each of those concrete experiences can be rolled up into a broader conceptual framework that applies across unseen scenarios.

Be aware that this isn't prompt engineering by loading up the context with more "memories". Rather, it fundamentally alters the paths through the model as a consequence of those memories.

And that is very different from any technical track we're on today.

It's NOT a bigger context window, a bigger model, or more RAG.

It needs to be a brand new class of storage that impacts the processing architecture.

It's a subtle nuance in language between "pathway from given information" and "pathway from experience".

It is a fundamentally different architecture, one that turns a static system with dynamic input into a dynamic system that adjusts itself. This sits at the boundaries of our current thinking, as we aren't just changing the parameters of the model, but the model's actual layers and code.

That would form the basis for augmenting the model beyond the basic limitations inherent in its fixed parameter space and the stateless world in which it operates today.

The neural networks of today have some emergent properties, but they are not able to rewire their computational ability. This is currently an area of (limited) fundamental research.

Problem 3: Agents Don’t Understand “Goals”

Next up, we'll look at the challenge of goal setting.

So, let’s say you tell your GPS to "get me somewhere nice for dinner." Instead of asking follow-up questions or understanding your actual intent, it starts generating turn-by-turn directions to every restaurant in a 50-mile radius simultaneously.

That's how current AI agents handle goal-setting.

Or more precisely, goal decomposition. Everything today is rooted in text generation. Even the "reasoning" models ultimately simply generate more text, and that text gets fed back into the attention mechanism of the transformer architecture.

It is a known problem that ever-larger inputs lead to several side effects of shifting attention.

Another is the well-documented lost-in-the-middle effect, which effectively means that text at the beginning or end of a prompt is more likely to get attention. So the idea of long-term goal permanence as a feature is not inherent in our current designs. Nor do systems have the capability to apply temporal reasoning.

The idea that a goal at time T1 is more or less relevant than a goal at T2 has no meaning to our current models.

It's all just input. Nor are our current systems capable of working out whether a change in goal is a bona fide adjustment or a corruption.

All because there is no true mechanism for representing goals (the final outcome), subgoals (the parts that make the final outcome) or meta-goals (goals about goals such as resource usage or safety constraints).

While any LLM will happily write out a set of subgoals, there is nothing that will make them inherently verify that these constitute a solution or know when a set is complete.

The idea that an agent will check an agent is bonkers!

There is computationally no difference between Agent A checking Agent B and prompting Agent A "are you sure?". It is simply processing a slab of text.

The goal/subgoal problem goes deeper.

There is no real hierarchy other than computed attention between words.

There's plenty of research to show that adding irrelevant information to a prompt often throws off the answer. A part of this ties in with these systems lacking ground truth. They don't really know if they're making progress towards a real-world impact. It's all based on exchanged words. We need ways of adding ground truth.

Granted, some of these we can work around with prompt hacking, but then we're back to brittle systems and prompt theater.

What is fundamentally needed is a means to reason about goals beyond, and in fact outside of, the transformer's attention system.
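One direction, purely illustrative and not something today's agent frameworks give you, is to hold the goal hierarchy in ordinary data structures outside the model, so that completeness and constraint checks are real code rather than another prompt:

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    description: str
    done: bool = False
    subgoals: list["Goal"] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)  # meta-goals: how, not what

    def complete(self) -> bool:
        # Completeness is checked in code, not by asking the model "are you sure?"
        return self.done and all(g.complete() for g in self.subgoals)


ship_report = Goal(
    "Ship the quarterly reporting feature",
    subgoals=[
        Goal("Write the aggregation query"),
        Goal("Add the dashboard view"),
        Goal("Update the customer docs"),
    ],
    constraints=["no schema changes without review", "stay within the staging budget"],
)

# The model can still propose subgoals as text, but progress and completeness
# live out here, where they survive between calls and are cheap to verify.
ship_report.subgoals[0].done = True
print(ship_report.complete())  # False: the parent and two subgoals remain open
```

This doesn't solve goal understanding, but it at least moves goal bookkeeping out of the attention window.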

Problem 4: The Agents Don’t Take Feedback

Think of the most frustrating coworker you've ever had, the one who never learns from mistakes, can't read the room, and keeps making the same errors despite clear feedback.

Now, imagine that coworker is running critical business processes.

We touched briefly on the emergent computational properties of a truly learning system when discussing the need for memory. Let's turn to where those memories might come from: integrating feedback from the environment.

One of the core capabilities of any agent would be separating signal from noise. For that, it needs the ability to ask:

  • When is a change in the environment meaningful?

  • What is cause and effect?

  • When another system doesn't react, is it because the system is gone or a temporary issue?

  • How does a system handle the reliability of a source it might never have seen or heard of before?

While software has a very standardized set of exceptions and status representations, we have no means of representing those in models. In fact, we lack a causal world model.

There is very early scientific research (late 2024) in this area on augmenting standard LLMs with Causal Representation Learning, which is showing promise for handling causality across longer planning horizons.

Today's systems can take feedback, but again, it all ultimately has to be fed in through the prompt.

Until we find mechanisms to navigate around this, and can classify our different inputs accordingly, we're going to keep building systems that feed the same single-channel bottleneck.
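As a hedged sketch of what classifying those inputs could mean in practice, here is a toy feedback classifier; the categories, status codes, and policy are assumptions for illustration, not any existing standard:

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    OK = "ok"                # the call succeeded, carry on
    TRANSIENT = "transient"  # timeout / 429 / 503: retry, don't re-plan
    PERMANENT = "permanent"  # 404 / 410: the world changed, re-plan
    AMBIGUOUS = "ambiguous"  # unknown source or odd signal: escalate to a human


@dataclass
class Feedback:
    source: str
    status: int
    kind: Kind


def classify(source, status, known_sources):
    """Turn a raw status from the environment into a typed signal.

    Today this classification mostly ends up as prose stuffed back into the
    prompt; keeping it structured lets the surrounding code decide whether
    to retry, re-plan, or escalate before the model ever sees it.
    """
    if source not in known_sources:
        kind = Kind.AMBIGUOUS      # never heard of this source, so don't trust it yet
    elif status < 400:
        kind = Kind.OK
    elif status in (408, 429, 503):
        kind = Kind.TRANSIENT      # probably a temporary issue, not a vanished system
    elif status in (404, 410):
        kind = Kind.PERMANENT      # the system (or the resource) really is gone
    else:
        kind = Kind.AMBIGUOUS
    return Feedback(source, status, kind)


print(classify("billing-api", 503, {"billing-api", "crm"}).kind)  # Kind.TRANSIENT
```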

Problem 5: A Big, Messy, Ugly Data Pool

While I've written a lot here about agents being excellent at pulling data together from different sources, they are also their own bottleneck.

Imagine you are trying to organize a potluck dinner where everyone calls the same dish by different names, measures ingredients in different units, and has completely different ideas about what constitutes "vegetarian."

That's the data integration nightmare AI agents face every day.

Different systems inevitably use different conceptual and data representations for the same real-world thing. Deduplicating complex entities is challenging, particularly if they represent different "views" on the same object and might, therefore, have different properties they capture.

This is in part an object mapping and standardization question, but also a much deeper one.

Not to mention the fact that differences in standards usually exist for a reason and a real use case.

It’s deeper because: What does "customer" really mean?

It's easy to see that in one system that might be a collection of "users", whereas in another it might be a primary entity with lots of associated metadata.

Our entire world of data is built from representations that mattered to the solution in question, not to a universal truth. Agents can overcome some of this (they are, after all, good at pulling data), but they can't overcome the implicit data design choices and the world models those choices represent.

Those different world models create downstream issues with propagating cause and effect.

Not to mention that we also need to plan for the fact that these models change over time as we add and remove elements.
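To make the "what does customer mean" point concrete, here is a deliberately toy reconciliation between two invented schemas; real entity resolution is far harder precisely because the properties don't line up one to one:

```python
# Two systems, two world models of the "same" customer.
billing_record = {"account_id": "A-991", "company": "CocaCola", "plan": "enterprise"}
support_record = {"org": "Coca-Cola", "users": ["jane@cocacola.com"], "tier": 1}


def unify_customer(billing, support):
    """Merge two views of a customer into one record.

    Even this toy version has to make judgement calls: whose spelling of the
    name wins, and does support "tier 1" mean the same thing as the billing
    "enterprise" plan? Those calls are exactly the implicit world models the
    text above describes.
    """
    return {
        "name": billing["company"],        # arbitrarily prefer billing's spelling
        "billing_account": billing["account_id"],
        "plan": billing["plan"],
        "support_tier": support["tier"],   # kept separate: not provably the same concept
        "known_users": support["users"],
    }


print(unify_customer(billing_record, support_record))
```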

Problem 6: Models Play the Telephone Game

Which also brings us to context sharing.

Imagine a game of telephone played by robots who speak different dialects and have amnesia. By the third handoff, your original message about "updating the user interface" has somehow become instructions to "liquidate the user database."

Since the systems are, at heart, stateless, we need a means of sharing context between them. It doesn't naturally transfer, and there is no agreed-upon exchange format.

There is also no "Agent API" so to speak, that would allow one agent to pass appropriate context to the next without anything being lost in translation (since sending and receiving human language really is a very imprecise exchange format).

Suppose you combine the lack of shared context with the ontology problems (what does “user” actually mean). In that case, it's clear we need some fundamental shifts in this space to truly unlock Agent capabilities.

Some of these challenges will be solved outside of "AI" and border in large part on our ability to represent knowledge. They will inevitably require us to rethink how we build and design our agents as well.
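There is no standard here today, so the schema below is purely a hypothetical illustration of handing over structured context between agents instead of free-form prose:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    """A hypothetical context envelope passed from one agent to the next."""
    task: str                 # what the receiving agent is asked to do
    entities: dict            # resolved references, e.g. which "customer" we mean
    constraints: list = field(default_factory=list)  # boundaries the next agent must respect
    provenance: list = field(default_factory=list)   # where each piece of context came from


handoff = Handoff(
    task="Draft the renewal email for the account below",
    entities={"customer": {"name": "CocaCola", "crm_id": "A-991"}},
    constraints=["no pricing commitments", "nothing is sent without human review"],
    provenance=["crm:A-991", "calendar:todays-meeting"],
)

# Serialized, both sides can validate fields instead of re-parsing prose, so the
# "user interface" vs "user database" failure shows up as a schema error, not an action.
print(json.dumps(asdict(handoff), indent=2))
```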

Problem 7: The Hallucination Roulette

I've alluded to this, but we do have the problem of nondeterminism as mentioned in part 1.

An agent will produce output when told to do so.

Whether that's hallucination, correct, or creative genius is really your interpretation.

The words produced are ultimately merely the result of a statistical system. And there are use cases where "hallucination" is a very desirable feature, for instance in DeepMind's protein-folding work, or in drafting marketing copy.

It's a less desirable feature when a transcript accidentally picks up "police" instead of "policy". A real example:

An engineering meeting summary became "discussing a recent event involving police action, possibly a stop or arrest, that the team refers to as a 'dot' in the discussion. It is implied that the police action was deemed successful in some way... The conversation lacks formal details but indicates a positive outcome following a police-related incident".

The actual discussion was a 2-minute conversation about a new policy that was successfully implemented for one of our notetaker bots.

Imagine the action points an agent would take from that.

If your group chat needs more substance and fewer cat videos, 👇 you know what to do!



So, SaaS is dead?

I'm aware that a large amount of commentary is aimed at exactly this question. For example:

I believe that SaaS as we know it will soon be on its way out—and AI agents are already rewriting the rules of marketing. — Forbes

The Reality Check: Hype vs. Progress

Some of the areas we need are actively being worked on.

And we're coming at them both with foundational scientific research and practical engineering "good enough" solutions.

At the same time, a large number of big voices in the market have their own, usually commercial, reasons for making statements that the media happily strips down into soundbites without any nuance.

From Sam Altman, who needs a story that fits the kind of money he hopes to raise, to Nadella, whose Microsoft released Magentic-One and holds a substantial stake in OpenAI, to every VC trying to get an LP to allocate money to their new AI fund, or even every founder who hopes to tap into some VC capital.

There's truth in all of those statements; there's also a lot of valuable nuance outside the hype train.

What’s Actually Here: Limited Automation, Not Full Replacement

I strongly suspect that a number of the problems we face can be addressed with engineering solutions, before we need some of the foundational breakthroughs.

With that, more importantly, we will eventually develop a tool stack with the scaffolding to reliably build and operate these things.

The big problem that will stay with us for truly broad autonomous agents ultimately boils down to alignment.

  • Can we trust these systems to consistently do the right thing?

  • Can we hold them accountable for their actions?

  • Can we create transparency in their training and operation?

Those are fundamental questions we will eventually need to address.

As more companies try to work with agents, the scope for misalignment grows. I believe that is a key reason we won't go to fully autonomous broad agents for a long time, if ever.

Instead.

We will start to see more limited use cases.

Places where a wrong decision perhaps doesn't matter that much, as long as we can spot it and recover. Or where the chain of activity is small enough so a single misstep doesn't spiral out of control.

As we've done before with RPC, SOAP, XML, and REST, we'll develop new ways of software communicating with other software.

Personally, I'm not convinced human (natural) language is a good API at all.

And don’t be confused by the Gibberlink party trick. It is not a fundamentally new “language for AI”. It’s a communication protocol that encodes the bits and bytes that represent characters that form words.

It simply is a less error-prone and more efficient form of transmitting data over an audio channel. Under the hood, it’s still plain old English.

We’ll see many situations where the human in the loop is augmented, rather than replaced, with a strong focus and incentive remaining on automation rather than true autonomy.

What This Means for SaaS Companies

All of which will lead to efficiencies and productivity gains for the individual.

Even though I said earlier that everyone could build their own mini-SaaS in-house, I believe just as many will not do so.

The complexity of running business-critical software (agents are, after all, software) is not to be underestimated, even assuming we can all get our hands on adequate hardware at a reasonable cost.

And ramping up the domain knowledge to train and maintain, say, a legal agent for your specific purpose is probably a step too far for most.

Instead, we'll see marketplaces appear where people sell pre-built, pre-trained agents. These will, of course, tie in to existing SaaS systems and ideally operate in a shared-knowledge system so they can work with each other.

This will be the trade-off between generically trained models and "just for you" models, which is where we ultimately need the memory breakthroughs I outlined earlier in order to move forward.

The Strategic Playbook:

So I don't think that, if you're in SaaS, you need to be an "AI company" (whatever that means) or "add AI" to your product tomorrow.

While great for marketing and storytelling, those moves don't go to the heart of the challenge.

I DO think, though, that you need to carefully consider your relationship with your customers' data, your place in your customers' ecosystem, and the added value you bring to that data.

The core mantra of solving a problem for your users, well, won't go away. However, solving it will look different, and there will always be money in integrating with others.

If anything, over the longer run, horizontal M&A that brings different data sources into one house and disintermediates some of the complexity is where I'd turn.

If your new users are going to be AI agents, make their lives easy, just as you've always tried for your users. This kind of M&A, with a focus on tight integration, could resolve a number of integration problems around authentication/authorization and data model inconsistencies.

What Does This Mean For The SaaS Business Model?

Which leaves us with the economic mismatch between seat licenses and Agentic operations.

The traditional logic for seat licensing is that a user does some activity from which they derive value. And therefore, as long as they use the software, they pay a fee. And more users = more value = more fees. With all the usual commercial variations and discounts for minimum commitment, volume, pre-pay, bolt-ons, ...

On the cost of goods sold side, each user creates a small marginal cost increase.

So, in theory, enormous unit margins exist, but there is a huge cost base between engineers and sales activity. This results in a land grab / winner-takes-all market dynamic. Because theoretically, at the limit, your net margin will converge to gross margin.

Agents, of course, have the exact opposite dynamic if you look at them as a "user". One agent is likely to cause a comparatively big increase in the cost of goods sold, while paying for only a single license.

Ask yourself this, though: are they providing one user’s worth of value to your customer?

Maybe it's time to rethink how we price SaaS products and home in on the actual value delivered, rather than taking a "user" as a proxy for value.

Fundamentally, cost structures will no longer scale with seats, but rather with consumption. We already have consumption pricing models (telco minutes, Dropbox storage, cloud-styled metered compute, ... ) which are detached from seats and aligned with underlying COGS.
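A toy calculation with invented numbers makes the mismatch plain (nothing here reflects any real vendor's pricing or costs):

```python
# Hypothetical numbers, purely to show the shape of the problem.
seats, seat_price, seat_cogs = 100, 30.0, 2.0     # per user, per month
agents, agent_price, agent_cogs = 1, 30.0, 180.0  # one "user", heavy compute and data access

seat_margin = seats * (seat_price - seat_cogs)      # 100 * 28  = 2800
agent_margin = agents * (agent_price - agent_cogs)  # 1 * -150  = -150

print(f"Seat model:        ${seat_margin:,.0f}/month gross margin")
print(f"Agent as a 'seat': ${agent_margin:,.0f}/month gross margin")
# Same list price, opposite economics: pricing has to follow consumption and
# delivered value, not the number of logins.
```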

What we don't understand yet is the consumption-value link when an Agent is in the mix, and whether the underlying COGS for an agent (both running the agent and accessing the data) are worth it.
