Most AI Agents Fail in Production Because They're Built Backwards

The agent system failed badly in production, and there was nothing spectacular about it. There was no crash. There was no error message. The system kept running and producing results that looked reasonable, until someone read them carefully and saw that something was off.
When we decided to look into it, it took us two days of troubleshooting to figure out what was going on. Funnily enough, the model was not inaccurate, and the tools were taking the right inputs and producing the correct outputs.
The problem, when we finally found it, was one of construction. The model and tools were set up correctly, but the design assumed that the model's reasoning would tie everything together, which, as you can guess, it did not.
It turns out that reasoning does no such thing.
That experience is what I keep coming back to when I think about why so many AI agents that work in demos don't hold up in real-world use.
It's not a capability problem.
It's a construction problem.
And if you've read my previous piece here on TDS, Why AI Developers Are Moving Beyond LangChain to Native Agent Architectures, the pattern should sound familiar: systems are built top-down, from goal to tool to model, with the tacit assumption that intelligent behavior fills in the gaps.
That assumption is what I mean by “backward construction.” And it's more common than most teams realize until something breaks.
Agents Are Not Entities. They Are Systems.
A production AI agent is not a single intelligent entity.
Instead, it is a set of interacting components, each with its own responsibilities, failure modes, and level of visibility.
The LLM is one of those parts, not the whole program. It's just one component.
It may sound obvious when you say it out loud. But the “autonomous agent” framing that dominated 2023 and most of 2024 keeps pulling developers into a different mental model: one entity, one reasoning loop, everything handled by the model.
All you need is the tools, a good system prompt, and the hope that everything will work out.
In contrast, developers who have shipped real AI-based products rarely describe their systems that way. What they describe sounds much more like the architecture of a distributed system.
Not because they've read a book about design patterns, but because they've been burned enough times that they started taking design seriously.
Building from the top down, starting with “what this agent needs to do” and working backwards to create tools and prompts, is a quick way to get started.
It's also how you end up with a system where the model is responsible for so much that nothing can be debugged in isolation.
The architecture was derived from the goal, not from engineering requirements.
That's the backwards part.
So What Really Goes into a Production System?
The abstract version is easy to nod along to. Here's what it actually looks like.
Every production AI system I've seen that works cleanly has something like a decision layer, regardless of whether the team named it that way or not. It is the part where the model lives and does its real work.
The instinct is to push everything into this layer: route requests, manage memory, handle retries, resolve tool failures.
This is fine if you are working in a Jupyter notebook. In production, under load, with real users, this becomes the part of your system where everything is everything's fault and, many times, nothing can be debugged.
A decision layer should do one thing well, and that is decide what to do next, given some context that has already been prepared for it.
That's the whole job.
Who prepares the context? Something else. Who takes action on the decision? Also something else.
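As a rough illustration of that boundary, here is a minimal sketch of a decision layer that only decides. The names `Context`, `Decision`, `decide`, and the `call_model` callable are all hypothetical, and the model is stubbed out, so treat this as one possible shape rather than a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Context:
    """Everything the decision layer may see, prepared by another component."""
    user_request: str
    available_actions: tuple[str, ...]


@dataclass(frozen=True)
class Decision:
    action: str
    reason: str


def decide(context: Context, call_model: Callable[[str], str]) -> Decision:
    """Ask the model for the next action and validate the answer. Nothing else."""
    prompt = (
        f"Request: {context.user_request}\n"
        f"Choose one action from: {', '.join(context.available_actions)}"
    )
    proposed = call_model(prompt).strip()
    if proposed not in context.available_actions:
        # No retry here: what happens next is the orchestration layer's call.
        return Decision("escalate", f"model proposed unknown action {proposed!r}")
    return Decision(proposed, "model selected a known action")


# Usage with a stand-in model (an assumption, not a real LLM client).
ctx = Context("Where is my order?", ("lookup_order", "issue_refund"))
print(decide(ctx, lambda prompt: "lookup_order"))
```

Because the function neither fetches context nor executes the action, it can be tested with a stubbed model and no network at all.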
That “something else” is the orchestration layer, and in most well-built systems, it's just ordinary code: conditionals, parallel runners, retries, queue handling, maybe even a state machine depending on how involved the workflow is.
Many teams reach for frameworks here because bare-bones orchestration code sounds too simple, like there should be more infrastructure.
Usually there doesn't need to be.
If this layer contains little magic, you will find bugs quickly when they appear. And they will appear.
I learned this the hard way on a project where the orchestration lived inside the framework's own execution model. Something was retrying tool calls in a way that corrupted the downstream state.
We spent two days trying to figure out the issue. Two days on a bug that could have been solved in an instant if the retry logic had been three lines of Python that I wrote myself.
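For a sense of scale, explicit retry logic can be about this small. This is a generic sketch, not the code from that project; `call_with_retry` and its parameters are names chosen here for illustration.

```python
import time


def call_with_retry(fn, attempts=3, delay_seconds=0.0):
    """Call fn(); on an exception, retry up to `attempts` times, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            time.sleep(delay_seconds)


# Usage: a stand-in tool call that fails twice before succeeding.
calls = {"count": 0}

def flaky_tool():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_retry(flaky_tool), "after", calls["count"], "calls")
```

When the retry loop is in plain sight like this, it is easy to see exactly how many times a tool can run, and therefore how many times any side effect can happen.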
This leads us to the tools and execution layer, which is where the system communicates with the outside world.
Each tool in this layer should have just one job, and that is to take a well-defined input and produce a predictable output.
But the failures I kept seeing, and kept repeating myself, honestly, were tools that tried to help by doing more than one thing. A single function that calls the API, updates the cache, and does other things besides.
In such a system, when something breaks, you don't know where. Even if you only try to swap out the API, you end up untangling logic that should never have been tangled up with it in the first place.
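One way to keep a tool to a single job is to let it do only the call, and to wrap caching around it from the outside. The sketch below assumes a hypothetical pricing endpoint (`/prices/<sku>`) and a stubbed `api_get` function; none of these names come from a real API.

```python
from typing import Callable, Dict


def fetch_price(api_get: Callable[[str], dict], sku: str) -> float:
    """Tool: one well-defined input, one predictable output. No cache, no extras."""
    payload = api_get(f"/prices/{sku}")  # hypothetical endpoint
    return float(payload["price"])


def cached(tool: Callable[[str], float], cache: Dict[str, float]) -> Callable[[str], float]:
    """Caching is a separate concern, added by the caller rather than the tool."""
    def wrapper(key: str) -> float:
        if key not in cache:
            cache[key] = tool(key)
        return cache[key]
    return wrapper


# Usage with a stand-in API (an assumption for illustration).
def fake_api_get(path: str) -> dict:
    return {"price": "19.99"}

price_cache: Dict[str, float] = {}
get_price = cached(lambda sku: fetch_price(fake_api_get, sku), price_cache)
print(get_price("A1"), price_cache)
```

Replacing the API now means changing `fetch_price` alone, and a wrong value can be traced to either the tool or the cache, not to both at once.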
Memory and state is where I'd push the hardest, because that's where most teams are least prepared.
Many teams think of memory as “what the model knows.” The more important question is what the system knows, and whether that information is current.
I remember one day when it took me an afternoon to fix what seemed to be a “hallucination.” The model kept referring to user preferences that had been updated twenty minutes earlier.
That's not a problem with the model.
That is a system problem.
And it's surprisingly common.
In multi-agent systems, in particular, shared state is where subtle failures arise. One agent updates something. Another doesn't know.
Each one carries on confidently from a slightly different picture of the world. The output looks almost right, which is almost worse than looking wrong.
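A common way to make this kind of staleness visible is to version every entry in the shared store and reject writes based on an outdated view. The sketch below is one minimal, in-memory take on that idea; `StateStore`, `Entry`, and `VersionConflict` are hypothetical names, and a real system would likely back this with a database.

```python
from dataclasses import dataclass
from typing import Any, Dict


class VersionConflict(Exception):
    """Raised when a writer's view of a key is out of date."""


@dataclass(frozen=True)
class Entry:
    value: Any
    version: int


class StateStore:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def read(self, key: str) -> Entry:
        return self._entries[key]

    def write(self, key: str, value: Any, expected_version: int) -> Entry:
        """Write only if the caller saw the latest version (0 means 'not yet set')."""
        current = self._entries.get(key)
        current_version = current.version if current else 0
        if expected_version != current_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, store has v{current_version}"
            )
        entry = Entry(value=value, version=current_version + 1)
        self._entries[key] = entry
        return entry

    def is_current(self, key: str, snapshot: Entry) -> bool:
        """True if a previously read snapshot is still the latest version."""
        return self._entries[key].version == snapshot.version


# Usage: agent A holds a snapshot while agent B updates the preferences.
store = StateStore()
store.write("prefs", {"lang": "en"}, expected_version=0)
snapshot = store.read("prefs")
store.write("prefs", {"lang": "de"}, expected_version=snapshot.version)
print(store.is_current("prefs", snapshot))
```

Checking `is_current` before a snapshot is placed into a prompt turns a silent "the model used old preferences" failure into an explicit, loggable event.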
And then there is monitoring and observability, which almost everyone puts off until something goes wrong. I have been guilty of this as well.
The distinction I keep in mind is that logging tells you what happened. Observability tells you whether what happened was right. In conventional software, those are close to the same thing.
In an AI system, not so. You should be able to follow a particular request from start to finish, including what information the model was given, what decision it made, which external API calls that decision triggered, and what the system did with the responses.
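A minimal way to get that end-to-end view is to attach one request ID to every stage and record each step as a structured event. The sketch below uses only the standard library; the `Trace` class and the stage names are assumptions made for illustration, not an established tracing API.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects structured events for a single request under one request ID."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, stage: str, **details) -> None:
        self.events.append(
            {"request_id": self.request_id, "stage": stage, "details": details}
        )

    def dump(self) -> str:
        """One JSON object per line, ready for a log pipeline."""
        return "\n".join(json.dumps(event, sort_keys=True) for event in self.events)


# Usage: the four things worth seeing for every request.
trace = Trace()
trace.record("context", prompt="Where is order 123?")
trace.record("decision", action="lookup_order")
trace.record("tool_call", tool="lookup_order", args={"order_id": "123"})
trace.record("tool_result", status="shipped")
print(trace.dump())
```

With events shaped like this, answering "what did the model see when it made that decision" becomes a filter on `request_id` rather than an afternoon of guesswork.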
Building It the Right Way
The top-down approach goes like this: I want the agent to do X, so I'll give it tools and a good system prompt, and if the model is smart enough, it'll be fine.
And this is exactly how people build prototypes, and why wouldn't they? They are not wrong.
But here's the thing: this approach treats architecture as a by-product of the goal rather than something you design on purpose.
Then the system grows: more tools, more workflows, more edge cases, more users, and suddenly there's no real foundation under any of it.
Going bottom-up takes more time, but it is far more stable.
You start with the basic building blocks and make sure they really work. Then you work out what each part should talk to, what data it owns, and what it is responsible for.
Ultimately, a system takes shape naturally from the interaction of its parts.
This is not a “real developers build everything from scratch” argument. It's not even about which tools you use, really. It's about the mental model you build with.
I've seen developers use complex frameworks and build clean systems because they understood what each layer was supposed to do.
I've seen developers write vanilla Python and create an undebuggable mess because they still thought “the agent decides everything.” Tools follow the model in your head, not the other way around.
The most robust multi-agent system I've had the opportunity to work with had almost no AI-specific infrastructure. When I first saw the repo, I honestly thought I was looking at the wrong codebase.
It had a message queue, worker processes with distinct scopes, a shared state store with clear read/write contracts, and a dispatcher that made the routing decisions.
The language model calls were made by the workers themselves, each receiving a context assembled upstream by a different component.
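To make that shape concrete, here is a deliberately tiny, single-process sketch of the same pieces. It is not that system's code: the in-memory `deque` stands in for a real message queue, a plain dict stands in for the state store, and `build_context`, `make_worker`, `dispatch`, and `run` are hypothetical names with a stubbed model.

```python
from collections import deque
from typing import Callable, Dict

ModelFn = Callable[[str], str]


def build_context(message: dict, state: dict) -> str:
    """Upstream step: assemble what the worker's model call will see."""
    return f"recent={state['history'][-3:]} text={message['text']}"


def make_worker(instruction: str, call_model: ModelFn) -> Callable[[str], str]:
    """A worker has one scope and makes its own model call."""
    def worker(context: str) -> str:
        return call_model(f"{instruction}\n{context}")
    return worker


def dispatch(message: dict) -> str:
    """Routing is plain code: the message type names the worker."""
    return message.get("type", "unknown")


def run(queue: deque, workers: Dict[str, Callable[[str], str]], state: dict) -> None:
    while queue:
        message = queue.popleft()
        worker = workers.get(dispatch(message))
        if worker is None:
            state["dead_letters"].append(message)  # Nothing is silently dropped.
            continue
        state["history"].append(worker(build_context(message, state)))


# Usage with a stand-in model that echoes the instruction line.
fake_model: ModelFn = lambda prompt: f"handled:{prompt.splitlines()[0]}"
workers = {
    "summarize": make_worker("Summarize", fake_model),
    "classify": make_worker("Classify", fake_model),
}
state = {"history": [], "dead_letters": []}
queue = deque([
    {"type": "summarize", "text": "a"},
    {"type": "translate", "text": "b"},
    {"type": "classify", "text": "c"},
])
run(queue, workers, state)
print(state)
```

Every decision about where a message goes, what the model sees, and where the result lands is an ordinary function that can be read and tested on its own.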
All in all, the whole thing was about a thousand lines of Python. I've seen demo agents with more code than that. Every part was traceable.
If something behaved unexpectedly, we usually found the problem in less than an hour because there was no magic to look through. Just plain code, in plain sight.
That system was built from the ground up. The goal was defined, but the structure was not derived from it. Components were designed first, tested independently, and then composed to produce the desired behavior. It is that ordering, components first and behavior second, that matters most.
Where I Think This Is Going
As far as I can tell, we are gradually moving away from “agent frameworks” and towards proper infrastructure, with test systems, routing models, control loops, and state management.
At least some of this already exists. More is yet to come as people solve the hard production problems in this space.
Something I see over and over again is that the people who build the most reliable systems rarely use the best models. What they have instead is a clear understanding of everything that is happening within their systems.
The model in such a system might be GPT-4, or it might be a small local model. It matters little as long as everything else is working well.
We are moving from treating the model as the product to treating the system as the product. The model is important, but it is only one part among many.
Most agents don't fail because the model wasn't good enough. They fail because the system around the model is designed backwards, starting with what the agent should do and assuming that the architecture will sort itself out.
It doesn't.
Building it right, components first, behavior second, is what separates systems that last from those that look impressive and don't.



