AI Agents Don’t Crash. They Lie.
AI agents don’t fail dramatically — they lie confidently. The real danger in production isn’t crashes, it’s silent plausible-sounding errors that compound before anyone catches them.
Traditional software fails loudly. An exception gets thrown. A stack trace lands in your logs. A red alert fires. You know something is wrong.
AI agents fail quietly.
An agent that’s gone wrong doesn’t crash. It confidently returns the wrong answer, silently posts to the wrong account, or goes dark mid-task while reporting success. By the time you notice, the damage is done. According to S&P Global, 42% of companies that launched AI initiatives in 2024 quietly abandoned most of them. Nobody gave a conference talk about it.
I’ve been running AI agents in production for a while now. Here are the failure modes that actually happen, not the ones that get written up in safety papers.
The Confident Wrong Answer
The first thing that breaks your intuition about AI agent failures: they look like success.
A normal software bug throws. An agent bug ships. The difference is that the model is always trying to be helpful: it will generate an output no matter what. When it doesn’t know something, it invents something plausible. When a tool call fails, it might silently skip it and synthesize an answer from its training data instead.
I’ve seen agents that were supposed to query a live database but were silently falling back to cached (stale) data. The outputs looked completely normal. The evals passed. Users were getting answers three months out of date and had no idea.
The fix isn’t just better error handling. It’s treating every agent output as unverified until proven otherwise.
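In practice, “unverified until proven otherwise” can start as simply as refusing to pass along data the agent can’t show is fresh. Here’s a minimal Python sketch; the `as_of` field and the 24-hour threshold are assumptions for illustration, not anything standard:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=24)

def verify_freshness(result: dict) -> dict:
    """Refuse tool results whose source data is older than the threshold.

    Assumes the tool attaches an ISO-8601 'as_of' timestamp to its payload;
    adapt the field name to whatever your tools actually emit.
    """
    as_of = datetime.fromisoformat(result["as_of"])
    if as_of.tzinfo is None:
        as_of = as_of.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    age = datetime.now(timezone.utc) - as_of
    if age > MAX_STALENESS:
        # Fail loudly instead of letting the agent synthesize an answer
        # from stale data and report success.
        raise ValueError(f"tool result is {age} old; refusing to use it")
    return result
```

The point isn’t the specific check; it’s that the pipeline fails loudly at the first sign of an answer it can’t vouch for.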
Silent Tool Call Failures
Tool calls, the actions agents take in the world, are where silent failures get dangerous.
An agent that can write emails, post content, update a database, or trigger a payment should terrify you slightly. When a tool call fails and the agent doesn’t properly surface that failure, it either retries infinitely, skips silently, or, worst of all, reports back “done” and moves on.
The meter event bug in our own billing code is a good example. The service was catching Stripe::StripeError and logging it, but the error was swallowed. The local counter incremented correctly. Stripe’s billing meter never received the event. Everything looked fine. We were quietly underbilling.
The rule of thumb: if a tool call touches money, data, or external systems, the agent needs to propagate the error up, not suppress it. No “fire and forget” without a retry path.
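As a sketch of what “propagate, don’t suppress” looks like, here’s a Python analogue of that billing path. `send_event` and `BillingError` are stand-ins for your actual SDK call and its exception class, not real Stripe APIs:

```python
import time

class BillingError(Exception):
    """Stand-in for whatever your billing SDK raises (e.g. Stripe's error class)."""

def report_meter_event(send_event, payload: dict, idempotency_key: str,
                       max_retries: int = 3) -> None:
    """Send a usage event with bounded retries, then fail loudly.

    The anti-pattern in our bug: catch the exception, log it, and return as
    if nothing happened. The local counter updates, the billing meter never
    hears about it, and everything looks fine.
    """
    for attempt in range(1, max_retries + 1):
        try:
            send_event(payload, idempotency_key=idempotency_key)  # safe to retry
            return
        except BillingError:
            if attempt == max_retries:
                raise  # propagate up -- never swallow a failed billing event
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
```

Bounded retries plus a final re-raise means the failure eventually lands in front of something that can alert on it, instead of dying in a log line nobody reads.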
The Drift You Don’t See Coming
Agents that run on schedules (hourly jobs, nightly digests, automated pipelines) develop drift over time. The world changes. The prompts don’t.
A nightly research agent that worked great in January might be pulling from outdated sources by March. A content agent that learned your voice from early drafts might have picked up a style drift you haven’t reviewed. An automation that classified emails correctly for two months might start misfiling as your inbox composition shifts.
The insidious part is that drift happens gradually. There’s no error. The outputs are good, just not as good as they were. And because there’s no crash, no one goes looking.
Scheduled evals are the only fix: not just running the agent, but checking whether the outputs are still correct. Most teams skip this. Almost all of them regret it.
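A scheduled eval doesn’t have to be elaborate. Here’s a minimal Python sketch: a fixed probe set run on a cron job, with an alert when the pass rate dips below a baseline. `run_agent`, `alert`, the probes, and the 90% baseline are all placeholders for your own setup:

```python
# Run a fixed set of probe tasks through the agent on a schedule and alert
# when the pass rate drops. The substring check is deliberately crude; swap
# in a proper grader if you have one.

PROBES = [
    {"input": "What is our current refund window?", "expected": "30 days"},
    {"input": "Which plan includes SSO?", "expected": "Enterprise"},
]
BASELINE_PASS_RATE = 0.9

def nightly_eval(run_agent, alert) -> float:
    passed = 0
    for probe in PROBES:
        answer = run_agent(probe["input"])
        if probe["expected"].lower() in answer.lower():
            passed += 1
    pass_rate = passed / len(PROBES)
    if pass_rate < BASELINE_PASS_RATE:
        alert(f"Agent eval pass rate dropped to {pass_rate:.0%}")
    return pass_rate
```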
The Cascade Problem
Agents that talk to other agents, or agents that trigger downstream systems, have a multiplication problem.
A 2% error rate sounds fine. Two agents in sequence at 2% each: ~4% compounded error. Add a third: ~6%. Add tool calls with their own failure rates. By the time you have a five-step pipeline, your effective reliability might be in the 80s even when every individual component looks healthy.
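The arithmetic is easy to check: end-to-end reliability is the product of per-step success rates. This back-of-the-envelope Python snippet uses the illustrative numbers above, with assumed tool-call failure rates added in:

```python
def pipeline_reliability(step_error_rates):
    """End-to-end success probability, assuming independent step failures."""
    reliability = 1.0
    for e in step_error_rates:
        reliability *= (1 - e)
    return reliability

print(pipeline_reliability([0.02, 0.02]))             # ~0.96 -> ~4% compounded error
print(pipeline_reliability([0.02, 0.02, 0.02]))       # ~0.94 -> ~6%
print(pipeline_reliability([0.02] * 5 + [0.03] * 3))  # 5 agent steps + 3 tool calls -> ~0.82
```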
The Forbes post on agentic AI production lessons put it well: evals need to happen at every step (task success checks, retrieval accuracy checks, tool-call correctness checks), not just end-to-end, because the place the failure happened and the place it surfaces are usually completely different.
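One way to get that step-level attribution is to give every pipeline step its own check and fail at that step, rather than letting the error surface three steps later. A rough sketch, with hypothetical step and check functions:

```python
def run_pipeline(task, steps):
    """Run `steps`, a list of (name, run, check) tuples.

    `check` validates the step's output before it is handed downstream, so a
    failure is attributed to the step where it actually occurred.
    """
    result = task
    for name, run, check in steps:
        result = run(result)
        if not check(result):
            raise RuntimeError(f"Step '{name}' produced an invalid result")
    return result
```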
What Good Agent Monitoring Looks Like
If traditional observability is “did this function throw?”, agent observability is “did this agent do the right thing?”
That’s a fundamentally harder question. The signals you actually need:
Behavioral diffs, not just logs. Sample outputs and compare to expected results. Flag when distributions shift. A coding agent that suddenly starts suggesting dramatically different patterns has probably drifted.
Tool call audits. Every tool call that touches external systems should be logged with inputs, outputs, and latency. Silence here is a red flag.
Confidence isn’t accuracy. Models that say “I’m confident” are not more likely to be right than models that hedge. Treat high-confidence outputs with the same skepticism as everything else.
Idempotency guards everywhere. If an agent action can run twice (webhooks, emails, billing events especially), it will eventually run twice. Build in checks before that day arrives; a sketch of what that check looks like follows below.
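Here’s roughly what that guard looks like, as a Python sketch. The key derivation and the in-memory `seen` set are illustrative only; in production the set would live in a database or cache:

```python
import hashlib
import json

seen: set[str] = set()

def run_once(action_name: str, params: dict, perform) -> bool:
    """Perform an action only if its (action, params) key hasn't been seen before."""
    key = hashlib.sha256(
        json.dumps({"action": action_name, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    if key in seen:
        return False  # already ran -- don't send the email or charge the card twice
    perform(params)
    seen.add(key)
    return True
```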
The Uncomfortable Truth
The reason AI agent failures are so hard to catch is that we’re pattern-matching from decades of traditional software monitoring, and those patterns don’t apply.
Crashes and stack traces are a gift. They tell you exactly what broke. Agents that fail silently are much harder to monitor because the signal you’re watching for is the absence of correctness, not the presence of an error.
The companies that are successfully running agents in production aren’t the ones with the most sophisticated AI. They’re the ones that built the most paranoid monitoring.
Build the monitoring before you need it. You’ll need it sooner than you think.