Why agents that look healthy in a playground go sideways the moment they meet real users, and what a useful observability stack for agent-shaped systems has to cover.
A single prompt-and-response in a notebook hides almost everything that will actually decide whether an agent ships. In production, each step (the LLM call, the tool invocation, the retrieval hop) is a span inside a larger trace: one unit of work with its own inputs, timing, and outputs. Most of those spans sit below the waterline of a traditional APM. If you own an LLM feature in production, this piece walks through six ideas about where the gap lives and what it takes to close it.
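The step-as-span model is easy to sketch without any vendor SDK. The hand-rolled version below is illustrative only (the class and field names are assumptions, not a real tracing API), but it shows why a trace, not a request, is the unit you budget against:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work inside an agent run: an LLM call, tool call, or retrieval hop."""
    name: str
    kind: str                      # "llm" | "tool" | "retrieval"
    duration_ms: float
    attributes: dict = field(default_factory=dict)

@dataclass
class Trace:
    """One inbound request, fanning out into many spans."""
    trace_id: str
    spans: list = field(default_factory=list)

    def total_latency_ms(self) -> float:
        return sum(s.duration_ms for s in self.spans)

    def total_tokens(self) -> int:
        return sum(s.attributes.get("tokens", 0) for s in self.spans)

trace = Trace("req-001", [
    Span("retrieve_docs", "retrieval", 120.0, {"chunks": 4}),
    Span("chat gpt-4", "llm", 900.0, {"tokens": 1850}),
    Span("search_orders", "tool", 300.0, {"http_status": 200}),
])
print(trace.total_latency_ms(), trace.total_tokens())  # 1320.0 1850
```

A traditional APM would show one 1.3-second request; the per-span view shows where the time and tokens actually went.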
In a playground, an agent looks like a single prompt and a single response. In production, that same agent fans out into retrieval, tool calls, retries, external APIs, and whatever guardrails sit around them. Each hop has its own latency, its own cost, and its own way of silently going wrong. Most teams inherited observability built for request-and-response web apps, a shape their agent outgrew long ago.
The happy path. One call, one outcome, exactly what you built for.
Agent failures rarely raise exceptions. Forrester catalogues these failure modes at length, and almost none of them surface in unit tests: a retrieval returns a plausible-but-wrong chunk, a tool call succeeds with an HTTP 200 on the wrong action, a reasoning loop burns tens of thousands of tokens before quitting. The five below are the ones most teams meet first, and the ones traditional logging is least equipped to spot.
Your RAG pipeline returns stale, wrong, or irrelevant chunks, and the model answers confidently anyway.
APM sees a successful vector query. It can't judge whether the retrieved context was actually right.
The agent picks the wrong tool, passes bad parameters, or the tool fails silently mid-chain.
HTTP 200 hides semantic errors. Logs don't know which tool the agent should have called, or whether a prompt-injection attempt just redirected it.
p50 looks fine. p99 is 12 seconds on the exact flows your power users hit most.
Request-level metrics average away the tail. Per-span latency across an agent chain is invisible.
A reasoning loop burns 40k tokens on a single request. One user racks up $200 before lunch.
Infra dashboards track CPU, not tokens. Cost per trace, per user, per model is not a native concept.
Same input, different output. Quality silently degrades after a model upgrade or prompt tweak.
There's no 'error' to log. You need evals and historical replay to catch behavioral regressions.
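Two of the failure modes above, the hidden p99 tail and the per-user cost runaway, fall straight out of span-level records once you have them. A minimal sketch, assuming span records shaped like the dicts below (field names and the flat token price are illustrative):

```python
from collections import defaultdict

# Illustrative span records, as a tracing backend might export them.
spans = [
    {"trace": "t1", "user": "u1", "step": "retrieve", "ms": 110,   "tokens": 0},
    {"trace": "t1", "user": "u1", "step": "llm",      "ms": 850,   "tokens": 1200},
    {"trace": "t2", "user": "u2", "step": "llm",      "ms": 12000, "tokens": 40000},  # runaway loop
]

def p99(values):
    """Nearest-rank p99. Request-level averages hide exactly this tail."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def cost_by_user(spans, usd_per_1k_tokens=0.01):  # assumed flat rate for the sketch
    totals = defaultdict(float)
    for s in spans:
        totals[s["user"]] += s["tokens"] / 1000 * usd_per_1k_tokens
    return dict(totals)

llm_ms = [s["ms"] for s in spans if s["step"] == "llm"]
print(p99(llm_ms))          # 12000
print(cost_by_user(spans))  # u2's single runaway trace dominates the spend
```

Neither number is derivable from request-level metrics: the p99 lives in one step of one flow, and the cost spike belongs to one user's trace, not the fleet average.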
APM and log aggregation were built around a deterministic request-response contract: a URL comes in, code runs once, a response goes out, a span closes. Agent execution breaks every one of those assumptions. It's non-deterministic (same input, different path), stateful (tool calls feed the next reasoning step), and multi-step (one inbound request can fan out into dozens of sub-calls). That's not because Datadog, New Relic, and Grafana stood still. It's because request-response dashboards were never the right abstraction for this shape, and the LLM-native tier is still being bolted on.
200 OK means the server responded. It does not mean the agent was right.
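One way to close that gap is to assert on the semantics of a tool call rather than its transport status. The guardrail below is a hypothetical sketch (the intent names, tool allowlist, and field names are all invented for illustration):

```python
# Hypothetical guardrail: the transport layer said 200 OK, so we check
# semantics instead. Was this tool allowed for this intent, and did the
# result actually match what was requested?
ALLOWED_TOOLS = {
    "refund_request": {"lookup_order", "issue_refund"},
    "order_status":   {"lookup_order"},
}

def check_tool_call(intent: str, tool: str, args: dict, result: dict) -> list:
    findings = []
    if tool not in ALLOWED_TOOLS.get(intent, set()):
        findings.append(f"policy: {tool!r} not allowed for intent {intent!r}")
    # Semantic check: a successful call on the wrong order is still a failure.
    if "order_id" in args and result.get("order_id") != args["order_id"]:
        findings.append("semantic: result order_id does not match requested order_id")
    return findings

# HTTP 200, wrong action: the agent issued a refund while answering a status question.
print(check_tool_call("order_status", "issue_refund",
                      {"order_id": "A17"}, {"order_id": "A17", "status": 200}))
```

This is also where prompt-injection detection lands in practice: an injected instruction shows up as a tool call the intent-level policy should have blocked.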
Most “LLM observability” stories collapse into a single feature: pretty traces, or eval dashboards, or token counters. A useful stack for agent-shaped systems has to do all seven at once (traces, evals, cost, latency, drift, replay, and guardrails), because the failure modes interact. A retrieval regression shows up as a drop in eval score, as a jump in retries, and as a cost spike in the same trace. A prompt injection shows up as a tool call the policy should have blocked. Miss any one of the seven and you're debugging blindfolded on the others. Each failure mode in Section 02 maps to one or two pillars below: the pillars are the instruments, the failures are what they detect.
Full reasoning path and tool chain for every agent run.
OTLP-native spans across every step, carrying gen_ai.operation.name, prompts, completions, and tool I/O in one view.
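Concretely, an LLM span's attribute payload might look like the dict below. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions; the model name, token counts, and per-1k prices are illustrative assumptions:

```python
# Sketch of the attributes an OTLP-native LLM span might carry, using
# attribute names from the OpenTelemetry GenAI semantic conventions.
llm_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 1420,
    "gen_ai.usage.output_tokens": 310,
}

def span_cost_usd(attrs, in_per_1k=0.0025, out_per_1k=0.01):  # assumed prices
    """Per-span cost attribution: tokens are on the span, so cost is too."""
    return (attrs["gen_ai.usage.input_tokens"] / 1000 * in_per_1k
            + attrs["gen_ai.usage.output_tokens"] / 1000 * out_per_1k)

print(round(span_cost_usd(llm_span_attributes), 5))
```

Because the token counts ride on the span itself, cost per trace, per user, and per model becomes a simple aggregation instead of a reconstruction.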
Use this as a gut check for your team, not a lead magnet. There's no submit and no signup. Five yes/no questions, each tied to one of the pillars above. The score at the bottom is a rough rubric for where your observability coverage actually sits today, and which layer is likely to be the next thing to bite.
Without a full trace, debugging is guesswork and fixes are gambles.
Retrieval, tool calls, and prompts all fail differently. Trace plus replay lets you pinpoint which step broke and re-run it.
One slow tool or one runaway loop can tank UX and burn budget. Per-span visibility prevents both.
Prompts and models change. Evals and replay tell you whether quality held or slipped.
Silent degradation is the most expensive bug. Drift detection turns it into an alert, not a postmortem.
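Turning drift into an alert can be as simple as comparing a recent window of eval scores against a baseline window. The sketch below is a minimal version; the window sizes and drop threshold are illustrative choices, not a standard:

```python
def drift_alert(scores, baseline_n=5, window_n=3, drop_threshold=0.1):
    """Alert when the recent mean eval score drops more than `drop_threshold`
    below the baseline mean. All thresholds here are illustrative."""
    if len(scores) < baseline_n + window_n:
        return False
    baseline = sum(scores[:baseline_n]) / baseline_n
    recent = sum(scores[-window_n:]) / window_n
    return (baseline - recent) > drop_threshold

# Quality held for five runs, then slipped after a prompt tweak.
history = [0.92, 0.90, 0.91, 0.93, 0.92, 0.78, 0.75, 0.77]
print(drift_alert(history))  # True
```

The point is not the arithmetic but the prerequisite: there is no 'error' to log here, so you need a quality signal (evals) before drift can be anything other than a postmortem.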
Answer the five checks above.
Traces, evals, cost, latency, drift, and replay are all in place. You can debug on evidence.
You can debug most failures, but one or two classes still rely on guesswork.
Expect regressions, silent cost leaks, and long debugging loops until the basics land.
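For teams that want to script the gut check, the three tiers above map naturally onto a yes-count. The cutoffs below (5, 3-4, 0-2) are an assumed reading of the rubric, since the article does not state numeric thresholds:

```python
def coverage_tier(answers):
    """Map five yes/no checks onto three coverage tiers.
    Cutoffs (5 / 3-4 / 0-2) are an assumed reading of the rubric."""
    score = sum(answers)
    if score == 5:
        return "full coverage: debug on evidence"
    if score >= 3:
        return "partial: one or two failure classes still rely on guesswork"
    return "basics missing: expect regressions and silent cost leaks"

print(coverage_tier([True, True, True, False, True]))  # 4 yes -> partial tier
```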
Where peers land: LangChain's 2025 survey of 1,340 teams found 89% have stood up some observability, 62% have step-level tracing, and 37% run online evals. The bar is rising.
You've taken the gut check. Wherever you landed, the useful next step is the same: instrument spans first, attribute per-step cost and latency second, then layer evals and drift. Ordered by dependency, not preference. The first step closes more blind spots than any other, and every later layer needs the one before it to mean anything.
Close the retrieval, tool, and cost blind spots in a week.
Turn spans into per-step numbers you can budget.
Make drift the alert, not the postmortem.
The order matters more than the timeline. Week 1 is an afternoon on the Microsoft Agent Framework (built-in GenAI-conventions instrumentation), a sprint on Semantic Kernel, or a week on a custom loop with OpenLLMetry. What stays fixed is the dependency chain: you can't budget cost per step without per-step spans, and you can't baseline drift without a quality signal to baseline.
To close, here's a short walkthrough of a trace-native view: every step, tool call, retrieval hop, and token spent on a single agent run. It grounds everything above in something concrete.