Explainer series · No. 1 · Agent Observability

The Agent Observability Gap

Why agents that look healthy in a playground go sideways the moment they meet real users, and what a useful observability stack for agent-shaped systems has to cover.

~8 min read · 6 interactive modules · Updated Apr 22, 2026

A single prompt-and-response in a notebook hides almost everything that will actually decide whether an agent ships. In production, each step (the LLM call, the tool invocation, the retrieval hop) is a span inside a larger trace: one unit of work with its own inputs, timing, and outputs. Most of those spans sit below the waterline of a traditional APM. If you own an LLM feature in production, this piece walks through six ideas about where the gap lives and what it takes to close it.
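The trace-and-span shape described above can be sketched in a few lines. The span names, fields, and numbers below are illustrative, not any vendor's schema:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work inside an agent run: an LLM call, tool invocation, or retrieval hop."""
    name: str
    inputs: dict
    outputs: dict = field(default_factory=dict)
    duration_ms: float = 0.0
    start: float = field(default_factory=time.monotonic)

@dataclass
class Trace:
    """A single agent run: one inbound request fanning out into many spans."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, name, inputs, outputs, duration_ms):
        self.spans.append(Span(name, inputs, outputs, duration_ms))
        return self.spans[-1]

# One production-shaped run: a playground view shows only the llm.call span.
trace = Trace()
trace.record("retrieval.vector_search", {"query": "refund policy"}, {"chunks": 4}, 520.0)
trace.record("llm.call", {"prompt_tokens": 812}, {"completion_tokens": 240}, 1200.0)
trace.record("tool.update_record", {"id": "A-17"}, {"status": 200}, 310.0)
```

The point of the shape: every hop carries its own timing, inputs, and outputs, so nothing important lives only inside the model call.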

01 · Playground vs Production

The same agent lives in two very different universes.

In a playground, an agent looks like a single prompt and a single response. In production, that same agent fans out into retrieval, tool calls, retries, external APIs, and whatever guardrails sit around them. Each hop has its own latency, its own cost, and its own way of silently going wrong. Most teams inherited observability built for request-and-response web apps, a shape their agent outgrew long ago.

So what does that actually look like when you flip the same agent from a playground demo into a live production run?
What you build for · 1 call
Playground

One prompt. One response.
All green.

User prompt
LLM call
Response
1.2s · $0.003 · OK

The happy path. One call, one outcome, exactly what you built for.

Production trace
trace_id · 9a1f…c4e0

Seven spans. Four silent failures. Almost none raise an exception.

Inspector · Retrieval
520ms · span #2

APM sees a successful vector query. It can't judge whether the retrieved context was actually right.

timeline · spans · 2.3s total
p99 8.4s · Cost 140× playground · Drift detected · Retrieval relevance 0.31
Takeaway: If you only test an agent through the playground, you're measuring the happy path. The graph underneath is what decides whether it survives real traffic.
02 · Failure Modes

Agentic failures don’t look like bugs.

Agent failures rarely raise exceptions. Forrester catalogues them, and almost none surface in unit tests: a retrieval returns a plausible-but-wrong chunk, a tool call succeeds with an HTTP 200 on the wrong action, a reasoning loop burns tens of thousands of tokens before quitting. The five below are the ones most teams meet first, and the ones traditional logging is least equipped to spot.

Which five, specifically, and what would actually catch each one before it reaches a user?

Retrieval Failure

What it looks like

Your RAG pipeline returns stale, wrong, or irrelevant chunks, and the model answers confidently anyway.

Why traditional tools miss it

APM sees a successful vector query. It can't judge whether the retrieved context was actually right.


Tool-Call Failure

What it looks like

The agent picks the wrong tool, passes bad parameters, or the tool fails silently mid-chain.

Why traditional tools miss it

HTTP 200 hides semantic errors. Logs don't know which tool the agent should have called, or whether a prompt-injection attempt just redirected it.

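A sketch of what catching this looks like: a semantic check over a finished tool span, comparing what the agent intended against what actually happened. The span fields (`intended_effect`, `allowed_tools`) are hypothetical names for illustration, not a standard schema:

```python
def check_tool_span(span: dict) -> list:
    """Flag semantic failures that an HTTP 200 hides.
    Assumes the span carries the tool name, status, the effect the agent
    intended, and the effect observed (illustrative fields)."""
    problems = []
    # Green status code, wrong action: the core silent failure.
    if span["status"] == 200 and span.get("effect") != span.get("intended_effect"):
        problems.append("status 200 but observed effect differs from intended effect")
    # A prompt injection often shows up as a tool outside the allowed set.
    if span["tool"] not in span.get("allowed_tools", [span["tool"]]):
        problems.append("tool %r not in the policy's allowed set" % span["tool"])
    return problems

# A tool call that 'succeeded' on the wrong record: green status, wrong action.
span = {
    "tool": "update_record",
    "status": 200,
    "intended_effect": {"record": "A-17", "field": "email"},
    "effect": {"record": "B-03", "field": "email"},
    "allowed_tools": ["update_record", "search_docs"],
}
assert check_tool_span(span) == [
    "status 200 but observed effect differs from intended effect"
]
```

The check only works because the span recorded intent and effect separately; a log line that stores just the HTTP status has nothing to compare.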

Latency Spikes

What it looks like

p50 looks fine. p99 is 12 seconds on the exact flows your power users hit most.

Why traditional tools miss it

Request-level metrics average away the tail. Per-span latency across an agent chain is invisible.

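What per-span tail latency looks like in miniature, assuming spans arrive as (name, milliseconds) pairs. This uses a rough nearest-rank percentile, not a production-grade estimator:

```python
from collections import defaultdict

def per_span_percentiles(spans, q=0.99):
    """Group latencies by span name and take the q-th percentile per group,
    instead of averaging at the request level (which hides the tail)."""
    by_name = defaultdict(list)
    for name, ms in spans:
        by_name[name].append(ms)
    out = {}
    for name, values in by_name.items():
        values.sort()
        idx = min(len(values) - 1, int(q * len(values)))  # nearest-rank index
        out[name] = values[idx]
    return out

# One slow reasoning loop dominates the tail of a single span type.
spans = [("llm.reasoning", ms) for ms in [900, 950, 1000, 12000]] + \
        [("tool.search_docs", ms) for ms in [80, 90, 95, 110]]
p99 = per_span_percentiles(spans)
```

Here `p99["llm.reasoning"]` surfaces the 12-second outlier that a request-level mean over all eight samples would wash out.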

Cost Blowouts

What it looks like

A reasoning loop burns 40k tokens on a single request. One user racks up $200 before lunch.

Why traditional tools miss it

Infra dashboards track CPU, not tokens. Cost per trace, per user, per model is not a native concept.

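A minimal sketch of cost-per-trace accounting. The model name and per-1k-token prices below are invented for illustration; real prices vary by model and provider:

```python
# Illustrative (input, output) USD prices per 1k tokens -- not real pricing.
PRICES = {"gpt-large": (0.01, 0.03)}

def trace_cost(spans, prices=PRICES):
    """Sum token cost across every LLM span in one trace."""
    total = 0.0
    for s in spans:
        inp, out = prices[s["model"]]
        total += s["prompt_tokens"] / 1000 * inp + s["completion_tokens"] / 1000 * out
    return round(total, 4)

# A 5-iteration reasoning loop: 40k prompt tokens on a single request.
loop = [{"model": "gpt-large", "prompt_tokens": 8000, "completion_tokens": 2000}] * 5
```

Attributing the sum to a trace (and, via span tags, to a user or tenant) is what turns "our bill went up" into "this loop on this flow is burning budget."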

Output Drift

What it looks like

Same input, different output. Quality silently degrades after a model upgrade or prompt tweak.

Why traditional tools miss it

There's no 'error' to log. You need evals and historical replay to catch behavioral regressions.

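One way to turn drift into an alert rather than a postmortem: compare the current eval-score distribution against a baseline window. A rough heuristic sketch, not a formal statistical test:

```python
import statistics

def drift_alert(baseline, current, min_drop=0.05, z=2.0):
    """Flag a behavioral regression when the current eval-score mean drops
    below the baseline mean by more than `min_drop`, and the drop is large
    relative to the baseline's noise."""
    b_mean, c_mean = statistics.mean(baseline), statistics.mean(current)
    b_sd = statistics.stdev(baseline) or 1e-9  # guard a perfectly flat baseline
    drop = b_mean - c_mean
    return drop > min_drop and drop / b_sd > z

baseline = [0.82, 0.80, 0.84, 0.81, 0.83]  # eval scores before a model upgrade
current  = [0.71, 0.69, 0.73, 0.70, 0.72]  # same inputs replayed after it
```

With these numbers `drift_alert(baseline, current)` fires: the mean fell by 0.11, far outside the baseline's spread, even though no request ever errored.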
Takeaway: A green status code is not a signal of correctness. Catching these five requires per-step visibility, not just per-request.
03 · Logging

APM built for request-response has to be retrofitted for agents.

APM and log aggregation were built around a deterministic request-response contract: a URL comes in, code runs once, a response goes out, a span closes. Agent execution breaks every one of those assumptions. It's non-deterministic (same input, different path), stateful (tool calls feed the next reasoning step), and multi-step (one inbound request can fan out into dozens of sub-calls). That's not because Datadog, New Relic, and Grafana stood still. It's because request-response dashboards were never the right abstraction for this shape, and the LLM-native tier is still being bolted on.

How much of a single agent run does each tier of tooling actually see?
One agent run · span depth
Each lane is a span. Shaded = visible to Traditional APM.
POST /v1/agent
retrieval.vector_search
tool.search_docs
llm.reasoning (plan)
tool.update_record (200, silent err)
llm.reasoning (retry loop)
eval.llm_as_judge
0ms · 575ms · 1150ms · 1725ms · 2300ms
Captures
  • HTTP request & response
  • Top-level status codes
  • Process-level CPU & memory
Blind to
  • Per-step reasoning
  • Tool inputs / outputs / decisions
  • Silent semantic failures on HTTP 200

200 OK means the server responded. It does not mean the agent was right.

Takeaway: Stack traces answer “what crashed?” Agents need an answer to “what did it decide, with what context, and why?”
04 · The Seven Pillars

A working observability stack for agents covers seven layers, not one or two.

Most “LLM observability” stories collapse into a single feature: pretty traces, or eval dashboards, or token counters. A useful stack for agent-shaped systems has to do all seven at once (traces, evals, cost, latency, drift, replay, and guardrails), because the failure modes interact. A retrieval regression shows up as a drop in eval score, as a jump in retries, and as a cost spike in the same trace. A prompt injection shows up as a tool call the policy should have blocked. Miss any one of the seven and you're debugging blindfolded on the others. Each failure mode in Section 02 maps to one or two pillars below: the pillars are the instruments, the failures are what they detect.

What does each of those seven layers actually do, and why do they only work together?
Pillar 1 / 7

Traces

Full reasoning path and tool chain for every agent run.

What this looks like in practice

OTLP-native spans across every step, carrying gen_ai.operation.name, prompts, completions, and tool I/O in one view.
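As a sketch, here are the attributes one LLM span might carry, loosely following the OpenTelemetry GenAI semantic conventions. Exact attribute names and prompt-capture rules vary by convention version, and the model name and values are illustrative:

```python
def llm_span_attributes(model, prompt, completion, usage):
    """Build the attribute dict for one LLM span. Keys loosely follow the
    OpenTelemetry GenAI semantic conventions; treat them as illustrative."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": usage["input"],
        "gen_ai.usage.output_tokens": usage["output"],
        # Prompt/completion capture is usually opt-in, since spans can
        # carry sensitive user content.
        "gen_ai.prompt": prompt,
        "gen_ai.completion": completion,
    }

attrs = llm_span_attributes(
    "gpt-large",                      # hypothetical model name
    "Summarize this support ticket",
    "The customer reports ...",
    {"input": 812, "output": 240},
)
```

The value of putting all of this on one span is that a single trace query can answer "which model, at what cost, given which context, said what."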

Takeaway: Traces without evals tell you what happened, not whether it was right. Evals without cost and latency tell you what's right, not whether you can afford to ship it. Everything without guardrails is one prompt-injection away from a headline. All seven together is what lets a team run agents in production with confidence.
05 · Self-Assessment

Is your LLM app production-ready?

Use this as a gut check for your team, not a lead magnet. There's no submit and no signup. Five yes/no questions, each tied to one of the pillars above. The score at the bottom is a rough rubric for where your observability coverage actually sits today, and which layer is likely to be the next thing to bite.

So where does your team sit today, and which layer is the next one to bite?
Can you trace every agent step?

Without a full trace, debugging is guesswork and fixes are gambles.

Can you see exactly where it fails?

Retrieval, tool calls, and prompts all fail differently. Trace plus replay lets you pinpoint which step broke and re-run it.

Can you inspect latency and cost per step?

One slow tool or one runaway loop can tank UX and burn budget. Per-span visibility prevents both.

Can you compare runs over time?

Prompts and models change. Evals and replay tell you whether quality held or slipped.

Can you catch drift before your users do?

Silent degradation is the most expensive bug. Drift detection turns it into an alert, not a postmortem.


Maturity rubric
  1. 5 / 5
    Mature coverage

    Traces, evals, cost, latency, drift, and replay are all in place. You can debug on evidence.

  2. 3 or 4
    Partial coverage

    You can debug most failures, but one or two classes still rely on guesswork.

  3. 0 to 2
    Building from scratch

    Expect regressions, silent cost leaks, and long debugging loops until the basics land.

Where peers land: LangChain's 2025 survey of 1,340 teams found 89% have stood up some observability, 62% have step-level tracing, and 37% run online evals. The bar is rising.

Takeaway: If you answered “no” to three or more, the gap isn't a tooling preference anymore, it's a reliability risk you're already paying for.
06 · Where to Start

You don’t have to land seven pillars at once.

You've taken the gut check. Wherever you landed, the useful next step is the same: instrument spans first, attribute per-step cost and latency second, then layer evals and drift. That order reflects dependency, not preference: the first step closes more blind spots than any other, and every later layer needs the one before it to mean anything.

What does that actually look like across a week, a month, a quarter?
  1. 01 · Week 1

    Make the graph visible

    Close the retrieval, tool, and cost blind spots in a week.

    • Emit OTLP spans around every LLM call and every tool call.
    • Capture prompt, response, tool name, arguments, and status on each span.
    • Ship one dashboard that lists the slowest ten traces this hour.
  2. 02 · Month 1

    Attribute latency and cost

    Turn spans into per-step numbers you can budget.

    • Tag spans with model, tenant, and app for dimensional breakdowns.
    • Track token cost on every LLM span; alert on spikes, not totals.
    • Break latency into p50, p95, and p99 per span, not per request.
    builds on Week 1 spans
  3. 03 · Month 3

    Catch quality regressions

    Make drift the alert, not the postmortem.

    • Build a small golden dataset, calibrate one judge against human labels until agreement lands near Cohen's kappa 0.6, then run it on every deploy.
    • Track score distributions over time; flag statistically meaningful drops.
    • Wire replay: pick any failed trace and re-run a step with edits.
    builds on Month 1 attribution
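The judge-calibration step in Month 3 is, at its core, an agreement calculation. A self-contained sketch of Cohen's kappa over binary pass/fail labels (the label sequences below are invented for illustration):

```python
def cohens_kappa(human, judge):
    """Agreement between human labels and an LLM judge, corrected for chance.
    Assumes two equal-length label sequences (here: 'pass' / 'fail')."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(human) | set(judge)
    chance = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    return (observed - chance) / (1 - chance)

human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(human, judge)
```

For these labels the judge agrees with humans on 6 of 8 items, but kappa lands around 0.47, below the ~0.6 bar: a reminder that raw agreement overstates a judge's quality, which is exactly why the calibration loop iterates on the judge prompt before trusting it on every deploy.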

The order matters more than the timeline. Week 1 is an afternoon on the Microsoft Agent Framework (built-in GenAI-conventions instrumentation), a sprint on Semantic Kernel, or a week on a custom loop with OpenLLMetry. What stays fixed is the dependency chain: you can't budget cost per step without per-step spans, and you can't baseline drift without a quality signal to baseline.

TakeawayThe Week 1 plan fits on a sticky note: one LLM span, one tool span, one dashboard. Start there, not with a tool-selection debate.
Product walkthrough · 45 seconds

What a full agent trace looks like

To close, here's a short walkthrough of a trace-native view: every step, tool call, retrieval hop, and token spent on a single agent run. It grounds everything above in something concrete.

telerik.com/ai-observability-platform
Trace-native interface · product walkthrough · youtube/7DyZbg5hzw4