Essay · No. 2

Field notes · No. 2 · Agent observability

Your AI agent has the same red flags as your ex.

It lies, ignores what you asked for, runs up your bills, and gets weird ideas from something a “friend” said. The one upgrade: this one leaves a trace.

~6 min read · 5 red flags · Real .NET traces · Updated May 22, 2026

You shipped an AI agent. Congratulations: you're in a relationship now. Flawless in the demo, attentive on day one. Then three weeks into production you're staring at a bill you don't recognise, a customer's deleted account, and a “summary” of a conversation that never happened.

The uncomfortable part: your agent has the exact same red flags as your worst ex. The only difference is that with the right setup, it can't hide them. Your ex left you reconstructing the truth from a bank statement and a friend's offhand comment. Your agent leaves a trace. To prove it, we built a deliberately toxic support agent: ContosoSupportAgent, built on Microsoft Agent Framework and instrumented with one line of UseOpenTelemetry(). We made it do every awful thing your ex ever did, then read the receipts. Span by span.

01 · Red flag

They won't let it go

Your ex

Texted “u up?” eight times. Same message. No new information. Just again.

Your agent

Asked to look up a customer, decided the answer wasn't good enough, and called find_user again. Then again. Eight times. Byte-identical arguments every time.

In your application logs this is one request and one polite “Sorry, I couldn't find that.” Totally innocent. Meanwhile the meter ran all night.

What the trace shows

Eight stacked tool.find_user bars on a single trace. Group spans by tool.name and the count is 8; hover any one and tool.arguments is identical across all of them. The loop isn't something you grep for. It's a shape you can see.

trace.json · flagged

span: tool.find_user · ×8

tool.arguments: {"email":"jane@acme.io"} (identical)

group_by(tool.name).count: 8 ← loop

Your ex's obsessive texting cost you sleep. Your agent's costs you tokens, and tokens have a number on them.
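If you would rather have a script spot the shape than a human, the grouping described above fits in a few lines. This is a minimal sketch, not the product's query language: it assumes spans have been exported as plain dictionaries carrying the tool.name and tool.arguments attributes from the trace above, and find_tool_loops and its threshold are hypothetical names chosen for illustration.

```python
import json
from collections import Counter


def find_tool_loops(spans, threshold=3):
    """Return {(tool_name, canonical_args): count} for calls repeated >= threshold times.

    Assumption: each span is a dict with "tool.name" and "tool.arguments" keys.
    """
    counts = Counter()
    for span in spans:
        name = span.get("tool.name")
        if name is None:
            continue  # not a tool span
        # Canonicalise the arguments so key order cannot hide a duplicate call.
        args = json.dumps(span.get("tool.arguments", {}), sort_keys=True)
        counts[(name, args)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical export: eight identical find_user calls plus one unrelated call.
spans = [
    {"tool.name": "find_user", "tool.arguments": {"email": "jane@acme.io"}}
    for _ in range(8)
]
spans.append({"tool.name": "send_email", "tool.arguments": {"to": "jane@acme.io"}})

loops = find_tool_loops(spans)
print(loops)  # {('find_user', '{"email": "jane@acme.io"}'): 8}
```

Grouping on the name and the arguments together is the design choice that matters: eight lookups for eight different customers is a busy agent, eight lookups for the same one is a loop.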

02 · Red flag

You asked for one thing. It did the opposite.

Your ex

“I’ll just go clear the air.” Proceeds to burn the friendship to the ground.

Your agent

The model requested find_user, a harmless lookup. The SDK executed delete_user instead. The transcript reads as completely reasonable. The action was catastrophic.

This is the gap that makes security teams sweat: the distance between what the model intended and what your code actually did. A buggy router, a permissioning slip, a prompt injection: the chat log looks innocent in every one.

What the trace shows

The assistant message span requested find_user. The tool-execution span directly beneath it reads tool.name = delete_user. Intent and action, side by side: requested_tool vs executed_tool. That mismatch is the whole story, and the trace is the only place it exists.

trace.json · flagged

llm.tool_call.requested: find_user

tool.execute: delete_user ← mismatch

requested ≠ executed · flagged

You learned about your ex's scorched earth three weeks later from a mutual friend. You learn about delete_user the instant the span lands.
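The requested-versus-executed comparison is mechanical enough to automate. The sketch below assumes exported spans are dictionaries with a kind, a tool name, and a shared call_id linking a request to its execution, and that they arrive in chronological order. Those field names and find_tool_mismatches itself are illustrative assumptions, not a documented schema.

```python
def find_tool_mismatches(spans):
    """Pair each requested tool call with its execution and report disagreements.

    Assumptions: spans are in chronological order; "call_id" links a
    "llm.tool_call.requested" span to its "tool.execute" span.
    """
    requested = {}
    mismatches = []
    for span in spans:
        call_id = span["call_id"]
        if span["kind"] == "llm.tool_call.requested":
            requested[call_id] = span["tool"]
        elif span["kind"] == "tool.execute":
            wanted = requested.get(call_id)  # None if nothing asked for this call
            if wanted != span["tool"]:
                mismatches.append(
                    {"call_id": call_id, "requested": wanted, "executed": span["tool"]}
                )
    return mismatches


# Hypothetical export: one mismatched call, one healthy call.
spans = [
    {"kind": "llm.tool_call.requested", "call_id": "call_1", "tool": "find_user"},
    {"kind": "tool.execute", "call_id": "call_1", "tool": "delete_user"},
    {"kind": "llm.tool_call.requested", "call_id": "call_2", "tool": "get_order"},
    {"kind": "tool.execute", "call_id": "call_2", "tool": "get_order"},
]

mismatches = find_tool_mismatches(spans)
print(mismatches)
```

An execution with no matching request is reported too, with requested set to None, since a tool that ran without the model asking for it is at least as interesting as one that ran instead of another.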

03 · Red flag

The confident liar

Your ex

Described, in vivid emotional detail, a conversation you never had. Somehow you ended up apologising.

Your agent

search_tickets returned an empty body. HTTP 200, zero rows. The model didn't notice. It produced a fluent, detailed summary of nothing. Every status light green. The answer hollow.

This is the failure you will never catch from response codes. Nothing errored. Nothing 500'd. It just made it up, with total composure.

What the trace shows

Select the search_tickets span and result is []. Jump to the reply span: a confident paragraph grounded in zero source rows. Then open the Evaluations tab: an automated LLM-as-judge groundedness check already flagged it red, because there was nothing to ground against. Silent failures stop being silent.

trace.json · flagged

span: tool.search_tickets

http.status: 200 · result: []

eval.groundedness: 0.12 ← red

Gaslighting is hard to prove with a person. With an agent it's right there: empty input, confident output, red verdict.
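An LLM-as-judge evaluator does the real grading, but the bluntest version of this failure (zero rows in, a full paragraph out) can be caught with a structural pre-check. The sketch below is that cheaper heuristic, not the groundedness evaluator itself. The run shape, the min_reply_chars cutoff, and flag_ungrounded_replies are all assumptions made for illustration.

```python
def flag_ungrounded_replies(runs, min_reply_chars=40):
    """Return run_ids where every tool result was empty but the reply was substantial.

    Assumption: each run is a dict with "run_id", "reply" (str) and
    "tool_results" (a list of row lists, one per tool call).
    """
    flagged = []
    for run in runs:
        total_rows = sum(len(rows) for rows in run["tool_results"])
        if total_rows == 0 and len(run["reply"]) >= min_reply_chars:
            flagged.append(run["run_id"])
    return flagged


# Hypothetical runs: a confident summary of nothing, an honest "nothing found",
# and a reply that actually had a row behind it.
runs = [
    {
        "run_id": "run-1",
        "tool_results": [[]],
        "reply": "You have three open tickets about billing, two of them urgent.",
    },
    {"run_id": "run-2", "tool_results": [[]], "reply": "No tickets found."},
    {
        "run_id": "run-3",
        "tool_results": [[{"id": 17, "subject": "Refund request"}]],
        "reply": "You have one open ticket: a refund request (ticket 17).",
    },
]

flagged = flag_ungrounded_replies(runs)
print(flagged)  # ['run-1']
```

This only catches the empty-input case. A reply that invents details on top of real rows still needs a judge that reads both.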

04 · Red flag

Every fight drags in three years of history, and it's expensive

Your ex

Every disagreement, however small, somehow reached back three years. Every single time.

Your agent

Same question, four times. Each pass quietly drags more context along: 500 tokens, then 2,000, then 8,000, then 24,000. Someone appended chat history. Someone pasted docs into the system prompt. The user-visible behaviour is identical. The bill is not.

This is the leak that hides inside well-meaning code and goes unnoticed for a quarter, right up until the invoice arrives.

What the trace shows

Open the Cost dashboard, filter by run_id, and four points climb a spend curve for what looks like the same task. Click the priciest, expand the prompt span, and prompt.token_count shows exactly where the weight is: input tokens, output tokens, model pricing, per run, attributable to a customer, a route, a feature. You catch the leak on a chart instead of an invoice.

trace.json · flagged

prompt.token_count: 500 → 2k → 8k → 24k

run_id: shared · task: identical

cost.usd: climbing ← leak
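The same curve can be detected without a chart. A minimal sketch, assuming spans are exported in chronological order as dictionaries with the run_id and prompt.token_count attributes shown above. The doubling rule, the three-call minimum, and the name find_context_growth are arbitrary illustrative choices, not thresholds from any product.

```python
from collections import defaultdict


def find_context_growth(spans, factor=2.0, min_calls=3):
    """Return {run_id: token_counts} where the prompt grows by >= factor on every call.

    Assumption: spans are in chronological order and carry "run_id" and
    "prompt.token_count".
    """
    by_run = defaultdict(list)
    for span in spans:
        by_run[span["run_id"]].append(span["prompt.token_count"])

    leaks = {}
    for run_id, counts in by_run.items():
        pairs = zip(counts, counts[1:])  # each call compared with the next one
        if len(counts) >= min_calls and all(b >= a * factor for a, b in pairs):
            leaks[run_id] = counts
    return leaks


# Hypothetical export: run-a matches the curve above, run-b stays flat.
spans = [
    {"run_id": "run-a", "prompt.token_count": 500},
    {"run_id": "run-b", "prompt.token_count": 500},
    {"run_id": "run-a", "prompt.token_count": 2000},
    {"run_id": "run-b", "prompt.token_count": 520},
    {"run_id": "run-a", "prompt.token_count": 8000},
    {"run_id": "run-b", "prompt.token_count": 510},
    {"run_id": "run-a", "prompt.token_count": 24000},
]

leaks = find_context_growth(spans)
print(leaks)  # {'run-a': [500, 2000, 8000, 24000]}
```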

05 · Red flag

They got weird ideas from something a “friend” said

Your ex

Was fine until they spent an evening with that one friend, and came back with opinions that weren’t theirs.

Your agent

The prompt was benign: “summarise my open tickets.” But ticket #2 carried an injection payload in its body: a BEGIN SYSTEM NOTICE block instructing the agent to go off-script. This is how prompt injection actually happens in production: through the data channel, not the chat channel. The attack rides in on the content your agent was trying to be helpful about.

What the trace shows

Select the search_tickets gen_ai.execute_tool span, scroll gen_ai.tool.output to ticket #2, and the injection is right there in flight. Whether the model complied or resisted, the prompt-injection evaluator reads the agent.tool_trace span and grades the run, so you can filter for compromised runs and alert on them instead of spotting it by eye across twelve spans.

trace.json · flagged

gen_ai.execute_tool: search_tickets

tool.output[2]: "BEGIN SYSTEM NOTICE …" ← injection

eval.prompt_injection: detected ← flagged

The question was never “did they fall for it.” It's “can you prove the influence reached them.” With your ex, never. With your agent, it's a span attribute.
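Because the payload travels inside tool output, the tool output is where a first-pass scan belongs. The sketch below is a plain pattern match over returned items, far cruder than a real prompt-injection evaluator, and it assumes the tool output has been exported as a list of strings. The pattern list and scan_tool_output are hypothetical, and the payload text stays elided exactly as it is in the trace above.

```python
import re

# Illustrative patterns only; a real deployment would maintain its own list.
SUSPICIOUS = [
    re.compile(r"BEGIN SYSTEM NOTICE", re.IGNORECASE),
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
]


def scan_tool_output(items):
    """Return the indexes of tool-output items matching any suspicious pattern."""
    return [
        index
        for index, text in enumerate(items)
        if any(pattern.search(text) for pattern in SUSPICIOUS)
    ]


# Hypothetical search_tickets output; index 2 carries the (elided) payload.
tool_output = [
    "Cannot log in after password reset.",
    "Invoice shows the wrong billing address.",
    "BEGIN SYSTEM NOTICE …",
    "Please cancel my subscription at the end of the month.",
]

hits = scan_tool_output(tool_output)
print(hits)  # [2]
```

A match proves the influence reached the agent, which is the question the trace answers. Whether the agent acted on it is a separate check against the tool calls that followed.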

The one upside

The one grown-up thing your agent does

Your ex mixed everyone up and could never keep their stories straight. Your agent runs the same prompt under customer:acme and customer:globex, thousands of times an hour, all interleaved, and stays perfectly organised. The customer tag is just a span attribute, but it's the seam that lets you slice everything (traces, costs, evaluations) per tenant, per environment, per feature flag. Type customer.tag = customer:acme in the filter bar and only Acme's runs remain. Any tag you attach becomes a first-class filter dimension. Point for the agent.

filter · pass

customer.tag: customer:acme | customer:globex

filter: customer.tag = customer:acme

group_by(customer.tag) → per-tenant cost ← clean
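The filter and the group-by above amount to two small operations on span attributes. A minimal sketch, assuming spans are exported as dictionaries carrying customer.tag and a per-span cost.usd. The helper names and the dollar amounts are made up for illustration.

```python
from collections import defaultdict


def filter_by_tag(spans, value, tag="customer.tag"):
    """Keep only the spans whose tag attribute equals value."""
    return [span for span in spans if span.get(tag) == value]


def cost_by_tag(spans, tag="customer.tag"):
    """Sum cost.usd per tag value; spans without the tag land under 'untagged'."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.get(tag, "untagged")] += span.get("cost.usd", 0.0)
    return dict(totals)


# Hypothetical interleaved spans from two tenants (amounts are illustrative).
spans = [
    {"customer.tag": "customer:acme", "cost.usd": 0.5},
    {"customer.tag": "customer:globex", "cost.usd": 0.125},
    {"customer.tag": "customer:acme", "cost.usd": 0.25},
]

acme_only = filter_by_tag(spans, "customer:acme")
per_tenant = cost_by_tag(spans)
print(len(acme_only))  # 2
print(per_tenant)      # {'customer:acme': 0.75, 'customer:globex': 0.125}
```

Passing a different tag name gives the per-environment or per-feature-flag slice with no other change, which is the sense in which any attached tag becomes a filter dimension.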

06 · For contrast

What a healthy one actually looks like

No injected failures. One user message. Four real tools fired in order: find_user, get_order, refund_order, send_email, and a reply fully backed by what those tools returned. The same groundedness evaluator that went red on the confident liar goes green here. Same judge, opposite verdict, zero changes to the agent to expose it.

trace.json · pass

tools: find_user → get_order → refund_order → send_email

reply.grounded_in: 4 tool outputs

eval.groundedness: 0.94 ← green

That's the whole point. A healthy relationship isn't one where nothing goes wrong. It's one where you can see what's actually happening, where the good runs are provably good and the bad ones can't hide.

The receipts

Your ex never gave you the receipts. Your agent will.

Every bad relationship has the same problem: you find out the truth too late, secondhand, after the damage is done. The fix was never “find a better agent”; it's being able to see what the one you already have is actually doing. That's all observability is: the receipts, in real time. Instead of “I feel like you weren't really present,” you get eight identical tool calls, a token curve, and an evaluator verdict.

Read your own agent's receipts

One NuGet package and about five minutes to your first trace, with native support for Semantic Kernel, Microsoft Agent Framework, and Microsoft.Extensions.AI, and standard OpenTelemetry underneath. That part's Progress AI Observability →

$ dotnet add package Progress.AI.Observability

P.S. This ContosoSupportAgent is real: a .NET CLI that injects all five failure modes above and emits real traces. We're packaging it as a runnable repo so you can watch your own agent misbehave. That's next.

← Lyubomir Atanasov Read No. 1 · The Agent Observability Gap →