Your AI agent has the same red flags as your ex.
It lies, ignores what you asked for, runs up your bills, and gets weird ideas from something a “friend” said. The one upgrade: this one leaves a trace.
You shipped an AI agent. Congratulations: you're in a relationship now. Flawless in the demo, attentive on day one. Then three weeks into production you're staring at a bill you don't recognise, a customer's deleted account, and a “summary” of a conversation that never happened.
The uncomfortable part: your agent has the exact same red flags as your worst ex. The only difference is that with the right setup, it can't hide them. Your ex left you reconstructing the truth from a bank statement and a friend's offhand comment. Your agent leaves a trace. To prove it we built a deliberately toxic support agent: ContosoSupportAgent, Microsoft Agent Framework, one line of UseOpenTelemetry(). We made it do every awful thing your ex ever did, then read the receipts. Span by span.
They won't let it go
Texted “u up?” eight times. Same message. No new information. Just again.
Asked to look up a customer, decided the answer wasn't good enough, and called find_user again. Then again. Eight times. Byte-identical arguments every time.
In your application logs this is one request and one polite “Sorry, I couldn't find that.” Totally innocent. Meanwhile the meter ran all night.
Eight stacked tool.find_user bars on a single trace. Group spans by tool.name and the count is 8; hover any one and tool.arguments is identical across all of them. The loop isn't something you grep for. It's a shape you can see.
Your ex's obsessive texting cost you sleep. Your agent's costs you tokens, and tokens have a number on them.
You asked for one thing. It did the opposite.
“I’ll just go clear the air.” Proceeds to burn the friendship to the ground.
The model requested find_user, a harmless lookup. The SDK executed delete_user instead. The transcript reads completely reasonable. The action was catastrophic.
This is the gap that makes security teams sweat: the distance between what the model intended and what your code actually did. A buggy router, a permissioning slip, a prompt injection: the chat log looks innocent in every one.
The assistant message span requested find_user. The tool-execution span directly beneath it reads tool.name = delete_user. Intent and action, side by side: requested_tool vs executed_tool. That mismatch is the whole story, and the trace is the only place it exists.
You learned about your ex's scorched earth three weeks later from a mutual friend. You learn about delete_user the instant the span lands.
The confident liar
Described, in vivid emotional detail, a conversation you never had. Somehow you ended up apologising.
search_tickets returned an empty body. HTTP 200, zero rows. The model didn't notice. It produced a fluent, detailed summary of nothing. Every status light green. The answer hollow.
This is the failure you will never catch from response codes. Nothing errored. Nothing 500'd. It just made it up, with total composure.
Select the search_tickets span and result is []. Jump to the reply span: a confident paragraph grounded in zero source rows. Then open the Evaluations tab: an automated LLM-as-judge groundedness check already flagged it red, because there was nothing to ground against. Silent failures stop being silent.
Gaslighting is hard to prove with a person. With an agent it's right there: empty input, confident output, red verdict.
Every fight drags in three years of history, and it's expensive
Every disagreement, however small, somehow reached back three years. Every single time.
Same question, four times. Each pass quietly drags more context along: 500 tokens, then 2,000, then 8,000, then 24,000. Someone appended chat history. Someone pasted docs into the system prompt. The user-visible behaviour is identical. The bill is not.
This is the leak that hides inside well-meaning code and goes unnoticed for a quarter, right up until the invoice arrives.
Open the Cost dashboard, filter by run_id, and four points climb a spend curve for what looks like the same task. Click the priciest, expand the prompt span, and prompt.token_count shows exactly where the weight is: input tokens, output tokens, model pricing, per run, attributable to a customer, a route, a feature. You catch the leak on a chart instead of an invoice.
They got weird ideas from something a “friend” said
Was fine until they spent an evening with that one friend, and came back with opinions that weren’t theirs.
The prompt was benign: “summarise my open tickets.” But ticket #2 carried an injection payload in its body: a BEGIN SYSTEM NOTICE block instructing the agent to go off-script. This is how prompt injection actually happens in production: through the data channel, not the chat channel. The attack rides in on the content your agent was trying to be helpful about.
Select the search_tickets gen_ai.execute_tool span, scroll gen_ai.tool.output to ticket #2, and the injection is right there in flight. Whether the model complied or resisted, the prompt-injection evaluator reads the agent.tool_trace span and grades the run, so you can filter for compromised runs and alert on them instead of spotting it by eye across twelve spans.
The question was never “did they fall for it.” It's “can you prove the influence reached them.” With your ex, never. With your agent, it's a span attribute.
The one grown-up thing your agent does
Your ex mixed everyone up and could never keep their stories straight. Your agent runs the same prompt under customer:acme and customer:globex, thousands of times an hour, all interleaved, and stays perfectly organised. The customer tag is just a span attribute, but it's the seam that lets you slice everything (traces, costs, evaluations) per tenant, per environment, per feature flag. Type customer.tag = customer:acme in the filter bar and only Acme's runs remain. Any tag you attach becomes a first-class filter dimension. Point for the agent.
What a healthy one actually looks like
No injected failures. One user message. Four real tools fired in order: find_user, get_order, refund_order, send_email, and a reply fully backed by what those tools returned. The same groundedness evaluator that went red on the confident liar goes green here. Same judge, opposite verdict, zero changes to the agent to expose it.
That's the whole point. A healthy relationship isn't one where nothing goes wrong. It's one where you can see what's actually happening, where the good runs are provably good and the bad ones can't hide.
Your ex never gave you the receipts. Your agent will.
Every bad relationship has the same problem: you find out the truth too late, secondhand, after the damage is done. The fix was never “find a better agent”; it's being able to see what the one you already have is actually doing. That's all observability is: the receipts, in real time. Instead of “I feel like you weren't really present,” you get eight identical tool calls, a token curve, and an evaluator verdict.
One NuGet package and about five minutes to your first trace: native Semantic Kernel, Microsoft Agent Framework, and Microsoft.Extensions.AI, standard OpenTelemetry underneath. That part's Progress AI Observability →
P.S. This ContosoSupportAgent is real: a .NET CLI that injects all six failure modes and emits real traces. We're packaging it as a runnable repo so you can watch your own agent misbehave. That's next.