A poorly instrumented AI agent is a black box that spends money. Model calls are expensive, tool calls can be expensive too, and the decision flow is usually non-deterministic. Without instrumentation designed for this kind of system, when something fails or the bill comes in higher than expected, the team ends up reading loose logs and trying to reconstruct the sequence by hand. After a year and a half running agents in products with real users, I have a fairly settled list of what to instrument first, which standards have emerged, and which expensive mistakes proper tracing avoids from day one.
Why classical systems aren’t enough
Traditional observability was built around synchronous HTTP services: a request comes in, database and inter-service calls happen, a response goes out. Distributed traces capture that call tree, metrics count requests per second and latency percentiles, and logs provide context. With that set, an experienced team can answer most operational questions.
Agents break the mold in three ways. First, a single user input can trigger dozens of chained model calls, with branches and loops whose structure is only known after execution. Second, each call carries an explicit economic cost in input and output tokens that matters as much as latency, if not more. Third, input and output content, unlike a normal API JSON, is relevant for debugging and often necessary to understand why the agent made one decision over another. Instrumenting that well requires stepping out of the traditional mold and making specific choices.
First layer: one trace per run
The first thing to have is one trace per agent run, with each model call and each tool call as a nested span. That sounds obvious, but it's surprising how many teams operate without it and only realize it when the provider's bill spikes and they can't say which runs consumed what. During 2025, OpenTelemetry consolidated a set of semantic conventions for generative AI that define how to name spans, which attributes to use, and how to represent the relationship between the agent and the tools it calls. The major LLM SDKs (OpenAI, Anthropic, Google, Azure) already have automatic instrumentation that emits these spans with no or minimal code changes.
What you need to capture on each model-call span: model name, sampling parameters (temperature, max tokens), input and output token counts, estimated cost in the provider’s currency, time-to-first-token if streaming, total time, and critically, input messages and output text or function calls. The last point is sensitive because those contents may include personal data, and the configuration must allow redacting or hashing them per the organization’s policy.
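As a minimal sketch of that attribute set, the helper below builds the attribute dict for a model-call span. The `gen_ai.*` names follow the current OpenTelemetry generative-AI semantic conventions; the cost attribute (`llm.cost.usd` here) is a custom addition, since cost is not standardized, and the per-million-token prices passed in are purely illustrative.

```python
# Attributes worth recording on a model-call span. The gen_ai.* keys follow
# the OpenTelemetry gen_ai semantic conventions; llm.cost.usd is a custom,
# non-standard attribute, and the prices are hypothetical examples.

def model_call_attributes(model, temperature, max_tokens,
                          input_tokens, output_tokens,
                          price_in_per_mtok, price_out_per_mtok):
    """Build the attribute dict for one model-call span."""
    cost = (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000
    return {
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Estimated cost: custom attribute, not part of the convention yet.
        "llm.cost.usd": round(cost, 6),
    }

attrs = model_call_attributes("gpt-4o", 0.2, 1024,
                              3_000, 500, 2.50, 10.00)
```

Message content would be attached the same way, behind the redaction policy mentioned above rather than unconditionally.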
For each tool call, the span must capture the tool name, arguments, result or error, and time spent. A tool that queries an external API generates its own traces too, so the OpenTelemetry context must propagate through it; that way you see the full tree of what happens when the agent invokes a tool.
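The shape of that tree can be illustrated with a toy recorder: the run is the root span, and each model call and tool call nests under it. Real code would use the OpenTelemetry SDK; this sketch, with a hypothetical `search_flights` tool, only shows the structure.

```python
# Toy trace recorder showing the span tree for one agent run. The run is
# the root; model calls and tool calls are nested children. This is a
# structural sketch, not a replacement for the OpenTelemetry SDK.
import contextlib
import time

class Trace:
    def __init__(self):
        self.root = {"name": "agent_run", "attrs": {}, "children": []}
        self._stack = [self.root]

    @contextlib.contextmanager
    def span(self, name, **attrs):
        node = {"name": name, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)   # nest under current span
        self._stack.append(node)
        start = time.monotonic()
        try:
            yield node
        finally:
            node["attrs"]["duration_s"] = time.monotonic() - start
            self._stack.pop()

trace = Trace()
with trace.span("model_call", model="gpt-4o"):
    # The model decided to call a tool, so its span nests under the call.
    with trace.span("tool_call", tool="search_flights", args={"from": "MAD"}):
        pass  # the tool's own HTTP spans would hang from here
```

With the real SDK, the context propagation happens through the active span, but the resulting tree is the same.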
Second layer: aggregated metrics
Individual traces alone don't give the full picture. You need aggregated metrics that show system behavior as a whole. The five I always instrument are: cost per conversation in provider currency, end-to-end latency as the user sees it, completion rate versus abandonment or error, mean number of steps per conversation, and tool-use rate per step. Those five answer the usual operational questions without diving into individual traces: whether average cost suddenly rises, whether a new model version hurts completion rate, whether a new tool is causing problems or being overused.
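Computed from per-conversation records, the five metrics amount to a few aggregations. The record schema below (`cost_eur`, `latency_s`, `status`, `steps`, `tool_calls`) is an assumption for the sketch, not a standard.

```python
# Sketch: the five aggregate metrics computed from per-conversation
# records. The record fields are an assumed schema for illustration.

def aggregate(conversations):
    n = len(conversations)
    completed = sum(1 for c in conversations if c["status"] == "completed")
    total_steps = sum(c["steps"] for c in conversations)
    return {
        "avg_cost_eur": sum(c["cost_eur"] for c in conversations) / n,
        "avg_latency_s": sum(c["latency_s"] for c in conversations) / n,
        "completion_rate": completed / n,
        "avg_steps": total_steps / n,
        # Tool calls per step, across all conversations.
        "tool_use_rate": sum(c["tool_calls"] for c in conversations) / total_steps,
    }

sample = [
    {"cost_eur": 0.12, "latency_s": 4.0, "status": "completed",
     "steps": 3, "tool_calls": 1},
    {"cost_eur": 0.48, "latency_s": 9.0, "status": "error",
     "steps": 5, "tool_calls": 3},
]
m = aggregate(sample)
```

In production these would be emitted as OpenTelemetry metrics or computed by the observability backend rather than in batch, but the definitions stay the same.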
Beyond those come more domain-specific metrics. If the agent has an expected happy path, say a completed booking or a resolved query, it’s worth instrumenting success rate on that path. If there’s human interaction, measure the escalation-to-human rate. If there are per-user budgets, alert when a conversation nears the limit. Those metrics are what translate agent behavior into something product and operations understand.
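The budget alert in particular is trivial to implement and worth having early. A minimal sketch, where the 80% warning threshold is an arbitrary choice rather than any recommendation:

```python
# Per-conversation budget check. The 80% warning threshold is arbitrary.

def budget_status(cost_so_far, budget, warn_at=0.8):
    """Return 'ok', 'warning' (nearing the limit) or 'exceeded'."""
    if cost_so_far >= budget:
        return "exceeded"
    if cost_so_far >= warn_at * budget:
        return "warning"
    return "ok"
```

Called after every model step, this is enough to page someone, or cut the conversation off, before the invoice does it for you.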
Third layer: production evaluations
Classical observability ends there, but agents need one more layer: evaluations that run on real conversations or samples of them to measure quality. Knowing a conversation ended isn’t enough; you need to know whether it ended well. Techniques in use are varied: manual sample annotation, model-as-judge against defined criteria, comparison against a reference set, detection of specific patterns like unnecessary refusals or detectable hallucinations.
Automatic production evaluations have real compute cost and require a clear policy: what’s evaluated, how often, with which model, and what happens with the results. What I’ve seen work is a pyramid: a small fraction of conversations evaluated by humans, a larger fraction by a judge model, and the rest with cheap heuristic metrics. The three levels are calibrated against each other periodically to ensure the automatic judge doesn’t drift from the human reference.
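The pyramid's sampling can be made deterministic by hashing the conversation id, so a given conversation always lands in the same tier and the tiers stay disjoint. A sketch, with the 1% human / 10% judge fractions purely illustrative and to be tuned to your evaluation budget:

```python
# Sketch of the evaluation pyramid: route each conversation to one of
# three tiers by hashing its id. Fractions are illustrative, not advice.
import hashlib

def eval_tier(conversation_id, human_frac=0.01, judge_frac=0.10):
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < human_frac:
        return "human"                      # small fraction, human-annotated
    if bucket < human_frac + judge_frac:
        return "judge"                      # larger fraction, model-as-judge
    return "heuristic"                      # the rest: cheap heuristic checks
```

Because the routing is a pure function of the id, the human tier is always a strict subset of nothing else, which makes the periodic calibration between tiers straightforward: re-run the judge and the heuristics on the human-annotated sample and compare.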
The most common failure pattern
The most expensive failure I’ve seen in teams running agents in production is not having a way to answer “why did this conversation cost thirty euros”. The conversation happened, the cost is on the invoice, but logs don’t contain enough detail to say whether it was an agent loop, a user prompt that filled the context window, a misconfigured tool, or a poor model choice. Without that traceability the team can’t prevent it from recurring, and the monthly LLM bill becomes a black box growing without explanation.
The cure is to instrument from day one, even if it feels like overhead. Agents start small and grow fast; setting up traces when you already have thousands of daily conversations is much more expensive than doing it at the start. And the effort is modest: OpenTelemetry has automatic instrumentation for the main SDKs, and commercial agent observability tools (Langfuse, Helicone, Phoenix, Weave and similar) plug in with minimal code changes.
A minimum path
For a team starting today with an agent in production, my recommended minimum path is clear. First, enable OpenTelemetry instrumentation in the model SDK you use, with the generative-AI semantic conventions. Second, send those traces to a backend that understands conversations: Langfuse or similar work well, or Grafana Tempo with hand-made dashboards if you prefer self-hosting. Third, define the five basic business metrics and put them on a visible dashboard. Fourth, set up a simple automatic evaluation, even if it's just a judge model with three criteria, over a traffic sample.
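As a config fragment for the first step, assuming Python and the OpenAI SDK: the package and variable names below follow the OpenTelemetry Python contrib project as of this writing, and may change as the gen_ai conventions evolve; the service name and endpoint are placeholders for your own.

```shell
# Sketch: enabling OTel auto-instrumentation for the OpenAI SDK.
# Package and env-var names from the OpenTelemetry Python contrib
# project; verify against current docs before relying on them.
pip install opentelemetry-distro opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-openai-v2

export OTEL_SERVICE_NAME="my-agent"                         # placeholder
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"  # your backend
# Opt in to recording prompts/completions; keep this off if they carry PII.
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

opentelemetry-instrument python agent.py   # agent.py is your entry point
```

Note that message-content capture is opt-in precisely because of the redaction concerns discussed earlier; the traces flow without it, just with less debugging context.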
That quartet covers eighty percent of the value. After that you add finer layers as the system grows: traces propagated to internal tools, per-user budgets, comparative evaluations between model versions, specific alerts. All that makes sense once there’s a live system to observe; before that it’s over-engineering.
My reading
Agent observability moved in 2025 from an uncomfortable gap to having acceptable answers. OpenTelemetry’s semantic conventions for generative AI, the maturation of platforms like Langfuse and Phoenix, and the incorporation of automatic instrumentation in major providers’ SDKs make the entry cost affordable today even for small teams. There’s no excuse for running agents blind.
What’s still missing is full standardization in how to represent intermediate agent states, planning branches, and persistent memory. The ecosystem is moving fast there and conventions will change in the next eighteen months. The prudent posture for a team is to adopt current conventions assuming migration will happen; the alternative, building something from scratch, almost never ages well. Instrument early, instrument with open conventions, and revisit when more consolidated versions appear. That’s the policy that has worked best for me.