A poorly instrumented AI agent is a black box that spends money. Model calls are expensive, tool calls can be expensive too, and the decision flow is usually non-deterministic. Without instrumentation designed for this kind of system, when something fails or the bill comes in higher than expected, the team ends up reading loose logs and trying to reconstruct the sequence by hand. After a year and a half running agents in products with real users, I have a fairly settled list of what to instrument first, which standards have emerged, and which expensive mistakes proper tracing avoids from day one.
Why classical systems aren’t enough
Traditional observability was built around synchronous HTTP services: a request comes in, database and inter-service calls happen, a response goes out. Distributed traces capture that call tree, metrics count requests per second and latency percentiles, and logs provide context. With that set, an experienced team can answer most operational questions.
Agents break the mold in three ways. First, a single user input can trigger dozens of chained model calls, with branches and loops whose structure is only known after execution. Second, each call carries an explicit economic cost in input and output tokens that matters as much as latency, if not more. Third, input and output content, unlike a normal API JSON, is relevant for debugging and often necessary to understand why the agent made one decision over another. Instrumenting that well requires stepping out of the traditional mold and making specific choices.
First layer: one trace per run
The first thing to have is one trace per agent run, with each model call and each tool call as a nested span. That sounds obvious, but it's surprising how many teams operate without it and only realize it when the provider's bill spikes and they can't say which runs consumed what. During 2025, OpenTelemetry consolidated a set of semantic conventions for generative AI that define how to name spans, which attributes to use, and how to represent the relationship between the agent and the tools it calls. The major LLM SDKs (OpenAI, Anthropic, Google, Azure) already have automatic instrumentation that emits these spans with no or minimal code changes.
What you need to capture on each model-call span: model name, sampling parameters (temperature, max tokens), input and output token counts, estimated cost in the provider’s currency, time-to-first-token if streaming, total time, and critically, input messages and output text or function calls. The last point is sensitive because those contents may include personal data, and the configuration must allow redacting or hashing them per the organization’s policy.
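As a minimal sketch of that attribute set, the helper below builds the attribute dict for a model-call span. The `gen_ai.*` names follow the current OpenTelemetry generative-AI semantic conventions; the cost attribute (`llm.cost.usd` here) is a custom addition, since cost is not standardized, and the per-million-token prices passed in are purely illustrative.

```python
# Attributes worth recording on a model-call span. The gen_ai.* keys follow
# the OpenTelemetry gen_ai semantic conventions; llm.cost.usd is a custom,
# non-standard attribute, and the prices are hypothetical examples.

def model_call_attributes(model, temperature, max_tokens,
                          input_tokens, output_tokens,
                          price_in_per_mtok, price_out_per_mtok):
    """Build the attribute dict for one model-call span."""
    cost = (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000
    return {
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Estimated cost: custom attribute, not part of the convention yet.
        "llm.cost.usd": round(cost, 6),
    }

attrs = model_call_attributes("gpt-4o", 0.2, 1024,
                              3_000, 500, 2.50, 10.00)
```

Message content would be attached the same way, behind the redaction policy mentioned above rather than unconditionally.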
For each tool call, the span must capture the tool name, arguments, result or error, and time spent. A tool that queries an external API generates its own traces too, so the OpenTelemetry context must propagate through it; that way you see the full tree of what happens when the agent invokes a tool.
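The shape of that tree can be illustrated with a toy recorder: the run is the root span, and each model call and tool call nests under it. Real code would use the OpenTelemetry SDK; this sketch, with a hypothetical `search_flights` tool, only shows the structure.

```python
# Toy trace recorder showing the span tree for one agent run. The run is
# the root; model calls and tool calls are nested children. This is a
# structural sketch, not a replacement for the OpenTelemetry SDK.
import contextlib
import time

class Trace:
    def __init__(self):
        self.root = {"name": "agent_run", "attrs": {}, "children": []}
        self._stack = [self.root]

    @contextlib.contextmanager
    def span(self, name, **attrs):
        node = {"name": name, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)   # nest under current span
        self._stack.append(node)
        start = time.monotonic()
        try:
            yield node
        finally:
            node["attrs"]["duration_s"] = time.monotonic() - start
            self._stack.pop()

trace = Trace()
with trace.span("model_call", model="gpt-4o"):
    # The model decided to call a tool, so its span nests under the call.
    with trace.span("tool_call", tool="search_flights", args={"from": "MAD"}):
        pass  # the tool's own HTTP spans would hang from here
```

With the real SDK, the context propagation happens through the active span, but the resulting tree is the same.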
Second layer: aggregated metrics
Individual traces alone don't give the full picture. You need aggregated metrics that show system behavior as a whole. The five I always instrument are: cost per conversation in provider currency, end-to-end latency as the user sees it, completion rate versus abandonment or error, mean number of steps per conversation, and tool-use rate per step. Those five answer the usual operational questions without diving into individual traces: whether average cost suddenly rises, whether a new model version hurts completion rate, whether a new tool is causing problems or being overused.
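Computed from per-conversation records, the five metrics amount to a few aggregations. The record schema below (`cost_eur`, `latency_s`, `status`, `steps`, `tool_calls`) is an assumption for the sketch, not a standard.

```python
# Sketch: the five aggregate metrics computed from per-conversation
# records. The record fields are an assumed schema for illustration.

def aggregate(conversations):
    n = len(conversations)
    completed = sum(1 for c in conversations if c["status"] == "completed")
    total_steps = sum(c["steps"] for c in conversations)
    return {
        "avg_cost_eur": sum(c["cost_eur"] for c in conversations) / n,
        "avg_latency_s": sum(c["latency_s"] for c in conversations) / n,
        "completion_rate": completed / n,
        "avg_steps": total_steps / n,
        # Tool calls per step, across all conversations.
        "tool_use_rate": sum(c["tool_calls"] for c in conversations) / total_steps,
    }

sample = [
    {"cost_eur": 0.12, "latency_s": 4.0, "status": "completed",
     "steps": 3, "tool_calls": 1},
    {"cost_eur": 0.48, "latency_s": 9.0, "status": "error",
     "steps": 5, "tool_calls": 3},
]
m = aggregate(sample)
```

In production these would be emitted as OpenTelemetry metrics or computed by the observability backend rather than in batch, but the definitions stay the same.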
Beyond those come more domain-specific metrics. If the agent has an expected happy path, say a completed booking or a resolved query, it’s worth instrumenting success rate on that path. If there’s human interaction, measure the escalation-to-human rate. If there are per-user budgets, alert when a conversation nears the limit. Those metrics are what translate agent behavior into something product and operations understand.
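The budget alert in particular is trivial to implement and worth having early. A minimal sketch, where the 80% warning threshold is an arbitrary choice rather than any recommendation:

```python
# Per-conversation budget check. The 80% warning threshold is arbitrary.

def budget_status(cost_so_far, budget, warn_at=0.8):
    """Return 'ok', 'warning' (nearing the limit) or 'exceeded'."""
    if cost_so_far >= budget:
        return "exceeded"
    if cost_so_far >= warn_at * budget:
        return "warning"
    return "ok"
```

Called after every model step, this is enough to page someone, or cut the conversation off, before the invoice does it for you.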
Third layer: production evaluations
Classical observability ends there, but agents need one more layer: evaluations that run on real conversations or samples of them to measure quality. Knowing a conversation ended isn’t enough; you need to know whether it ended well. Techniques in use are varied: manual sample annotation, model-as-judge against defined criteria, comparison against a reference set, detection of specific patterns like unnecessary refusals or detectable hallucinations.
Automatic production evaluations have real compute cost and require a clear policy: what’s evaluated, how often, with which model, and what happens with the results. What I’ve seen work is a pyramid: a small fraction of conversations evaluated by humans, a larger fraction by a judge model, and the rest with cheap heuristic metrics. The three levels are calibrated against each other periodically to ensure the automatic judge doesn’t drift from the human reference.
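The pyramid's sampling can be made deterministic by hashing the conversation id, so a given conversation always lands in the same tier and the tiers stay disjoint. A sketch, with the 1% human / 10% judge fractions purely illustrative and to be tuned to your evaluation budget:

```python
# Sketch of the evaluation pyramid: route each conversation to one of
# three tiers by hashing its id. Fractions are illustrative, not advice.
import hashlib

def eval_tier(conversation_id, human_frac=0.01, judge_frac=0.10):
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < human_frac:
        return "human"                      # small fraction, human-annotated
    if bucket < human_frac + judge_frac:
        return "judge"                      # larger fraction, model-as-judge
    return "heuristic"                      # the rest: cheap heuristic checks
```

Because the routing is a pure function of the id, the human tier is always a strict subset of nothing else, which makes the periodic calibration between tiers straightforward: re-run the judge and the heuristics on the human-annotated sample and compare.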
The most common failure pattern
The most expensive failure I’ve seen in teams running agents in production is not having a way to answer “why did this conversation cost thirty euros”. The conversation happened, the cost is on the invoice, but logs don’t contain enough detail to say whether it was an agent loop, a user prompt that filled the context window, a misconfigured tool, or a poor model choice. Without that traceability the team can’t prevent it from recurring, and the monthly LLM bill becomes a black box growing without explanation.
The cure is to instrument from day one, even if it feels like overhead. Agents start small and grow fast; setting up traces when you already have thousands of daily conversations is much more expensive than doing it at the start. And the effort is modest: OpenTelemetry has automatic instrumentation for the main SDKs, and commercial agent observability tools (Langfuse, Helicone, Phoenix, Weave and similar) plug in with minimal code changes.
A minimum path
For a team starting today with an agent in production, my recommended minimum path is clear. First, enable OpenTelemetry instrumentation in the model SDK you use, with the generative-AI semantic conventions. Second, send those traces to a backend that understands conversations: Langfuse or similar work well, or Grafana Tempo with hand-made dashboards if you prefer self-hosting. Third, define the five basic business metrics and put them on a visible dashboard. Fourth, set up a simple automatic evaluation, even if it's just a judge model with three criteria, over a traffic sample.
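As a config fragment for the first step, assuming Python and the OpenAI SDK: the package and variable names below follow the OpenTelemetry Python contrib project as of this writing, and may change as the gen_ai conventions evolve; the service name and endpoint are placeholders for your own.

```shell
# Sketch: enabling OTel auto-instrumentation for the OpenAI SDK.
# Package and env-var names from the OpenTelemetry Python contrib
# project; verify against current docs before relying on them.
pip install opentelemetry-distro opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-openai-v2

export OTEL_SERVICE_NAME="my-agent"                         # placeholder
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"  # your backend
# Opt in to recording prompts/completions; keep this off if they carry PII.
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

opentelemetry-instrument python agent.py   # agent.py is your entry point
```

Note that message-content capture is opt-in precisely because of the redaction concerns discussed earlier; the traces flow without it, just with less debugging context.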
That quartet covers eighty percent of the value. After that you add finer layers as the system grows: traces propagated to internal tools, per-user budgets, comparative evaluations between model versions, specific alerts. All that makes sense once there’s a live system to observe; before that it’s over-engineering.
My reading
Agent observability moved in 2025 from an uncomfortable gap to having acceptable answers. OpenTelemetry’s semantic conventions for generative AI, the maturation of platforms like Langfuse and Phoenix, and the incorporation of automatic instrumentation in major providers’ SDKs make the entry cost affordable today even for small teams. There’s no excuse for running agents blind.
What’s still missing is full standardization in how to represent intermediate agent states, planning branches, and persistent memory. The ecosystem is moving fast there and conventions will change in the next eighteen months. The prudent posture for a team is to adopt current conventions assuming migration will happen; the alternative, building something from scratch, almost never ages well. Instrument early, instrument with open conventions, and revisit when more consolidated versions appear. That’s the policy that has worked best for me.