Agent observability with OpenTelemetry GenAI semconv in 2026

May 18, 2026 · 25 min read

Table of contents

Key takeaways
What the GenAI semconv covers in 2026
Key attributes: gen_ai.system, request, response, usage
Spans for tool use and MCP servers
Python instrumentation with the Anthropic SDK
Collection with OTel Collector → Tempo
Dashboard: latency, tokens, errors per agent
Conclusion

The OpenTelemetry GenAI semantic conventions (semconv) define the standard attributes for instrumenting LLM calls, tool executions, and agent operations. This post shows how to apply them in 2026 to a real Python agent built on the Anthropic SDK, collect the spans with an OTel Collector, and query them in Grafana Tempo. See also: the complete guide to the Model Context Protocol (MCP), which frames the MCP part of the instrumentation.

Key takeaways

The OTel GenAI spec went through its stabilisation window in 2025 and, by 2026, span names (chat, execute_tool, invoke_agent) and key attributes are consistent across providers.
Two attributes identify the provider: the modern gen_ai.provider.name and the historical gen_ai.system. In 2026 it pays to emit both: many collectors and dashboards still read the second.
Classic tool use uses gen_ai.tool.name and gen_ai.tool.call.id inside an execute_tool span. MCP servers have a dedicated sub-spec with mcp.method.name, mcp.protocol.version, and mcp.session.id.
The standard collection layer is OTel Collector with the otlp receiver, memory_limiter + batch processors, and an otlphttp exporter pointing at Tempo. Without memory_limiter the collector falls over on the day you least expect it.
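TraceQL accepts filters on GenAI attributes: { span.gen_ai.operation.name = "chat" && span.gen_ai.usage.input_tokens > 1000 } is a valid query against Tempo. Filtering on the operation attribute rather than the span name matters because the span name also carries the model.
The Anthropic SDK (via a manual wrapper or OpenLLMetry), LangSmith, and Braintrust can all produce GenAI semconv attributes today, so a panel built against them keeps working when you swap providers or agent frameworks.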

What the GenAI semconv covers in 2026

The OpenTelemetry GenAI semantic conventions^[1] cover five signal families: input/output events, exceptions, metrics, model-operation spans, and agent-operation spans. On top of that base, the spec adds vendor-specific conventions (Anthropic, OpenAI, Azure AI Inference, AWS Bedrock) and, since late 2025, a dedicated sub-spec for MCP servers. The formal status of the gen-ai group is “Development” in 2026 with several areas in release-candidate, but in practice the span names and key attributes have been stable for over a year and the commercial SDKs emit them with the same shape.

What matters operationally is that this turns the agent dashboard into a portable setup: a chat p99 panel or an execute_tool error-rate alert doesn’t need rewriting when you switch from Claude to GPT, or when you migrate from manual instrumentation to a library like OpenLLMetry. The investment that on prior stacks had to be made twice — once to instrument proprietary function-calling and again to ETL logs — is now a single investment.

It helps to separate two span classes. Model-operation spans are individual LLM calls: chat, text_completion, embeddings. Agent-operation spans cover the agent work around those calls: invoke_agent, create_agent, execute_tool. A typical turn’s trace therefore has a hierarchy: an invoke_agent span contains one or more chat spans, and any chat that asks for a tool has an execute_tool span as its child. The pillar guide to the Model Context Protocol explains the context this tree sits inside.
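
To make that hierarchy concrete, here is a small stdlib-only sketch that renders a turn as an indented tree. SpanRecord and render_tree are hypothetical helpers written for this post, not OpenTelemetry classes, and the agent name support-bot is made up; the only thing it takes from the spec is the parent/child shape described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    """Minimal stand-in for an exported span (hypothetical, not an OTel type)."""
    span_id: str
    parent_id: str | None
    name: str


def render_tree(spans: list[SpanRecord]) -> list[str]:
    """Return one indented line per span, parents before their children."""
    children: dict[str | None, list[SpanRecord]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: str | None, depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# One agent turn: two model calls, the first of which asks for a tool.
turn = [
    SpanRecord("a1", None, "invoke_agent support-bot"),
    SpanRecord("b1", "a1", "chat claude-opus-4-7"),
    SpanRecord("c1", "b1", "execute_tool create_ticket"),
    SpanRecord("b2", "a1", "chat claude-opus-4-7"),
]
print("\n".join(render_tree(turn)))
```

In a real trace the ids come from the SDK and the tree is what Tempo draws in its waterfall view; the sketch only shows which span is expected to sit under which.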

Key attributes: gen_ai.system, request, response, usage

The minimum attribute set for a chat span splits into five buckets:

Provider identification: gen_ai.provider.name with a canonical value (anthropic, openai, azure.ai.openai, aws.bedrock) and, for back-compat with older instrumentations, gen_ai.system with a human value (Anthropic).
Operation: gen_ai.operation.name ∈ {chat, text_completion, embeddings, execute_tool, invoke_agent, create_agent}.
Request: gen_ai.request.model, and optionally gen_ai.request.max_tokens, gen_ai.request.temperature, gen_ai.request.top_p, and gen_ai.request.stop_sequences.
Response: gen_ai.response.model (may differ from the request if the provider routes to a variant), gen_ai.response.id, gen_ai.response.finish_reasons.
Usage: gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, the pair that anchors almost every cost dashboard.

python

# app/observability.py: chat() wrapper with GenAI semconv attributes
from opentelemetry import trace
import anthropic

tracer = trace.get_tracer("agente-anthropic-sdk")
client = anthropic.Anthropic()


def chat(model: str, messages: list[dict], max_tokens: int = 1024):
    # The span wraps only the API call, so its duration is the model latency
    # and not whatever the caller does with the response afterwards.
    # Automatic exception recording is switched off because the except block
    # records the exception explicitly; leaving both on duplicates the event.
    with tracer.start_as_current_span(
        f"chat {model}",
        kind=trace.SpanKind.CLIENT,
        record_exception=False,
        set_status_on_exception=False,
    ) as span:
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.system", "Anthropic")  # back-compat
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        try:
            response = client.messages.create(
                model=model, max_tokens=max_tokens, messages=messages
            )
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(exc)))
            raise
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.response.id", response.id)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response

Two details earn their keep. The span name follows the {operation.name} {model} pattern when the model is known up front (chat claude-opus-4-7), or just {operation.name} when it isn’t. Cardinality stays low because the model rarely changes. And exceptions are recorded with span.record_exception() plus span.set_status(StatusCode.ERROR) before re-raising — without that, TraceQL filters by status = error find nothing.
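
The naming rule is small enough to keep in one helper so every span in the codebase follows it. This is a minimal sketch; genai_span_name is a hypothetical name, not something the OpenTelemetry API provides.

```python
from typing import Optional


def genai_span_name(operation: str, target: Optional[str] = None) -> str:
    """Build '{operation} {target}', or just '{operation}' when the target is unknown."""
    return f"{operation} {target}" if target else operation


print(genai_span_name("chat", "claude-opus-4-7"))
print(genai_span_name("chat"))
```

The same helper covers execute_tool spans, where the target is the tool name rather than the model.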

If you use Anthropic with prompt caching, also populate two vendor-specific attributes under the same prefix: gen_ai.usage.input_tokens.cache_read and gen_ai.usage.input_tokens.cache_write. They’re extensions rather than strict spec, but mature SDKs already expose them, and they’re what lets you verify in production that requests are actually hitting the cache.
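
A sketch of how those two attributes could be filled from the response, and how to turn them into a hit ratio. It assumes the usage object carries cache_read_input_tokens and cache_creation_input_tokens next to input_tokens, and that input_tokens counts only the uncached part of the prompt; usage_attributes and cache_hit_ratio are hypothetical helpers, and the attribute names are the ones used in this post, not a guaranteed part of the spec.

```python
from types import SimpleNamespace


def usage_attributes(usage) -> dict:
    """Map an Anthropic-style usage object to span attributes."""
    attrs = {
        "gen_ai.usage.input_tokens": usage.input_tokens,
        "gen_ai.usage.output_tokens": usage.output_tokens,
    }
    # The cache fields can be absent or None when prompt caching is not in use.
    cache_read = getattr(usage, "cache_read_input_tokens", None)
    cache_write = getattr(usage, "cache_creation_input_tokens", None)
    if cache_read is not None:
        attrs["gen_ai.usage.input_tokens.cache_read"] = cache_read
    if cache_write is not None:
        attrs["gen_ai.usage.input_tokens.cache_write"] = cache_write
    return attrs


def cache_hit_ratio(attrs: dict) -> float:
    """Share of prompt tokens served from cache; 0.0 when there is no input."""
    read = attrs.get("gen_ai.usage.input_tokens.cache_read", 0)
    write = attrs.get("gen_ai.usage.input_tokens.cache_write", 0)
    total = read + write + attrs["gen_ai.usage.input_tokens"]
    return read / total if total else 0.0


# Stand-in for response.usage, with made-up numbers.
usage = SimpleNamespace(
    input_tokens=200,
    output_tokens=50,
    cache_read_input_tokens=1800,
    cache_creation_input_tokens=0,
)
attrs = usage_attributes(usage)
print(attrs, cache_hit_ratio(attrs))
```

Passing the resulting dict to span.set_attributes() keeps the wrapper short, and a ratio that stays near zero in production is the signal that the cache breakpoints are not where you think they are.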

Spans for tool use and MCP servers

Classic tool use — function-calling without MCP — is instrumented with an execute_tool child span. Required attributes are gen_ai.tool.name (the name the model sees) and gen_ai.tool.call.id (the API-returned identifier: toolu_... on Anthropic, call_... on OpenAI). That makes log-trace correlation in Tempo work cleanly when a tool misbehaves.

python

# app/tool_span.py — execute_tool child span
from opentelemetry import trace
tracer = trace.get_tracer("agente-anthropic-sdk")


def run_tool(tool_use_block, fn) -> str:
    with tracer.start_as_current_span(f"execute_tool {tool_use_block.name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_use_block.name)
        span.set_attribute("gen_ai.tool.call.id", tool_use_block.id)
        span.set_attribute("gen_ai.tool.type", "function")
        return fn(tool_use_block.input)

When the tool is an MCP server, things change: the MCP page of the OpenTelemetry GenAI semantic conventions^[2] defines its own attribute set that extends the generic one. mcp.method.name is required (values include tools/call, tools/list, and initialize); mcp.protocol.version (e.g. 2025-06-18), mcp.session.id, and, where applicable, mcp.resource.uri are recommended. The server-side span name follows the pattern {mcp.method.name} {target} (tools/call create_ticket, for example), with span kind SERVER and the client span as its parent.

python

# mcp_server/observability.py — server-side MCP span
from opentelemetry import trace
tracer = trace.get_tracer("mcp-dominio-server")

def traced_tools_call(tool_name: str, session_id: str, fn, *args, **kwargs):
    with tracer.start_as_current_span(
        f"tools/call {tool_name}", kind=trace.SpanKind.SERVER
    ) as span:
        span.set_attribute("mcp.method.name", "tools/call")
        span.set_attribute("mcp.protocol.version", "2025-06-18")
        span.set_attribute("mcp.session.id", session_id)
        span.set_attribute("gen_ai.tool.name", tool_name)
        return fn(*args, **kwargs)

The operational consequence is that an MCP-mediated turn produces a four-level trace: chat → execute_tool → tools/call → POST https://.... Each level lives in a different stack (Anthropic client, agent, MCP server, HTTP backend), but the traceparent context propagates over the JSON-RPC transport and through the server’s HTTP client, so the tree comes out complete. The companion tutorial on building an agent with the Anthropic SDK shows the agent code that produces this hierarchy.
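
For readers who haven’t looked inside a traceparent value, the sketch below builds and parses one in the W3C Trace Context format (version 00) and attaches it to a JSON-RPC request. Carrying it under params._meta is an assumption about how your MCP client and server agree to pass context, and all three function names are hypothetical; in a real service the OpenTelemetry propagators do this work for you.

```python
import re

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C traceparent value from hex ids."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> tuple:
    """Return (trace_id, span_id, sampled) or raise ValueError."""
    match = _TRACEPARENT.match(value)
    if match is None:
        raise ValueError(f"invalid traceparent: {value!r}")
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("all-zero ids are not valid")
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def with_trace_context(request: dict, traceparent: str) -> dict:
    """Return a copy of a JSON-RPC request carrying traceparent in params._meta."""
    params = dict(request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = traceparent
    params["_meta"] = meta
    return {**request, "params": params}


header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "create_ticket", "arguments": {}},
}
traced_call = with_trace_context(call, header)
print(traced_call["params"]["_meta"], parse_traceparent(header))
```

The server side reads the same value back, starts its tools/call span with that trace id and parent span id, and that is the whole trick behind the four levels landing in one tree.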

Python instrumentation with the Anthropic SDK

In 2026, commercial SDKs emit GenAI semconv attributes in three ways. The official Anthropic SDK docs^[3] describe the shape of response.usage — where the input_tokens/output_tokens pair comes from — but don’t auto-instrument; the canonical pattern is the manual wrapper above, or the opentelemetry-instrumentation-anthropic package from the OpenLLMetry family for auto-instrumentation. LangSmith exports spans to OTLP when you set LANGSMITH_TRACING=true and an OTLP endpoint, with gen_ai.system and the gen_ai.usage.* family already populated. Braintrust, since mid-2025, emits GenAI-semconv-conformant spans through its SDK and offers a direct OTLP integration.

The practical consequence is that the dashboard works even if the stack underneath changes. A chat p99 panel, an alert on gen_ai.usage.input_tokens > 8000, or a counter on execute_tool errors are built once and keep working if the platform team replaces Anthropic with Bedrock tomorrow, or the evals team adds Braintrust to the stack. The full implementation of the OTel wrapper and the Collector configuration lives in the reference repository at github.com/jacarsystems/jacar-anthropic-sdk-demo^[4] — specifically under app/observability.py and ops/otel-collector.yaml.

If you’re starting from scratch, pick one of the two paths and don’t mix them. Manual instrumentation gives you full control over the span shape and is ideal when the agent is a small piece in a larger service that already has its own tracing policy. Auto-instrumentation with OpenLLMetry or equivalent is the fast path when the agent is the main service and you’d rather invest the time in dashboards than in boilerplate. Combining the two usually ends in duplicate spans that clutter the tree and become a real pain to debug.

Collection with OTel Collector → Tempo

The other side of instrumentation is collection. The OTel Collector is the agent that receives spans from the application, processes them, and exports them. The canonical 2026 shape is three blocks: OTLP receivers (gRPC on :4317, HTTP on :4318), memory_limiter and batch processors, and an otlphttp exporter pointing at Tempo. Three details matter more than they look. memory_limiter is not optional: without it, a traffic spike crashes the collector with OOM before the orchestrator can react. The batch processor should flush on a timeout of about five seconds, so spans travel in fewer, larger requests and network cost stays reasonable. And the debug exporter (formerly logging) helps debug the local integration without touching application logic.

yaml

# ops/otel-collector.yaml — drop-in for /etc/otelcol-contrib/config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
    tls: { insecure: true }
  debug: { verbosity: basic }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo, debug]

For a real deployment, two more Collector components deserve attention. The spanmetrics connector derives RED metrics (rate, errors, duration) from incoming spans, so you get chat or execute_tool p99 in Prometheus without touching application code. The tail_sampling processor lets you drop trivial traces when volume bites: for example, keep only traces with span.gen_ai.usage.input_tokens > 0 or with errors. The default choices for the open stack follow the recommendations in the post on observability tools in 2026: Prometheus for metrics, Loki for logs, Tempo for traces, Grafana for everything visible.
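
The sampling rule in that example is easier to reason about as a plain function before you encode it in the Collector. The sketch below is only the decision logic, written against a simplified span shape (a dict with status and attributes) that I made up for illustration; it is not how the tail_sampling processor is configured.

```python
def keep_trace(spans: list, min_input_tokens: int = 1) -> bool:
    """Keep a trace if any span errored or reported at least min_input_tokens."""
    for span in spans:
        if span.get("status") == "error":
            return True
        tokens = span.get("attributes", {}).get("gen_ai.usage.input_tokens", 0)
        if tokens >= min_input_tokens:
            return True
    return False


health_check = [{"name": "GET /healthz", "status": "ok", "attributes": {}}]
agent_turn = [
    {"name": "invoke_agent", "status": "ok", "attributes": {}},
    {
        "name": "chat claude-opus-4-7",
        "status": "ok",
        "attributes": {"gen_ai.usage.input_tokens": 412},
    },
]
failed_tool = [{"name": "execute_tool create_ticket", "status": "error", "attributes": {}}]

print(keep_trace(health_check), keep_trace(agent_turn), keep_trace(failed_tool))
```

The decision is taken per trace, once all its spans have arrived, which is why this has to run in the Collector and not in the application.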

A note on logs: OTel log instrumentation is covered separately (see OpenTelemetry: unifying logs, metrics, and traces), and it’s worth standing it up in parallel if you want to correlate the application’s log output with agent spans via trace_id and span_id. That correlation is what enables Tempo’s “Logs for this span” one-click jump.
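
If you want the correlation before adopting full OTel logging, a logging filter is enough to stamp both ids on every record. This is a minimal sketch: TraceContextFilter is a hypothetical class, and the get_ids callable is a stand-in for reading the current span context (with the OpenTelemetry API that would be trace.get_current_span().get_span_context(), whose trace_id and span_id are integers).

```python
import io
import logging
from typing import Callable, Tuple


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id, as hex strings, to every log record."""

    def __init__(self, get_ids: Callable[[], Tuple[int, int]]):
        super().__init__()
        self._get_ids = get_ids

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._get_ids()
        record.trace_id = format(trace_id, "032x")  # 128-bit id, 32 hex chars
        record.span_id = format(span_id, "016x")    # 64-bit id, 16 hex chars
        return True


# Demo with fixed ids; a real service would read them from the active span.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(message)s trace_id=%(trace_id)s span_id=%(span_id)s")
)
logger = logging.getLogger("agent-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(
    TraceContextFilter(lambda: (0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
)

logger.info("tool finished")
print(stream.getvalue(), end="")
```

Loki can then extract trace_id from the line, and Grafana links it to the matching trace in Tempo.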

Dashboard: latency, tokens, errors per agent

With well-attributed spans and a collector routing to Tempo, TraceQL queries fall out almost for free. The syntax is documented at grafana.com/oss/tempo^[5] and in the official TraceQL reference^[6]: curly braces around each span selector, attributes prefixed by span. or resource., and && / || combinators. Three queries cover 80% of operational work:

traceql

# Tempo TraceQL — slow chats above 5s
{ name = "chat claude-opus-4-7" && duration > 5s }

# tool-use errors (set the query time range to the last 24 hours)
{ name =~ "execute_tool .*" && status = error }

# turns with high token usage (unexpected spend)
{ name = "chat claude-opus-4-7" && span.gen_ai.usage.input_tokens > 8000 }
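
The same queries can be run from a script against Tempo’s search endpoint, which takes the TraceQL expression in the q parameter and the time window as start and end in Unix seconds. The sketch below only builds the URL; the base address http://tempo:3200 is an assumption about your deployment, and tempo_search_url is a hypothetical helper.

```python
import time
from urllib.parse import parse_qs, urlencode, urlsplit


def tempo_search_url(base_url: str, traceql: str, start: int, end: int, limit: int = 20) -> str:
    """Build a Tempo search URL; start and end are Unix epoch seconds."""
    query = urlencode({"q": traceql, "start": start, "end": end, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{query}"


now = int(time.time())
tool_errors = '{ name =~ "execute_tool .*" && status = error }'
url = tempo_search_url("http://tempo:3200/", tool_errors, start=now - 86400, end=now)

# Round-trip check: the query survives URL encoding unchanged.
print(url)
print(parse_qs(urlsplit(url).query)["q"][0])
```

Note that the “last day” lives in start and end, not in the TraceQL expression itself, which is also how the Grafana time picker applies it.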

To turn these filters into living panels, the usual route is to derive metrics from the spans with the spanmetrics connector and, on Mimir or Prometheus, define a stable recording rule. That decouples the dashboard from the scan cost of a large TraceQL query:

yaml

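# Mimir/Prometheus recording rule: chat p99 by model.
# Assumes the spanmetrics connector exports durations in seconds and lists
# gen_ai.request.model under its dimensions.
groups:
  - name: jacar-genai
    interval: 30s
    rules:
      - record: gen_ai_chat_p99_latency_seconds
        # Span names look like "chat <model>", so match by prefix.
        expr: |
          histogram_quantile(0.99,
            sum by (le, gen_ai_request_model) (
              rate(traces_span_metrics_duration_seconds_bucket{span_name=~"chat( .+)?"}[5m])
            ))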
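
It is worth knowing what histogram_quantile does with those buckets, because the p99 on the panel is an estimate that depends on the bucket boundaries. The sketch below reimplements the core idea in plain Python (find the bucket the rank falls into, then interpolate linearly inside it); it is a simplified illustration with made-up bucket counts, not Prometheus’s exact implementation.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Assumes cumulative counts, as Prometheus stores them, with the last
    bucket being +Inf and the first bucket starting at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative chat latency buckets in seconds (illustrative numbers).
latency_buckets = [(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, latency_buckets))
print(histogram_quantile(0.9, latency_buckets))
```

The practical takeaway is that a p99 can never be more precise than the bucket it lands in, so if chat calls routinely take several seconds, the spanmetrics histogram needs boundaries in that range.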

On that base, the dashboard I’d recommend in 2026 has four panels:

Latency: gen_ai_chat_p99_latency_seconds per model and per agent.
Tokens: sum by (gen_ai_request_model) (rate(gen_ai_usage_input_tokens[5m])) for input and the equivalent for output, two series per model. This assumes token counts are exported as counters; the spanmetrics connector on its own only produces call counts and durations.
Errors: rate of status = error on chat and execute_tool spans, kept separate because root causes differ.
MCP: tools/call rate per server, p99 latency, errors.

If the team plans to run evals with Braintrust or LangSmith in parallel, having these four views in the open over the same GenAI semconv attributes prevents each tool from spinning up its own narrative. The dashboard deployment pattern overlaps with the recipes in multi-agent systems: LangGraph, CrewAI, and AutoGen, because once several agents coordinate, the agent name shows up as the gen_ai.agent.name attribute and becomes the dashboard’s main axis.

Conclusion

Agent observability in 2026 is a solved problem as long as you start from the OTel GenAI spec: instrument once with gen_ai.provider.name, gen_ai.operation.name, gen_ai.request.model, and the usage.input_tokens / usage.output_tokens pair, collect with the OTel Collector into Tempo, and query with TraceQL. The four pieces — provider SDK, span wrapper, Collector, Tempo — are orthogonal and each is replaceable on its own without touching the rest. The result is a dashboard that survives changes of model, provider, and agent framework, and that’s what makes the investment worth it.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.
