LLM Observability: Traces, Costs, and Quality


LLM observability has requirements that traditional application monitoring does not cover: capturing prompt/response pairs, tracking token costs per call, and measuring quality metrics rather than just uptime. A standard Prometheus/Grafana stack falls short here, so specialized tools have emerged: Langfuse, LangSmith, Helicone, and Arize Phoenix. This article covers the options and common patterns.

What to track

In LLM applications, track:

  • Prompts/responses: full capture for debugging.
  • Token usage: input and output tokens per model.
  • Costs: running totals per user, feature, and team.
  • Latency: p50/p95/p99 per model.
  • Errors: rate limits, timeouts.
  • Quality metrics: Ragas scores, user feedback.
  • A/B tests: compare prompts and models.

Langfuse

Langfuse is open source, with a hosted option:

from langfuse.decorators import observe
from langfuse.openai import openai

@observe()
def ask_question(q):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": q}]
    )

The decorator automatically traces LLM calls; the dashboard shows prompts, responses, and costs.

Features:

  • Self-hosted option.
  • Prompt management.
  • Evaluation integrations.
  • User feedback capture.

LangSmith

LangSmith (from the LangChain team):

  • Deep LangChain integration.
  • Trace chains + agents.
  • Evaluation tooling.
  • Production monitoring.

Commercial, with a generous free tier.

Helicone

Helicone:

  • Proxy approach: OpenAI requests are routed through Helicone.
  • Zero code changes: just swap the base URL.
  • Monitoring plus caching.

from openai import OpenAI

# Authentication is handled via a Helicone-Auth header with your Helicone key.
client = OpenAI(base_url="https://oai.helicone.ai/v1")

Easy to integrate, with a self-hosted option.

Arize Phoenix

Phoenix:

  • Open-source.
  • Covers both LLM and traditional ML.
  • Embedding drift detection.
  • Evaluation frameworks.

A strong choice for teams coming from classical ML.

Cost tracking

Patterns:

  • Per request: token count × price.
  • Per user: aggregate for quotas.
  • Per feature: which workflows cost most.
  • Per model: compare efficiency.
  • Trends: detect cost anomalies.

Essential for budget control.
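The per-request and per-user patterns can be sketched in a few lines. The prices below are placeholders expressed per million tokens, not current rates:

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens -- real pricing changes often.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Per-request pattern from the list above: token count x price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Per-user aggregation, usable for quota enforcement.
user_costs: dict[str, float] = defaultdict(float)
user_costs["alice"] += request_cost("gpt-4o", 1_000, 500)
```

The same aggregation keyed by feature or model gives the other patterns; anomaly detection is then a comparison against the trailing daily average.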

Quality metrics

Beyond uptime:

  • Faithfulness: answer backed by context (RAG).
  • Relevance: answer relevant to question.
  • Hallucination detection.
  • Toxicity/safety.
  • User satisfaction: thumbs up/down.
  • Task success rate: action completed?

Automate these with Ragas or TruLens, and have humans review samples.

OpenTelemetry integration

An emerging standard: OpenTelemetry (OTel) semantic conventions for LLM traces.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    # Attribute names per the emerging GenAI semantic conventions
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 150)
    # ...

As the OTel LLM conventions standardize, observability data becomes tool-agnostic.

Prompt management

  • Version prompts: treat them like code.
  • A/B test: serve multiple variants.
  • Rollback: revert if a new prompt degrades quality.
  • Centralize: avoid hardcoded prompts.

Most of the tools above include this.

Feedback loops

User feedback is critical:

  • Thumbs up/down on responses.
  • Explicit ratings.
  • Implicit signals: time spent, engagement.
  • Flag bad responses for review.

Use the feedback to:

  • Identify failure patterns.
  • Improve prompts.
  • Fine-tune (eventually).
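Identifying failure patterns mostly means aggregating votes along some dimension. A minimal sketch, assuming feedback events are stored as (prompt_version, vote) pairs, which is an illustrative schema:

```python
from collections import Counter

# Illustrative feedback log: (prompt_version, +1/-1 vote) pairs.
feedback = [("v1", 1), ("v1", -1), ("v2", 1), ("v2", 1), ("v2", -1), ("v1", -1)]

def satisfaction_by_version(events: list[tuple[str, int]]) -> dict[str, float]:
    """Share of thumbs-up per prompt version, to spot failure patterns."""
    up: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for version, vote in events:
        total[version] += 1
        if vote > 0:
            up[version] += 1
    return {v: up[v] / total[v] for v in total}

rates = satisfaction_by_version(feedback)  # v1: 1/3, v2: 2/3
```

Slicing the same log by feature or user segment instead of version surfaces where prompts need improvement first.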

Data privacy

  • PII scrubbing: remove sensitive data before logging.
  • Retention limits: comply with GDPR.
  • User consent: inform users about logging.
  • Regional compliance: e.g. EU hosting.

The self-hosted options from Langfuse and Phoenix cover this.

Alerts

Make alerts actionable:

  • Cost spike: more than 50% above the daily average.
  • Latency: p95 above threshold.
  • Error rate above 1%.
  • Quality drop: eval scores trending down.
  • Hallucination spike.

Standard alerting tooling integrates here.
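The cost-spike rule above reduces to a one-line comparison. A minimal sketch, with the 50% threshold as the default:

```python
def cost_spike(today: float, daily_history: list[float], threshold: float = 0.5) -> bool:
    """Alert when today's spend exceeds the daily average by more than `threshold`."""
    avg = sum(daily_history) / len(daily_history)
    return today > avg * (1 + threshold)

# Trailing average of 8.0/day: the alert line sits at 12.0.
spike = cost_spike(13.0, [8.0, 8.0, 8.0, 8.0])  # True
```

The latency, error-rate, and eval-score rules follow the same shape: compare a current window against a threshold or a trailing baseline.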

Dashboard examples

Typical dashboards:

  • Ops: uptime, latency, errors.
  • Finance: cost trends, budget tracking.
  • Product: usage, feature adoption.
  • Quality: eval metrics trends.

Each team sees what is relevant to them.

Open-source vs commercial

  • Langfuse (open source): full features when self-hosted.
  • LangSmith: LangChain-focused, commercial.
  • Helicone (open source): core features free.
  • Datadog LLM Observability: integrates with an existing Datadog setup.

If you prefer open source, go with Langfuse or Phoenix; for a managed service, LangSmith or Helicone.

Integration with the existing stack

LLM observability does not replace traditional APM:

  • Datadog/New Relic: app-level traces.
  • Langfuse: LLM-specific spans.
  • Correlation: share the same trace IDs.

Layered observability.
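The correlation point can be sketched as generating one trace ID per request and attaching it to both layers. This is a hypothetical illustration, not a specific vendor API:

```python
import uuid

def handle_request(question: str) -> dict:
    """Attach one shared trace ID to both the APM span and the LLM span,
    so the two observability layers can be joined later."""
    trace_id = uuid.uuid4().hex
    apm_span = {"trace_id": trace_id, "span": "http_request"}  # e.g. Datadog side
    llm_span = {"trace_id": trace_id, "span": "llm_call", "prompt": question}  # e.g. Langfuse side
    return {"apm": apm_span, "llm": llm_span}

spans = handle_request("What is observability?")
```

In practice the APM agent usually owns the trace ID and you pass it down as metadata on the LLM span.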

Recommended setup

Minimum viable LLM observability:

  1. Capture all prompts/responses (sampling OK for high volume).
  2. Track token costs per request.
  3. Measure latency distributions.
  4. A user feedback mechanism.
  5. Error logging.
  6. Monthly dashboard reviews.

Simple, but it covers the essentials.
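Steps 1, 3, and 5 fit in a single decorator before you adopt a dedicated tool. A minimal sketch; `fake_llm` and the in-memory log are placeholders for your model call and backend:

```python
import functools
import time

CALL_LOG: list[dict] = []  # placeholder: ship these to your observability backend

def observe_llm(fn):
    """Minimal wrapper: capture prompt, response, latency, and errors per call."""
    @functools.wraps(fn)
    def wrapper(prompt: str, **kwargs):
        entry: dict = {"prompt": prompt, "error": None}
        start = time.time()
        try:
            entry["response"] = fn(prompt, **kwargs)
            return entry["response"]
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            entry["latency_ms"] = (time.time() - start) * 1000
            CALL_LOG.append(entry)
    return wrapper

@observe_llm
def fake_llm(prompt: str) -> str:
    return prompt.upper()  # stand-in for a real model call

fake_llm("hello")
```

Adding token counts, costs, and a feedback field to each entry upgrades this to the full list above; sampling the log keeps volume manageable.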

Conclusion

LLM observability is not optional in production. Prompt/response tracking, cost management, and quality metrics are essential for operating LLM apps sustainably. The tooling has matured: Langfuse for open-source self-hosting, LangSmith as the commercial option, Helicone for simplicity. Combined with evaluation frameworks such as Ragas, they close the improvement loop. Teams just starting out should pick one tool and capture the basics; iteration follows.

Follow us at jacar.es for more on LLM observability, monitoring, and ML in production.
