LLM Observability: Traces, Costs, and Quality


LLM observability has requirements that traditional app monitoring doesn't: prompt/response pairs, token costs per call, quality metrics (not just uptime). Plain Prometheus/Grafana falls short. Specialised tools exist: Langfuse, LangSmith, Helicone, Arize Phoenix. This article covers the options and patterns.

What to Track

In LLM apps:

  • Prompts/responses: full capture for debugging.
  • Token usage: input + output per model.
  • Costs: running total per user/feature/team.
  • Latency: p50/p95/p99 per model.
  • Errors: rate limits, timeouts.
  • Quality metrics: Ragas scores, user feedback.
  • A/B tests: compare prompts/models.

Langfuse

Langfuse: open-source + hosted.

from langfuse.decorators import observe
from langfuse.openai import openai

@observe()
def ask_question(q):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": q}]
    )

Auto-traces LLM calls. Dashboard with prompts, responses, costs.

Features:

  • Self-hosted option.
  • Prompt management.
  • Evaluation integrations.
  • User feedback capture.

LangSmith

LangSmith (LangChain):

  • Deep LangChain integration.
  • Trace chains + agents.
  • Evaluation tooling.
  • Production monitoring.

Commercial. Generous free tier.

Helicone

Helicone:

  • Proxy approach: OpenAI traffic routed through Helicone.
  • Zero code change: just point the client at a new base URL.
  • Monitoring + caching.

from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {HELICONE_API_KEY}"},  # your Helicone key
)

Easy integration. Self-hosted option.

Arize Phoenix

Phoenix:

  • Open-source.
  • LLM + traditional ML.
  • Embedding drift detection.
  • Evaluation frameworks.

Strong for teams from classical ML.

Cost Tracking

Patterns:

  • Per request: token count × price.
  • Per user: aggregate for quotas.
  • Per feature: which workflows cost most.
  • Per model: compare efficiency.
  • Trends: detect cost anomalies.

Essential for budget control.
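The per-request pattern is plain arithmetic: token counts times per-token price. A sketch with a hand-maintained price table (the prices are placeholders; check your provider's current pricing):

```python
# Illustrative per-1M-token prices in USD (placeholders, not current pricing).
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, computed from token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=1500, output_tokens=400)
# 1500 * 2.50/1e6 + 400 * 10.00/1e6 = 0.00375 + 0.004 = 0.00775 USD
```

Aggregating these per user, feature, or model is then a group-by over the call records.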

Quality Metrics

Beyond uptime:

  • Faithfulness: answer backed by context (RAG).
  • Relevance: answer relevant to question.
  • Hallucination detection.
  • Toxicity/safety.
  • User satisfaction: thumbs up/down.
  • Task success rate: action completed?

Automated via Ragas, TruLens. Humans review samples.

OpenTelemetry Integration

Emerging standard: OTel semantic conventions for LLM traces.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    # GenAI semantic convention attribute names (gen_ai.*), still stabilising
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 150)
    # ...

Future: standard OTel LLM conventions → tool-agnostic observability.

Prompt Management

  • Version prompts: like code.
  • A/B test: serve multiple variants.
  • Rollback: if new prompt degrades.
  • Centralise: avoid hardcoded prompts.

Most tools integrate this.

Feedback Loops

User feedback is critical:

  • Thumbs up/down on responses.
  • Explicit ratings.
  • Implicit signals: time spent, engagement.
  • Flag bad responses for review.

Use feedback to:

  • Identify failure patterns.
  • Improve prompts.
  • Fine-tune (eventually).
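Identifying failure patterns from thumbs up/down can start as simple aggregation, e.g. the down-vote rate per feature (a sketch; the event field names are illustrative):

```python
from collections import defaultdict

def downvote_rates(events: list[dict]) -> dict[str, float]:
    """Fraction of thumbs-down per feature from raw feedback events."""
    totals, downs = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e["feature"]] += 1
        if e["rating"] == "down":
            downs[e["feature"]] += 1
    return {f: downs[f] / totals[f] for f in totals}

events = [
    {"feature": "qa", "rating": "up"},
    {"feature": "qa", "rating": "down"},
    {"feature": "summarise", "rating": "up"},
    {"feature": "summarise", "rating": "up"},
]
rates = downvote_rates(events)  # {"qa": 0.5, "summarise": 0.0}
```

Features with outlier rates are where prompt fixes (and, eventually, fine-tuning data) come from.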

Data Privacy

  • PII scrubbing: remove sensitive before logging.
  • Retention limits: GDPR compliance.
  • User consent: inform about logging.
  • Regional compliance: EU hosting.

The self-hosted options from Langfuse and Phoenix cover this.
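PII scrubbing before logging can start with regex redaction of obvious patterns (a sketch; production setups usually combine this with NER-based detection):

```python
import re

# Illustrative patterns: email addresses and simple international phone numbers.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def scrub(text: str) -> str:
    """Replace PII matches with placeholder tokens before logging."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

scrub("Contact ana@example.com or +34 600 123 456")
# -> "Contact <EMAIL> or <PHONE>"
```

Run the scrubber on both prompt and response before the record leaves your infrastructure.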

Alerts

Actionable:

  • Cost spike: >50% above daily average.
  • Latency p95 > threshold.
  • Error rate > 1%.
  • Quality drop: eval scores dropping.
  • Hallucination spike.

Standard alerting tooling handles all of these.
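The cost-spike rule (today above 1.5× the recent daily average) is a one-liner over the cost history; a sketch:

```python
def cost_spike(daily_costs: list[float], today: float, threshold: float = 0.5) -> bool:
    """True when today's spend exceeds the daily average by more than `threshold` (50% default)."""
    avg = sum(daily_costs) / len(daily_costs)
    return today > avg * (1 + threshold)

history = [10.0, 12.0, 11.0, 9.0, 13.0]  # last 5 days, USD
cost_spike(history, today=18.0)  # avg 11.0, limit 16.5 -> True
cost_spike(history, today=14.0)  # -> False
```

The same shape works for latency and error-rate thresholds.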

Dashboard Examples

Typical dashboards:

  • Ops: uptime, latency, errors.
  • Finance: cost trends, budget tracking.
  • Product: usage, feature adoption.
  • Quality: eval metrics trends.

Each team sees what's relevant to it.

Open-Source vs Commercial

  • Langfuse: open-source, full feature set when self-hosted.
  • LangSmith: LangChain-focused, commercial.
  • Helicone: open-source core features, free.
  • Datadog LLM Observability: integrates with an existing Datadog stack.

For open-source preference, Langfuse/Phoenix. For managed, LangSmith/Helicone.

Integration with Existing Stack

LLM observability doesn’t replace traditional APM:

  • Datadog/New Relic: app-level traces.
  • Langfuse: LLM-specific spans.
  • Correlation: same trace IDs.

Layered observability.

Minimum viable LLM observability:

  1. Capture all prompts/responses (sampling OK for high volume).
  2. Track token costs per request.
  3. Measure latency distributions.
  4. User feedback mechanism.
  5. Error logging.
  6. Monthly review dashboards.

Simple but covers essentials.
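This minimum can be a thin wrapper around the client call that logs prompt, response, tokens, latency, and errors as JSON lines (a sketch; `call_llm` and the response dict shape are illustrative stand-ins for your client):

```python
import json
import logging
import time

logger = logging.getLogger("llm")

def observed_call(call_llm, model: str, prompt: str, **kwargs):
    """Wrap any LLM call with basic prompt/response/latency/error logging."""
    start = time.monotonic()
    entry = {"model": model, "prompt": prompt}
    try:
        response = call_llm(model=model, prompt=prompt, **kwargs)
        entry.update(response=response.get("text", ""),
                     input_tokens=response.get("input_tokens", 0),
                     output_tokens=response.get("output_tokens", 0))
        return response
    except Exception as exc:
        entry["error"] = repr(exc)
        raise
    finally:
        entry["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
        logger.info(json.dumps(entry))  # one JSON line per call

# Usage with a stand-in backend:
fake = lambda model, prompt: {"text": "hi", "input_tokens": 3, "output_tokens": 1}
observed_call(fake, model="gpt-4o", prompt="Say hi")
```

Swapping the logger for a Langfuse or Helicone client later doesn't change the call sites.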

Conclusion

LLM observability isn’t optional in production. Prompt/response tracking, cost management, and quality metrics are essential to operate LLM apps sustainably. The tooling is maturing: Langfuse for open-source self-hosting, LangSmith for the commercial route, Helicone for simplicity. Combined with eval frameworks (Ragas), they create a closed-loop improvement cycle. Teams just starting out: pick one tool and capture the basics. Iteration follows.

Follow us on jacar.es for more on LLM observability, monitoring, and ML production.
