LLM observability has requirements that traditional application monitoring does not cover: tracking prompt/response pairs, token costs per call, and quality metrics rather than just uptime. A standard Prometheus/Grafana stack falls short here. Specialised tools have emerged: Langfuse, LangSmith, Helicone, Arize Phoenix. This article covers the options and common patterns.
What to Track
In LLM apps:
- Prompts/responses: full capture for debugging.
- Token usage: input + output per model.
- Costs: running total per user/feature/team.
- Latency: p50/p95/p99 per model.
- Errors: rate limits, timeouts.
- Quality metrics: Ragas scores, user feedback.
- A/B tests: compare prompts/models.
Langfuse
Langfuse is open-source, with both self-hosted and hosted cloud options.
from langfuse.decorators import observe
from langfuse.openai import openai  # drop-in wrapper around the OpenAI SDK

@observe()
def ask_question(q):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": q}],
    )
This auto-traces LLM calls; the dashboard shows prompts, responses, and costs.
Features:
- Self-hosted option.
- Prompt management.
- Evaluation integrations.
- User feedback capture.
LangSmith
LangSmith (from the LangChain team):
- Deep LangChain integration.
- Trace chains + agents.
- Evaluation tooling.
- Production monitoring.
Commercial. Generous free tier.
Helicone
- Proxy approach: OpenAI requests are routed through Helicone.
- Near-zero code change: point the client at a different base URL.
- Monitoring plus response caching.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
Easy integration. Self-hosted option.
Arize Phoenix
- Open-source.
- LLM + traditional ML.
- Embedding drift detection.
- Evaluation frameworks.
A strong fit for teams coming from classical ML.
Cost Tracking
Patterns:
- Per request: token count × price.
- Per user: aggregate for quotas.
- Per feature: which workflows cost most.
- Per model: compare efficiency.
- Trends: detect cost anomalies.
Essential for budget control.
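Per-request cost is simple arithmetic: token counts times price. A sketch with illustrative per-million-token prices (real prices vary by provider and change over time, so check current pricing):

```python
# Illustrative per-million-token prices in USD; not authoritative.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call: tokens x price per million tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


cost = request_cost("gpt-4o", input_tokens=1_000, output_tokens=500)
# 1000 * 2.50/1e6 + 500 * 10.00/1e6 = 0.0075
```

Aggregating these per-request values by user, feature, or model gives the other patterns above.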
Quality Metrics
Beyond uptime:
- Faithfulness: answer backed by context (RAG).
- Relevance: answer relevant to question.
- Hallucination detection.
- Toxicity/safety.
- User satisfaction: thumbs up/down.
- Task success rate: action completed?
Automate these with Ragas or TruLens; have humans review samples.
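For a feel of what an automated check computes, here is a deliberately naive faithfulness proxy based on word overlap with the retrieved context; real frameworks like Ragas use LLM-based judges instead:

```python
import re


def naive_faithfulness(answer: str, context: str) -> float:
    """Crude proxy: fraction of answer words that appear in the retrieved
    context. For illustration only, not a production metric."""
    context_words = set(re.findall(r"\w+", context.lower()))
    answer_words = re.findall(r"\w+", answer.lower())
    if not answer_words:
        return 0.0
    return sum(w in context_words for w in answer_words) / len(answer_words)


naive_faithfulness("Paris is the capital", "The capital of France is Paris.")  # 1.0
naive_faithfulness("Berlin is the capital", "The capital of France is Paris.")  # 0.75
```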
OpenTelemetry Integration
Emerging standard: OTel semantic conventions for LLM traces.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.model", "gpt-4o")
    span.set_attribute("llm.prompt.tokens", 150)
    # ... make the LLM call and record response attributes
Once standard OTel conventions for LLM spans stabilise, observability becomes tool-agnostic: any OTel backend can consume the traces.
Prompt Management
- Version prompts: like code.
- A/B test: serve multiple variants.
- Rollback: if new prompt degrades.
- Centralise: avoid hardcoded prompts.
Most of the tools above support this.
Feedback Loops
User feedback is critical. Channels:
- Thumbs up/down on responses.
- Explicit ratings.
- Implicit signals: time spent, engagement.
- Flag bad responses for review.
Use feedback to:
- Identify failure patterns.
- Improve prompts.
- Fine-tune (eventually).
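Closing the loop starts with aggregation. A sketch that computes thumbs-up rates per prompt version (the event shape is illustrative):

```python
from collections import Counter


def feedback_summary(events: list[dict]) -> dict:
    """Thumbs-up rate per prompt version, to spot failure patterns."""
    counts: Counter = Counter()
    for e in events:
        counts[(e["prompt_version"], e["rating"])] += 1
    versions = {e["prompt_version"] for e in events}
    return {
        v: counts[(v, "up")] / max(1, counts[(v, "up")] + counts[(v, "down")])
        for v in versions
    }


events = [
    {"prompt_version": "v1", "rating": "up"},
    {"prompt_version": "v1", "rating": "down"},
    {"prompt_version": "v2", "rating": "up"},
]
rates = feedback_summary(events)  # {"v1": 0.5, "v2": 1.0}
```

A version whose rate drops after a prompt change is a direct rollback signal.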
Data Privacy
- PII scrubbing: remove sensitive before logging.
- Retention limits: GDPR compliance.
- User consent: inform about logging.
- Regional compliance: EU hosting.
The self-hosted options from Langfuse and Phoenix help here.
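A minimal regex-based scrubber shows the idea of removing PII before logging; production systems typically combine regexes with NER-based detection rather than relying on patterns alone:

```python
import re

# Illustrative patterns only; real PII coverage needs far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def scrub(text: str) -> str:
    """Replace detected PII with placeholder tags before the text is logged."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


scrub("Contact jane@example.com")  # "Contact <EMAIL>"
```

Run the scrubber in the logging path itself, so raw PII never reaches the observability backend.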
Alerts
Actionable alerts:
- Cost spike: daily spend >50% above the trailing average.
- Latency: p95 above threshold.
- Error rate above 1%.
- Quality drop: eval scores trending down.
- Hallucination spike.
These integrate with standard alerting pipelines.
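The cost-spike rule is easy to implement against a daily spend series; `cost_spike` is an illustrative helper, not from any alerting tool:

```python
def cost_spike(today: float, daily_history: list[float],
               threshold: float = 0.5) -> bool:
    """Alert when today's spend exceeds the trailing average by >threshold."""
    if not daily_history:
        return False  # no baseline yet
    avg = sum(daily_history) / len(daily_history)
    return today > avg * (1 + threshold)


cost_spike(16.0, [10.0, 10.0, 10.0])  # True: 60% above the average
cost_spike(12.0, [10.0, 10.0, 10.0])  # False: only 20% above
```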
Dashboard Examples
Typical dashboards:
- Ops: uptime, latency, errors.
- Finance: cost trends, budget tracking.
- Product: usage, feature adoption.
- Quality: eval metrics trends.
Each team sees what is relevant to them.
Open-Source vs Commercial
- Open-source Langfuse: full self-host features.
- LangSmith: LangChain-focused, commercial.
- Open-source Helicone: free core features.
- Datadog LLM: integrated with existing Datadog.
If you prefer open-source, pick Langfuse or Phoenix; if you prefer managed, LangSmith or Helicone.
Integration with Existing Stack
LLM observability doesn’t replace traditional APM:
- Datadog/New Relic: app-level traces.
- Langfuse: LLM-specific spans.
- Correlation: same trace IDs.
Layered observability.
Recommended Setup
Minimum viable LLM observability:
- Capture all prompts/responses (sampling OK for high volume).
- Track token costs per request.
- Measure latency distributions.
- User feedback mechanism.
- Error logging.
- Monthly review dashboards.
Simple, but it covers the essentials.
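This minimum can start as one wrapper around the client call; `observed_call` and the stub response below are illustrative, not from any SDK:

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")


def observed_call(fn, model: str, prompt: str, **kwargs):
    """Log prompt metadata, latency, token usage, and errors around any
    OpenAI-style chat completion function."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    try:
        response = fn(model=model,
                      messages=[{"role": "user", "content": prompt}],
                      **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000
        log.info("id=%s model=%s latency=%.0fms in=%d out=%d",
                 request_id, model, latency_ms,
                 response.usage.prompt_tokens,
                 response.usage.completion_tokens)
        return response
    except Exception:
        log.exception("id=%s model=%s failed", request_id, model)
        raise


# Stand-in response objects so the wrapper can run without an API key.
class _Usage:
    prompt_tokens, completion_tokens = 10, 20


class _Resp:
    usage = _Usage()


resp = observed_call(lambda **kw: _Resp(), model="gpt-4o", prompt="hi")
```

Swap the lambda for a real client's `chat.completions.create` and you have prompt, latency, token, and error logging from day one; a dedicated tool can replace the wrapper later.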
Conclusion
LLM observability isn’t optional in production. Prompt/response tracking, cost management, and quality metrics are essential to operate LLM apps sustainably. The tools have matured: Langfuse for open-source self-hosting, LangSmith for the commercial LangChain ecosystem, Helicone for simplicity. Combined with eval frameworks like Ragas, they create a closed improvement loop. Teams just starting should pick one tool and capture the basics; iteration follows.
Follow us on jacar.es for more on LLM observability, monitoring, and ML in production.