
LLM Observability: Traces, Costs, and Quality

Updated: 2026-05-03

LLM-based applications have an observability profile that differs from traditional applications. A well-instrumented REST API exposes latency, error rate, and throughput, metrics that Prometheus and Grafana handle perfectly. An LLM application additionally needs to capture prompt/response pairs to debug hallucinations, track real token cost per feature and per user, and measure response quality in ways conventional infrastructure instruments do not cover. The LLM observability stack already has mature tools for these three planes.

Key takeaways

  • LLM observability has three distinct planes: prompt traces, token costs, and response quality. Each requires specific instrumentation.
  • Langfuse is open-source and self-hostable; LangSmith integrates natively with LangChain; Helicone is the simplest proxy to deploy.
  • The most scalable evaluation pattern combines an automatic LLM judge for quality with periodic human samples to calibrate the judge.
  • Cost per feature or per user requires passing explicit metadata in each call — it is not automatic in any tool.
  • Prometheus + vLLM’s native metrics cover the infrastructure plane; specialised tools cover the application plane.

What to track in an LLM application

Infrastructure metrics — p95 latency, error rate, token throughput — are necessary but not sufficient. What the LLM layer adds:

Prompt/response traces: capture the full prompt, response, model used, temperature, and token count on each call. Essential for debugging unexpected behaviour — you cannot reproduce a hallucination without the exact prompt.

Token costs: input and output for each call, with per-model pricing, aggregated by user, session, and product feature. Without this tracking, a model’s production cost is opaque until the bill arrives.

Response quality: metrics that vary by application — RAG fidelity, response relevance, absence of hallucinations, compliance with expected format. These metrics are not automatic; they require either human evaluation or an LLM judge.

Decomposed latency: time to first token (TTFT) and end-to-end latency. In streaming applications, TTFT is what users perceive as “fast” or “slow”; total latency is secondary.
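
A minimal sketch of measuring both numbers with the OpenAI streaming client (the model and prompt are placeholders):

python
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
ttft = None
chunks = []

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is eBPF?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            # Time to first token: what streaming users perceive as speed
            ttft = time.monotonic() - start
        chunks.append(chunk.choices[0].delta.content)

# End-to-end latency: relevant for batch use, secondary in streaming UIs
total = time.monotonic() - start
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")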

Langfuse: open-source and self-hostable

Langfuse[1] has become the most complete open-source LLM observability tool available. It can be deployed on Kubernetes (official Helm chart) or used as a managed service. Integration is straightforward:

python
from langfuse.decorators import observe, langfuse_context
# Langfuse's drop-in OpenAI wrapper also records model, token usage, and cost
from langfuse.openai import openai

@observe()
def process_query(user_id: str, query: str) -> str:
    # Attach metadata so cost can later be aggregated per user and per feature
    langfuse_context.update_current_observation(
        metadata={"user_id": user_id, "feature": "search"},
    )

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

result = process_query("user-123", "What is eBPF?")

The @observe decorator captures the function's inputs, outputs, and latency; the langfuse.openai wrapper adds the model, token counts, and calculated monetary cost of each call. Traces appear in the Langfuse interface with the full prompt and response.

For RAG in production, Langfuse has native support for fidelity and relevance evaluation with an annotation flow combining automatic LLM evaluation and human feedback.
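
Scores can be attached to the active trace from inside any @observe-decorated function. A minimal sketch, where generate_answer and faithfulness_judge are hypothetical placeholders for your RAG pipeline and evaluator:

python
from langfuse.decorators import observe, langfuse_context

@observe()
def answer_with_eval(query: str) -> str:
    answer = generate_answer(query)            # placeholder: your RAG pipeline
    score = faithfulness_judge(query, answer)  # placeholder: e.g. an LLM judge
    # Attach the score to the current trace so it is queryable in the UI
    langfuse_context.score_current_trace(name="faithfulness", value=score)
    return answer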

LangSmith: native LangChain integration

LangSmith[2] is LangChain’s observability tool. The main advantage is transparent integration with LangChain and LangGraph: if your application uses LangChain chains or graphs, LangSmith traces them without additional instrumentation.

python
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "..."
os.environ["LANGCHAIN_PROJECT"] = "my-production-app"

# From here, all LangChain executions are traced
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm

result = chain.invoke({"question": "What is PagedAttention?"})
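
For code that does not go through LangChain, the langsmith SDK also exposes a @traceable decorator that reuses the same environment variables. A minimal sketch (the function and its run name are illustrative):

python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(name="summarize")  # shows up as a run in the configured project
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content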

Helicone: the simplest proxy

Helicone[3] takes a different approach: instead of SDKs or decorators, it works as an HTTP proxy between your application and the model’s API. Changing the base URL is the only code change.
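
A minimal sketch with the OpenAI Python client, assuming the Helicone key is in the HELICONE_API_KEY environment variable:

python
import os
from openai import OpenAI

# Point the client at Helicone's gateway; it logs the call and forwards it to OpenAI
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is eBPF?"}],
)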

The infrastructure plane: Prometheus + native metrics

Specialised LLM tools cover the application plane, but the infrastructure plane is still Prometheus territory. For vLLM- or TGI-based services, native Prometheus metrics cover:

  • Requests in flight and queued.
  • GPU KV cache occupancy.
  • Time to first token (p50, p95, p99).
  • End-to-end latency.
  • Error rates by type.

The official Grafana dashboard for vLLM combines these metrics with alerts on growing queue occupancy — the most important signal of insufficient capacity.
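
As a minimal sketch of what such an alert watches, the queue depth can be read straight from vLLM's /metrics endpoint (the URL and threshold here are assumptions to tune per deployment):

python
import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumed vLLM server address
QUEUE_ALERT_THRESHOLD = 10                     # illustrative threshold

def queued_requests() -> float:
    """Parse the vllm:num_requests_waiting gauge from the Prometheus text format."""
    body = requests.get(METRICS_URL, timeout=5).text
    for line in body.splitlines():
        if line.startswith("vllm:num_requests_waiting"):
            return float(line.rsplit(" ", 1)[1])
    return 0.0

if queued_requests() > QUEUE_ALERT_THRESHOLD:
    print("Queue is growing: capacity may be insufficient")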

Scalable quality evaluation patterns

The biggest challenge in LLM observability is not operational tracking, which is largely solved, but quality evaluation at scale. Human evaluations are expensive; automatic ones depend on an LLM judge that can itself err. The pattern that scales best combines both (a sketch of the judge follows the list):

  • Continuous automatic evaluation: an LLM (GPT-4o mini, Claude Haiku) evaluates each response against defined criteria — relevance, fidelity, format. Cost: 0.1-0.5 cents per evaluation.
  • Periodic human sampling: humans annotate 1-5% of evaluations to calibrate the automatic judge. If the automatic judge diverges from humans beyond a defined threshold, the judge prompt is adjusted.
  • Regression evaluation: when the main prompt or model changes, a reference set is re-evaluated to detect regressions.
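
A minimal sketch of the automatic judge (the criterion, prompt wording, and 1-5 scale are illustrative):

python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the relevance of the answer to the question on a scale of 1-5.
Reply with a single digit.

Question: {question}
Answer: {answer}"""

def judge_relevance(question: str, answer: str) -> int:
    # A cheap model keeps the per-evaluation cost in the tenths-of-a-cent range
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())

Comparing this judge's scores against the periodic human sample is what reveals when the judge prompt needs recalibrating.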

Cost per feature: the tracking that matters most

The most useful data for product decisions — “what does feature X cost per active user?” — is what no tool provides automatically. It requires passing explicit metadata in each call and implementing an aggregation pipeline.
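
A minimal sketch of the pattern with the langfuse.openai wrapper, which accepts user_id and metadata as extra arguments stored with the trace rather than sent to the model API (the field values are illustrative):

python
from langfuse.openai import openai  # Langfuse drop-in wrapper

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize my open tickets"}],
    # Consumed by the Langfuse wrapper, not forwarded to the OpenAI API
    user_id="user-123",
    metadata={"feature": "ticket-summary", "plan": "pro"},
)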

With that metadata, Langfuse lets you filter and aggregate cost by feature, plan, or user segment. Without explicit instrumentation, you only have total cost — useless for pricing decisions.

Conclusion

Mature LLM observability has three planes requiring separate instrumentation: infrastructure (Prometheus plus the server's native metrics), application (Langfuse, LangSmith, or Helicone for traces and costs), and quality (automatic evaluation with human calibration). Applications that only instrument the first plane have GPU visibility but are blind to quality and cost per feature, the two data points that matter most for product and business decisions.

References

  1. Langfuse
  2. LangSmith
  3. Helicone

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.