Helicone: LLM observability in one line

Helicone is an open-source LLM observability platform you integrate by changing a single line: your client's base URL. It logs cost, latency and tokens for every call, adds caching and rate limiting, and you can self-host it with Docker. It is Apache-2.0 and has nearly 6,000 GitHub stars, though since March 2026 it sits in maintenance mode.

July 17, 2026 9 min 7

Artificial Intelligence

Langfuse: self-hosted agent observability

Langfuse is an open-source platform to observe, debug and evaluate AI applications and agents. You can self-host it with Docker Compose on Postgres, ClickHouse, Redis and S3 storage, and its Python SDK, built on OpenTelemetry, captures traces, spans and generations with their cost and latency. This guide explains how to deploy it and instrument an agent.

July 17, 2026 9 min 4

Artificial Intelligence

Agent observability with OpenTelemetry GenAI semconv in 2026

The OTel GenAI spec stabilizes attributes for LLMs, tools, and agents. Practical Python implementation with Anthropic + Grafana Tempo.

May 18, 2026 12 min 383

Methodologies

AI-integrated DevOps tools in my daily flow

After fourteen months testing AI-integrated DevOps tools across several teams, the stack that stays is small: Claude Code, Cursor, and Aider for code; PagerDuty AIOps, Datadog Bits AI, and Grafana Assistant for alert triage; and OpenTofu with OPA for infrastructure generation bounded by policy rules.

April 28, 2026 4 min 315 4.0

Artificial Intelligence

AI agent incidents: recovery runbooks that work

AI agents fail in production, and what matters is how you respond in the first twenty minutes. This runbook covers severity classification, isolating before investigating, purging contaminated memory, communicating without inventing facts, and turning every incident into a regression test before closing it as done.

April 28, 2026 4 min 243 4.7

Artificial Intelligence

Production-grade agent evaluations: the framework that works

After a year and a half of filling dashboards with agents in production, the question that separates teams that ship reliably from those flying blind remains the same: how do you measure whether the agent is working?

April 22, 2026 7 min 275 4.3

Artificial Intelligence

AI incident postmortems: what they have taught us

A selection of postmortems published between 2025 and 2026 by teams running AI systems in production reveals repeated patterns: guardrail failures, silent model drift, hidden vendor dependency, and a collection of near-misses worth distilling.

February 27, 2026 7 min 312 4.6

Methodologies

SRE with AI: dashboards that actually help

AI-powered dashboards have spent a couple of years promising magical anomaly detection and automatic root-cause analysis. The reality is more modest but also more useful, if you know how to separate the noise from the real value. An honest review of what works and what does not.

February 3, 2026 6 min 256 4.3

Technology

Observability tools I would recommend in 2026

After a decade of Prometheus, three years of consolidation around OpenTelemetry, and the open stack now mature with Grafana, Loki, and Tempo, concrete recommendations for teams starting or reviewing their observability layer: what fits, what is excess, and what to avoid.

January 13, 2026 6 min 298 4.0

Artificial Intelligence

AI agent observability: what to instrument first

Agents that chain calls to models, tools and memory are hard to debug without instrumentation designed for them. After a long year running agents in production, I cover what to measure first, which standards are consolidating, and which costly mistakes are avoided by getting the traces right from the start.

December 8, 2025 8 min 275

Technology

Parca, Beyla and Grafana: a sidecar-free observability stack

The combination of Parca for continuous profiling, Beyla for eBPF auto-instrumentation, and Grafana as the visualisation layer delivers deep observability without touching code. A look at how the three pieces fit together and where the limits still show.

August 13, 2025 8 min 234 4.5

Methodologies

Continuous profiling with eBPF in production

Continuous profiling with eBPF samples every process's execution stack every few milliseconds without touching the code, then stores the history so you can compare last week's performance with today's. The cost measured in production runs between 1% and 3% of CPU, and it pays off most in databases, API gateways and high-concurrency services.

June 8, 2025 6 min 242 4.5

Methodologies

The Site Reliability Workbook: patterns we still use

Seven years have passed since Google published the Workbook, and a good part of the book has not aged. I review the patterns we actually apply in small teams and the ones that turned out to be campus culture.

June 5, 2025 6 min 219

Methodologies

Zero Trust integrated with SIEM: what actually works

Two years after Zero Trust stopped being a marketing word, it is worth looking at how it connects with the SIEM teams run day to day. A look at useful signals, avoidable noise, and the decisions that actually change security posture.

March 22, 2025 7 min 270 4.5

Technology

eBPF for Continuous Profiling: Parca and Beyla

eBPF-based continuous profiling captures CPU flame graphs for every process on a Linux node around the clock, without instrumenting code or restarting services, at under 1% overhead. Parca covers the whole cluster, Beyla adds automatic HTTP/gRPC metrics and traces, and Pyroscope brings native per-language detail to the most critical services.

November 19, 2024 5 min 317 4.3

Artificial Intelligence

LLM Observability: Traces, Costs, and Quality

LLM applications need three distinct observability planes: prompt and response traces for debugging hallucinations, per-token and per-feature cost tracking, and response quality evaluation. Mature tools like Langfuse, LangSmith, and Helicone cover all three planes with specific instrumentation.

November 10, 2024 6 min 246

Architecture

Container Monitoring: Beyond cAdvisor

cAdvisor is still embedded in kubelet and covers surface metrics, but falls short for production Kubernetes. The modern minimum stack pairs it with kube-state-metrics, node-exporter, Prometheus, and Grafana as a base, eBPF for deep network and syscall visibility, and OpenTelemetry for application context.

May 29, 2024 3 min 229 4.6

Tools

Fluent Bit: Lightweight Log Collection in Production

Fluent Bit is the CNCF's lightweight log collector: a ~1.5 MB C binary that rarely tops 30 MB of memory in production. It beats Promtail, Vector, and Filebeat when several destinations or resource-constrained nodes are in play, thanks to a pipeline of inputs, parsers, filters, and outputs that stays easy to reason about and debug.

May 8, 2024 4 min 231 4.3

Methodologies

Observability and SLOs: Error Budgets That Get Met

SLOs and error budgets only work when the budget drives real decisions: a feature freeze that triggers on exhaustion, and deploy velocity that adjusts to consumption. With two or three well-chosen SLIs, a clear freeze policy, and simple tools like Prometheus with Sloth, a team can sustainably balance velocity and reliability in production.

February 29, 2024 5 min 222 4.6

Technology

Loki at Scale: Lessons from High-Volume Logs

Loki indexes only labels, not log content, which cuts storage costs dramatically compared to Elasticsearch. The main production risk is cardinality explosion: each unique label-value combination generates a stream that inflates the index and slows queries. Separating read and write paths ensures a heavy query cannot saturate ingestion.

February 11, 2024 5 min 208

Technology

Falco: Runtime Threat Detection with eBPF

Falco is a graduated CNCF project that hooks the Linux kernel via eBPF and detects syscall anomalies in real time without instrumenting applications. Deployed as a DaemonSet on Kubernetes, it emits JSON events and requires a triage process to deliver value. In production, alert fatigue is the most common operational pitfall.

December 31, 2023 5 min 244 4.4

Architecture

eBPF: Kernel Observability Without Recompiling

eBPF is a Linux kernel technology that lets you load and run verified, high-performance programs without recompiling the kernel or rebooting the system. It runs safely inside a virtual machine in the kernel and underpins tools such as Cilium, Pixie, Falco, and Tetragon for real-time tracing, networking, and security.

November 19, 2023 5 min 205 4.4

Architecture

PostgreSQL 16: Changes That Affect Day-to-Day Work

PostgreSQL 16, released in September 2023, adds logical replication from a standby, the pg_stat_io view for breaking down I/O by operation type and context, and parallel FULL OUTER JOIN support. Upgrading from 15 is straightforward; 13 loses support in November 2025, so plan the update soon.

November 7, 2023 5 min 207 4.3

Technology

The Grafana Stack: Loki, Tempo, and Mimir for Open Observability

The Grafana stack combines three open source projects: Loki for logs, Tempo for traces, and Mimir for metrics. All three keep data in object storage (S3/GCS) with a minimal index instead of indexing everything like Elasticsearch, which cuts cost sharply at high volume and lets you correlate metrics, logs, and traces from a single Grafana panel.

August 27, 2023 4 min 399 4.3

Architecture

OpenTelemetry: Unifying Logs, Metrics, and Traces

OpenTelemetry is the CNCF project, graduated in May 2026, that unifies logs, metrics, and traces under one SDK and the OTLP protocol, without locking you into a single backend. Traces have been stable since 2021 and metrics since 2023; logs are still maturing, but already worth adopting on new projects.

August 24, 2023 4 min 235 4.6

Architecture

Kubernetes 1.28: Sidecar Containers as First-Class Citizens

Kubernetes 1.28 introduces native sidecar containers in alpha via KEP-753: adding restartPolicy Always to initContainers ensures correct startup and shutdown ordering. It fixes Jobs that never terminate. Istio, Linkerd, and observability agents like Fluent Bit are the primary beneficiaries.

July 19, 2023 4 min 250 4.7

Methodologies

Prometheus: Writing Alerts That Won’t Get Ignored

To write Prometheus alerts that won't get ignored, alert on customer-observable symptoms (latency, error rate, saturation) instead of internal causes like CPU or memory, define SLOs with multi-window burn rate to scale severity, add a watchdog alert that confirms the system is still alive, and review the signal-to-noise ratio every quarter.

July 1, 2023 5 min 237 3.9

Architecture

Pixie: Native Kubernetes Observability Powered by eBPF

Pixie uses eBPF to automatically instrument Kubernetes clusters without modifying application code. A per-node agent captures HTTP, gRPC, SQL, and Redis traffic at the kernel level, exposing service maps, CPU profiles, and SQL traces within minutes. It complements Prometheus for reactive diagnosis with no sidecars or redeploys.

June 19, 2023 4 min 270

Technology

eBPF: High-Performance Monitoring in Linux

eBPF (Extended Berkeley Packet Filter) is a Linux kernel technology that runs verified programs directly inside the kernel, with no modules and no source-code changes. The kernel verifier rejects any unsafe program before it runs, letting teams monitor system calls, network traffic, and I/O at a much lower CPU cost than traditional external probes.

June 10, 2023 5 min 241 4.3