Observability and SLOs: Error Budgets That Get Met

Upward-trending line chart representing real-time service metrics

SLOs (Service Level Objectives) and error budgets are classic SRE concepts popularised by Google. Most mid-size teams know them, many “have” them, few genuinely manage them. The difference is whether the error budget informs decisions — whether a feature freeze triggers when the budget is exhausted, whether deploy velocity adjusts when it’s being consumed fast. This article is about making them actually work, not just documenting them.

The Basic Concept

  • SLI (Service Level Indicator): metric measuring something user-relevant. E.g. “requests completing in <500ms”, “availability” (uptime).
  • SLO (Service Level Objective): target for the SLI. E.g. “99.9% of requests in <500ms over 30 days”.
  • Error budget: the difference between 100% and the SLO. If the SLO is 99.9%, the budget is 0.1% ≈ 43 minutes per 30-day month.

The powerful part: if the budget is exhausted, you stop deploying new features and focus on stability.
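The arithmetic behind the budget is simple enough to sketch directly (a minimal illustration; the 30-day window matches the default suggested below):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))  # ~43.2 minutes per 30 days
print(error_budget_minutes(0.99))   # ~432 minutes
```

An order of magnitude per “nine”: each extra nine divides your margin for error by ten.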

Starting Without Ceremony

You don’t need an SLO committee. For any service:

  • Pick 2-3 relevant SLIs:
    • p99 latency of critical endpoints.
    • Availability (successful / total requests).
    • Freshness (data updated within X time) if applicable.
  • Define the SLO in discussion with product: “what latency would degrade the experience?”. Target: 99-99.9% for critical services; 95-99% for less critical ones.
  • Period: 30 days rolling as default.

No more is needed to start. Sophistication comes later if you need it.
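As a sanity check before touching any tooling, the core computation is trivial; a minimal Python sketch (the request counts here are made up):

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: ratio of successful to total requests."""
    return successful / total if total else 1.0

def slo_met(sli: float, slo: float = 0.999) -> bool:
    """Whether the measured SLI meets the target SLO."""
    return sli >= slo

sli = availability_sli(successful=999_500, total=1_000_000)  # 0.9995
print(slo_met(sli))  # meets a 99.9% SLO
```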

Prometheus Implementation

Practical API example:

# SLI: fraction of requests that succeed in <500ms
# (assumes a standard Prometheus histogram; `le="0.5"` selects the 0.5s bucket)
- record: service:sli:success_fast:ratio_rate5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{service="api", status!~"5..", le="0.5"}[5m]))
    /
    sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

# Error budget burn rate: error ratio relative to the budget of a 99.9% SLO
- record: service:error_budget:burn_rate_1h
  expr: |
    (1 - service:sli:success_fast:ratio_rate5m) / 0.001  # 0.001 = 1 - 0.999

# Alert if the burn rate is high
- alert: ErrorBudgetBurnFast
  expr: service:error_budget:burn_rate_1h > 14.4  # burns 2% of a 30-day budget in 1h
  for: 2m
  annotations:
    summary: "Consuming ~2% of the monthly error budget per hour"

Multi-window, multi-burn-rate alerts (from the Google SRE Workbook) are the standard: a high burn rate over a short window triggers an urgent alert; sustained moderate burn triggers a lower-severity one.
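The multi-window logic can be sketched in Python. The 14.4 and 3 thresholds are the commonly cited Workbook values for a 30-day window; the exact numbers depend on your SLO and paging tolerance:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget burns: 1.0 means exactly on budget."""
    return error_ratio / (1 - slo)

def page_urgently(burn_5m: float, burn_1h: float) -> bool:
    """Urgent page only if BOTH windows burn fast: the short window
    confirms the problem is still happening right now."""
    return burn_5m > 14.4 and burn_1h > 14.4

def open_ticket(burn_30m: float, burn_6h: float) -> bool:
    """Sustained moderate burn: a next-day ticket, not a page."""
    return burn_30m > 3 and burn_6h > 3
```

A brief spike that has already recovered fails the short-window condition, so nobody gets paged for it.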

Error Budget Policy: The Political Part

Defining the SLO is easy. The policy applied when it’s exhausted is where the real value lies.

Typical policy:

  • >50% budget consumed: caution, more deploy review.
  • >75% consumed: careful, reduce non-essential changes.
  • >100% consumed: feature freeze, only fixes. Invest in stability.
  • >150% consumed: escalate to management, audit causes.

The key is that the policy is actually respected. If product overrides the decision whenever it triggers, the SLO doesn’t exist in practice.
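Codifying the tiers makes the policy harder to argue with in the moment; a sketch of the table above (the action strings are illustrative):

```python
def policy_action(budget_consumed_pct: float) -> str:
    """Map error-budget consumption to the policy tier described above."""
    if budget_consumed_pct > 150:
        return "escalate: management review and cause audit"
    if budget_consumed_pct > 100:
        return "freeze: fixes and stability work only"
    if budget_consumed_pct > 75:
        return "restrict: reduce non-essential changes"
    if budget_consumed_pct > 50:
        return "caution: extra deploy review"
    return "normal: deploy as usual"

print(policy_action(110))  # freeze: fixes and stability work only
```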

SLI Design: The Real Work

Choosing good SLIs is the hardest part. Red flags:

  • Pure infra SLIs (CPU, memory): don’t measure user experience.
  • SLIs that don’t correlate with an annoyed user: “200 responses” may include empty responses.
  • Service-level aggregated SLIs for APIs with multiple critical and non-critical endpoints.
  • SLIs without clear time window.

Better:

  • User-visible metrics: public endpoint latency, user-seen errors, correct data.
  • SLIs by dimension: per endpoint, per tenant, per region — if they matter.
  • Consistent window: rolling 30 days is practical.
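Per-dimension SLIs are just the same ratio grouped by a label; a minimal sketch over in-memory request records (the endpoints and error flags are made up):

```python
from collections import defaultdict

def sli_by_dimension(requests: list[dict], key: str) -> dict[str, float]:
    """Availability SLI per dimension value (endpoint, tenant, region...)."""
    ok: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for req in requests:
        total[req[key]] += 1
        ok[req[key]] += 0 if req["error"] else 1
    return {k: ok[k] / total[k] for k in total}

requests = [
    {"endpoint": "/checkout", "error": False},
    {"endpoint": "/checkout", "error": True},
    {"endpoint": "/search", "error": False},
]
print(sli_by_dimension(requests, "endpoint"))
# {'/checkout': 0.5, '/search': 1.0}
```

A service-level aggregate would report 2/3 availability here and hide that checkout — the endpoint that matters — is at 50%.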

Multi-SLO: When There Are Multiple Services

Mid-size orgs have 20+ services. Multiplying SLOs per service is chaotic. Useful pattern:

  • SLOs per “user journey” instead of per service. E.g. “signup flow SLO” includes backend, frontend, email delivery.
  • SLOs aggregated per tenant — experience received by customers that matter most.
  • Tiered SLOs: “north star SLO” (customer-facing) + technical “component SLOs”.

Not every service needs a formal SLO; only customer-facing ones, or those with critical dependents, do.

Error Budget as Conversation Tool

The biggest value of SLOs/budgets is aligning conversations:

  • Product understands “going faster” has quantifiable cost (more budget consumption).
  • Engineering has a clear threshold for requesting stabilisation time.
  • Management sees metrics correlating with customer satisfaction.

Without this, decisions on “features or stability?” are political. With budgets, they’re data-driven.

Alert Fatigue and SLOs

Common mistake: making the burn-rate alert SEV-1. Result: teams woken every 10 days by noise.

Better pattern:

  • Multi-window multi-burn-rate: alerts matter only if burn is persistent.
  • SEV levels: extreme burn rate = SEV-1, moderate burn = next-day ticket.
  • Ticketing instead of pager for non-critical cases.

Goal is alerting when real action is needed, not every spike.

Tools

Typical SLO stacks:

  • Prometheus + Grafana + recording rules: DIY, flexible, requires work.
  • Sloth: recording rule and alert generator from a simple YAML.
  • Pyrra: SLO as code + native UI.
  • Datadog SLOs: integrated, easy, but vendor lock-in.
  • Google Cloud Service Monitoring: for GCP-native.
  • OpenSLO: a proposed vendor-neutral spec for defining SLOs as code; adoption is growing.

For small teams, Sloth + Prometheus is the sweet spot.

Anti-Patterns

Things I’ve seen break SLOs:

  • Aspirational SLOs without realism: 99.99% when service is really at 99%. Budget always consumed, policy ignored.
  • Not respecting freeze when exhausted: destroys mechanism credibility.
  • SLOs without clear owner: nobody maintains, they go stale.
  • Too many SLOs: 20 SLOs per team = none gets attention.
  • Manipulable SLIs: “500s don’t count if from that endpoint” — gaming destroys meaning.

Quarterly Review

SLOs aren’t static:

  • Each quarter, review SLOs with quarter data.
  • SLO too lax? (budget always available): tighten.
  • SLO too strict? (always exhausted): relax or invest in architecture.
  • SLI still represents experience? If product changed, SLI may have stopped correlating.
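The lax/strict check above is mechanical enough to automate; a sketch over a quarter’s monthly budget consumption (the 30%/100% thresholds are illustrative, not canonical):

```python
def review_slo(monthly_budget_consumed_pct: list[float],
               lax_below: float = 30.0, strict_above: float = 100.0) -> str:
    """Classify an SLO from its monthly budget consumption over a quarter."""
    avg = sum(monthly_budget_consumed_pct) / len(monthly_budget_consumed_pct)
    if avg < lax_below:
        return "too lax: consider tightening"
    if avg > strict_above:
        return "too strict: relax or invest in architecture"
    return "about right"

print(review_slo([10, 15, 20]))    # too lax: consider tightening
print(review_slo([120, 140, 90]))  # too strict: relax or invest in architecture
```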

Conclusion

SLOs and error budgets work when applied rigorously, not as ornamental documentation. The test is simple: do decisions change based on the budget? If yes, the system works. If not, it’s theatre. Start with 2-3 well-chosen SLIs, a clear freeze policy, and simple tooling (Prometheus + Sloth). Sophistication comes later; first, respect the basics.

Follow us on jacar.es for more on SRE, observability, and service reliability.
