Methodologies

Observability and SLOs: Error Budgets That Get Met

Updated: 2026-05-03

SLOs (Service Level Objectives) and error budgets are classic SRE concepts popularised by Google. Most mid-size teams know them, many “have” them, few genuinely manage them. The difference lies in whether the error budget actually informs decisions: does a feature freeze trigger when the budget is exhausted? Does deploy velocity slow down when it is being consumed quickly? This article is about making error budgets actually work, not just documenting them.

Key Takeaways

  • SLI measures something user-relevant; SLO defines the objective; the error budget is the gap between 100% and the SLO.
  • The policy applied when the budget is exhausted is where the real value lies: if there are no consequences, the SLO does not exist.
  • Choosing good SLIs is the hardest part: purely infra SLIs (CPU, memory) do not measure user experience.
  • The most common anti-patterns are aspirational SLOs, not respecting the freeze, and SLOs without owners.
  • Multi-window multi-burn-rate alerts are the standard to avoid alert fatigue without losing real signal.

The Basic Concept

Three definitions to keep clear:

  • SLI (Service Level Indicator): metric measuring something user-relevant. E.g. “requests completing in <500ms”, “availability” (uptime).
  • SLO (Service Level Objective): target for the SLI. E.g. “99.9% of requests in <500ms over 30 days”.
  • Error budget: the gap between 100% and the SLO. If the SLO is 99.9%, the budget is 0.1%, about 43 minutes per 30-day month.

The powerful part: if the budget is exhausted, you stop deploying new features and focus on stability.
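
The budget arithmetic above can be sketched in a few lines. This is a minimal illustration; the function name and the window default are my own choices, not a standard API:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes over the window for a given SLO fraction."""
    return (1 - slo) * window_days * 24 * 60

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget
print(round(error_budget_minutes(0.999), 1))  # → 43.2
# A 99% SLO is ten times more forgiving
print(round(error_budget_minutes(0.99), 1))   # → 432.0
```

Running the numbers like this before committing to an SLO is a useful sanity check: 99.99% leaves barely four minutes a month, which most teams cannot honour.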

Starting Without Ceremony

You do not need an SLO committee. For any service: pick two or three relevant SLIs, define the SLO together with product, and use a rolling 30-day window as the period. That is enough to start.

Prometheus Implementation

```yaml
# SLI: fraction of requests that succeed in <500ms.
# Assumes a standard Prometheus latency histogram with an le="0.5" bucket;
# adjust metric and label names to your instrumentation.
- record: service:sli:success_fast:ratio_rate5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{service="api", status!~"5..", le="0.5"}[5m]))
    /
    sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

# Error budget burn rate (0.001 = the 0.1% budget of a 99.9% SLO)
- record: service:error_budget:burn_rate_1h
  expr: |
    (1 - service:sli:success_fast:ratio_rate5m) / 0.001

- alert: ErrorBudgetBurnFast
  expr: service:error_budget:burn_rate_1h > 14.4
  for: 2m
  annotations:
    summary: "Consuming 2% of monthly error budget per hour"
```

Multi-window multi-burn-rate alerts (Google SRE Workbook) are the standard: a high burn rate over a short window triggers an urgent page; a sustained moderate burn opens a ticket. This avoids alert fatigue without losing real signal.
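
Where does the 14.4 threshold in the alert come from? A burn rate is how many times faster than "exactly on budget" you are failing, so the rate that spends a given fraction of the budget in a given window follows directly from the window ratio. A small sketch (function name is mine, the threshold pairs are the classic SRE Workbook examples):

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the error budget
    within `alert_window_hours`, for an SLO window of `slo_window_hours`."""
    return budget_fraction * slo_window_hours / alert_window_hours

# Page: 2% of the monthly budget gone in 1 hour
print(round(burn_rate_threshold(0.02, 1), 1))  # → 14.4
# Ticket: 5% of the monthly budget gone in 6 hours
print(round(burn_rate_threshold(0.05, 6), 1))  # → 6.0
```

Pairing each threshold with a shorter confirmation window (e.g. 14.4 over both 1h and 5m) is what makes the alert reset quickly once the incident is over.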

Error Budget Policy: The Political Part

Defining the SLO is easy. The policy applied when the budget is exhausted is where the real value lies.

A typical tiered policy:

  • >50% consumed: caution, extra deploy review.
  • >75%: reduce non-essential changes.
  • >100%: feature freeze, only fixes.
  • >150%: escalate to management, audit causes.
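
The tiers above are simple enough to encode directly, which also forces the thresholds to be unambiguous. A minimal sketch; tier names and cut-offs mirror the example policy and should be adapted to your organisation:

```python
def budget_policy(consumed_fraction: float) -> str:
    """Map consumed error budget (1.0 = fully spent) to a policy tier."""
    if consumed_fraction > 1.5:
        return "escalate: management review, audit causes"
    if consumed_fraction > 1.0:
        return "freeze: only fixes and stability work"
    if consumed_fraction > 0.75:
        return "restrict: reduce non-essential changes"
    if consumed_fraction > 0.5:
        return "caution: extra deploy review"
    return "normal: ship as usual"

print(budget_policy(0.6))  # → caution: extra deploy review
print(budget_policy(1.2))  # → freeze: only fixes and stability work
```

Publishing something like this in the team runbook removes the debate about "how exhausted is exhausted" when the freeze question comes up.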

The key is that the policy is respected. If product overrides the freeze when triggered, the SLO does not exist in practice.

SLI Design: The Real Work

Choosing good SLIs is the hardest part. Red flags: purely infrastructure SLIs (CPU, memory); SLIs that do not correlate with an annoyed user; service-level aggregates that mix critical and non-critical endpoints; SLIs without a clear time window.

Good SLIs measure what the user actually sees: latency of public endpoints, errors surfaced to the user, correctness of the data returned.

Error Budget as Conversation Tool

The biggest value of SLOs and error budgets is that they align conversations: product understands that going faster has a quantifiable cost; engineering has a clear threshold for requesting stabilisation time without having to “sell” the argument; management sees metrics that correlate with customer satisfaction.

Tools

Typical stacks: Prometheus + Grafana + recording rules (DIY, flexible); Sloth[1] (sweet spot for small teams); Pyrra[2] (SLO as code + native UI); Datadog SLOs (integrated, easy, vendor lock-in).

Anti-Patterns

Things that break SLOs in practice:

  • Aspirational SLOs without realism: 99.99% when the service is really at 99%. Budget always consumed, policy always ignored.
  • Not respecting the freeze when exhausted: immediately destroys mechanism credibility.
  • SLOs without clear owner: nobody maintains them, they go stale.
  • Too many SLOs: 20 SLOs per team = none gets real attention.
  • Manipulable SLIs: gaming destroys the meaning of the system.

Conclusion

SLOs and error budgets work when applied rigorously, not as ornamental documentation. The test is simple: do decisions change based on the budget? If yes, the system works. If not, it is theatre. Start with 2-3 well-chosen SLIs, a clear freeze policy, and simple tools (Prometheus + Sloth). Sophistication comes later; first, respect the basics.

  1. Sloth
  2. Pyrra

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.