Observability and SLOs: Error Budgets That Get Met

Upward-trending line chart representing real-time service metrics

SLOs (Service Level Objectives) and error budgets are classic SRE concepts popularised by Google. Most mid-size teams know them, many “have” them, few genuinely manage them. The difference is whether the error budget informs decisions — whether a feature freeze triggers when the budget is exhausted, whether deploy velocity adjusts when it’s being consumed fast. This article is about making them actually work, not just documenting them.

The Basic Concept

  • SLI (Service Level Indicator): metric measuring something user-relevant. E.g. “requests completing in <500ms”, “availability” (uptime).
  • SLO (Service Level Objective): target for the SLI. E.g. “99.9% of requests in <500ms over 30 days”.
  • Error budget: the difference between 100% and the SLO. If the SLO is 99.9%, the budget is 0.1% ≈ 43 minutes per 30-day month.

The powerful part: if the budget is exhausted, you stop deploying new features and focus on stability.
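The arithmetic behind the budget is simple enough to sketch directly (a minimal illustration; the 30-day window matches the default suggested below):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))  # ~43.2 minutes per 30 days
print(error_budget_minutes(0.99))   # ~432 minutes
```

An order of magnitude per “nine”: each extra nine divides your margin for error by ten.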

Starting Without Ceremony

You don’t need an SLO committee. For any service:

  • Pick 2-3 relevant SLIs:
    • p99 latency of critical endpoints.
    • Availability (successful / total requests).
    • Freshness (data updated within X time) if applicable.
  • Define the SLO in discussion with product: “what latency would degrade the experience?”. Target: 99-99.9% for critical services; 95-99% for less critical ones.
  • Period: 30 days rolling as default.

No more is needed to start. Sophistication comes later if you need it.
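As a sanity check before touching any tooling, the core computation is trivial; a minimal Python sketch (the request counts here are made up):

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: ratio of successful to total requests."""
    return successful / total if total else 1.0

def slo_met(sli: float, slo: float = 0.999) -> bool:
    """Whether the measured SLI meets the target SLO."""
    return sli >= slo

sli = availability_sli(successful=999_500, total=1_000_000)  # 0.9995
print(slo_met(sli))  # meets a 99.9% SLO
```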

Prometheus Implementation

Practical API example:

# SLI: fraction of requests that succeed in <500ms
# (assumes a standard Prometheus histogram; `le="0.5"` selects the 0.5s bucket)
- record: service:sli:success_fast:ratio_rate5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{service="api", status!~"5..", le="0.5"}[5m]))
    /
    sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

# Error budget burn rate: error ratio relative to the budget of a 99.9% SLO
- record: service:error_budget:burn_rate_1h
  expr: |
    (1 - service:sli:success_fast:ratio_rate5m) / 0.001  # 0.001 = 1 - 0.999

# Alert if the burn rate is high
- alert: ErrorBudgetBurnFast
  expr: service:error_budget:burn_rate_1h > 14.4  # burns 2% of a 30-day budget in 1h
  for: 2m
  annotations:
    summary: "Consuming ~2% of the monthly error budget per hour"

Multi-window, multi-burn-rate alerts (from the Google SRE Workbook) are the standard: a high burn rate over a short window triggers an urgent alert; sustained moderate burn triggers a lower-severity one.
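The multi-window logic can be sketched in Python. The 14.4 and 3 thresholds are the commonly cited Workbook values for a 30-day window; the exact numbers depend on your SLO and paging tolerance:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget burns: 1.0 means exactly on budget."""
    return error_ratio / (1 - slo)

def page_urgently(burn_5m: float, burn_1h: float) -> bool:
    """Urgent page only if BOTH windows burn fast: the short window
    confirms the problem is still happening right now."""
    return burn_5m > 14.4 and burn_1h > 14.4

def open_ticket(burn_30m: float, burn_6h: float) -> bool:
    """Sustained moderate burn: a next-day ticket, not a page."""
    return burn_30m > 3 and burn_6h > 3
```

A brief spike that has already recovered fails the short-window condition, so nobody gets paged for it.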

Error Budget Policy: The Political Part

Defining the SLO is easy. The policy applied when it’s exhausted is where the real value lies.

Typical policy:

  • >50% budget consumed: caution, more deploy review.
  • >75% consumed: careful, reduce non-essential changes.
  • >100% consumed: feature freeze, only fixes. Invest in stability.
  • >150% consumed: escalate to management, audit causes.

The key is that the policy is actually respected. If product overrides the decision whenever it triggers, the SLO doesn’t exist in practice.
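Codifying the tiers makes the policy harder to argue with in the moment; a sketch of the table above (the action strings are illustrative):

```python
def policy_action(budget_consumed_pct: float) -> str:
    """Map error-budget consumption to the policy tier described above."""
    if budget_consumed_pct > 150:
        return "escalate: management review and cause audit"
    if budget_consumed_pct > 100:
        return "freeze: fixes and stability work only"
    if budget_consumed_pct > 75:
        return "restrict: reduce non-essential changes"
    if budget_consumed_pct > 50:
        return "caution: extra deploy review"
    return "normal: deploy as usual"

print(policy_action(110))  # freeze: fixes and stability work only
```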

SLI Design: The Real Work

Choosing good SLIs is the hardest part. Red flags:

  • Pure infra SLIs (CPU, memory): don’t measure user experience.
  • SLIs that don’t correlate with an annoyed user: “200 responses” may include empty responses.
  • Service-level aggregated SLIs for APIs with multiple critical and non-critical endpoints.
  • SLIs without clear time window.

Better:

  • User-visible metrics: public endpoint latency, user-seen errors, correct data.
  • SLIs by dimension: per endpoint, per tenant, per region — if they matter.
  • Consistent window: rolling 30 days is practical.
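Per-dimension SLIs are just the same ratio grouped by a label; a minimal sketch over in-memory request records (the endpoints and error flags are made up):

```python
from collections import defaultdict

def sli_by_dimension(requests: list[dict], key: str) -> dict[str, float]:
    """Availability SLI per dimension value (endpoint, tenant, region...)."""
    ok: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for req in requests:
        total[req[key]] += 1
        ok[req[key]] += 0 if req["error"] else 1
    return {k: ok[k] / total[k] for k in total}

requests = [
    {"endpoint": "/checkout", "error": False},
    {"endpoint": "/checkout", "error": True},
    {"endpoint": "/search", "error": False},
]
print(sli_by_dimension(requests, "endpoint"))
# {'/checkout': 0.5, '/search': 1.0}
```

A service-level aggregate would report 2/3 availability here and hide that checkout — the endpoint that matters — is at 50%.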

Multi-SLO: When There Are Multiple Services

Mid-size orgs have 20+ services. Multiplying SLOs per service is chaotic. Useful pattern:

  • SLOs per “user journey” instead of per service. E.g. “signup flow SLO” includes backend, frontend, email delivery.
  • SLOs aggregated per tenant — experience received by customers that matter most.
  • Tiered SLOs: “north star SLO” (customer-facing) + technical “component SLOs”.

Not every service needs a formal SLO; only customer-facing ones, or those with critical dependents, do.

Error Budget as Conversation Tool

The biggest value of SLOs/budgets is aligning conversations:

  • Product understands “going faster” has quantifiable cost (more budget consumption).
  • Engineering has a clear threshold for requesting stabilisation time.
  • Management sees metrics correlating with customer satisfaction.

Without this, decisions on “features or stability?” are political. With budgets, they’re data-driven.

Alert Fatigue and SLOs

Common mistake: making the burn-rate alert SEV-1. Result: teams woken every 10 days by noise.

Better pattern:

  • Multi-window multi-burn-rate: alerts matter only if burn is persistent.
  • SEV levels: extreme burn rate = SEV-1, moderate burn = next-day ticket.
  • Ticketing instead of pager for non-critical cases.

Goal is alerting when real action is needed, not every spike.

Tools

Typical SLO stacks:

  • Prometheus + Grafana + recording rules: DIY, flexible, requires work.
  • Sloth: recording rule and alert generator from a simple YAML.
  • Pyrra: SLO as code + native UI.
  • Datadog SLOs: integrated, easy, but vendor lock-in.
  • Google Cloud Service Monitoring: for GCP-native.
  • OpenSLO: a proposed vendor-neutral spec for defining SLOs as code; adoption is growing.

For small teams, Sloth + Prometheus is the sweet spot.

Anti-Patterns

Things I’ve seen break SLOs:

  • Aspirational SLOs without realism: 99.99% when service is really at 99%. Budget always consumed, policy ignored.
  • Not respecting freeze when exhausted: destroys mechanism credibility.
  • SLOs without clear owner: nobody maintains, they go stale.
  • Too many SLOs: 20 SLOs per team = none gets attention.
  • Manipulable SLIs: “500s don’t count if from that endpoint” — gaming destroys meaning.

Quarterly Review

SLOs aren’t static:

  • Each quarter, review SLOs with quarter data.
  • SLO too lax? (budget always available): tighten.
  • SLO too strict? (always exhausted): relax or invest in architecture.
  • SLI still represents experience? If product changed, SLI may have stopped correlating.
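The lax/strict check above is mechanical enough to automate; a sketch over a quarter’s monthly budget consumption (the 30%/100% thresholds are illustrative, not canonical):

```python
def review_slo(monthly_budget_consumed_pct: list[float],
               lax_below: float = 30.0, strict_above: float = 100.0) -> str:
    """Classify an SLO from its monthly budget consumption over a quarter."""
    avg = sum(monthly_budget_consumed_pct) / len(monthly_budget_consumed_pct)
    if avg < lax_below:
        return "too lax: consider tightening"
    if avg > strict_above:
        return "too strict: relax or invest in architecture"
    return "about right"

print(review_slo([10, 15, 20]))    # too lax: consider tightening
print(review_slo([120, 140, 90]))  # too strict: relax or invest in architecture
```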

Conclusion

SLOs and error budgets work when applied rigorously, not as ornamental documentation. The test is simple: do decisions change based on the budget? If yes, the system works. If not, it’s theatre. Start with 2-3 well-chosen SLIs, a clear freeze policy, and simple tooling (Prometheus + Sloth). Sophistication comes later; first, respect the basics.

Follow us on jacar.es for more on SRE, observability, and service reliability.
