Observability and SLOs: Error Budgets That Get Met
Updated: 2026-05-03
SLOs (Service Level Objectives) and error budgets are classic SRE concepts popularised by Google. Most mid-size teams know them, many “have” them, few genuinely manage by them. The difference is whether the error budget informs decisions: whether a feature freeze triggers when the budget is exhausted, and whether deploy velocity slows when the budget is being consumed quickly. This article is about making SLOs work in practice, not just documenting them.
Key Takeaways
- SLI measures something user-relevant; SLO defines the objective; the error budget is the gap between 100% and the SLO.
- The policy applied when the budget is exhausted is where the real value lies: if there are no consequences, the SLO does not exist.
- Choosing good SLIs is the hardest part: purely infra SLIs (CPU, memory) do not measure user experience.
- The most common anti-patterns are aspirational SLOs, not respecting the freeze, and SLOs without owners.
- Multi-window multi-burn-rate alerts are the standard way to avoid alert fatigue without losing real signal.
The Basic Concept
Three definitions to keep clear:
- SLI (Service Level Indicator): a metric measuring something user-relevant, e.g. the share of requests completing in <500ms, or availability (uptime).
- SLO (Service Level Objective): the target for the SLI over a time window, e.g. “99.9% of requests in <500ms over 30 days”.
- Error budget: the gap between 100% and the SLO. If the SLO is 99.9%, the budget is 0.1%, which for an availability SLI is about 43 minutes of downtime per 30 days.
The powerful part: if the budget is exhausted, you stop deploying new features and focus on stability.
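As a worked check of the 43-minute figure, the arithmetic for a 99.9% SLO over a 30-day window is below. The last line is an illustrative assumption: a hypothetical service handling 10 million requests in the window, which is not a figure from this article.

```latex
\begin{align*}
\text{error budget} &= 1 - \text{SLO} = 1 - 0.999 = 0.001 \\
\text{time budget} &= 0.001 \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min} \\
\text{request budget} &= 0.001 \times 10{,}000{,}000 = 10{,}000\ \text{failed requests (hypothetical traffic)}
\end{align*}
```

For a request-based SLI the last form is usually the more useful one, since the budget then shrinks with every bad request rather than with wall-clock downtime.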
Starting Without Ceremony
You do not need an SLO committee. For any service: pick 2-3 relevant SLIs, agree the SLO with product, and use a rolling 30-day window as the period. That is enough to start.
Prometheus Implementation
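groups:
  - name: api-slo
    rules:
      # SLI: fraction of requests that succeed in under 500 ms, over a 1h window.
      # Assumes a histogram named http_request_duration_seconds that carries
      # service and status labels; adjust the names to your instrumentation.
      - record: service:sli:success_fast:ratio_rate1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="api", status!~"5..", le="0.5"}[1h]))
          /
          sum(rate(http_request_duration_seconds_count{service="api"}[1h]))
      # Burn rate: observed error ratio divided by the 0.1% budget of a 99.9% SLO.
      - record: service:error_budget:burn_rate_1h
        expr: |
          (1 - service:sli:success_fast:ratio_rate1h) / 0.001
      # 14.4x burn spends 14.4 / 720 = 2% of a 30-day budget every hour.
      - alert: ErrorBudgetBurnFast
        expr: service:error_budget:burn_rate_1h > 14.4
        for: 2m
        annotations:
          summary: "Consuming 2% of the 30-day error budget per hour"

Multi-window multi-burn-rate alerts (Google SRE Workbook) are the standard: a high burn rate over a short window triggers an urgent alert; a sustained lower burn rate triggers a moderate one.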
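A minimal sketch of the two paging tiers from the SRE Workbook's multi-window scheme, assuming a 99.9% SLO and a counter named http_requests_total with service and status labels. The metric, the rule and alert names, and the severity label are illustrative, not taken from a real deployment. Each alert fires only when both a long and a short window exceed the same burn rate, so it stops firing soon after the problem stops.

```yaml
groups:
  - name: api-slo-multiwindow   # hypothetical group name
    rules:
      # Error ratio per window; metric and label names are assumptions.
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api"}[5m]))
      - record: service:http_errors:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{service="api"}[30m]))
      - record: service:http_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{service="api"}[1h]))
      - record: service:http_errors:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{service="api"}[6h]))

      # Fast burn: 14.4x over 1h and 5m (about 2% of a 30-day budget per hour).
      - alert: ErrorBudgetBurnFastMultiWindow
        expr: |
          service:http_errors:ratio_rate1h > (14.4 * 0.001)
          and
          service:http_errors:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning about 2% of the 30-day error budget per hour"

      # Slower burn: 6x over 6h and 30m (about 5% of the budget in 6 hours).
      - alert: ErrorBudgetBurnSustainedMultiWindow
        expr: |
          service:http_errors:ratio_rate6h > (6 * 0.001)
          and
          service:http_errors:ratio_rate30m > (6 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning about 5% of the 30-day error budget every 6 hours"
```

The Workbook also describes a ticket-level tier at burn rate 1 over 3-day and 6-hour windows; it is left out here to keep the sketch short.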
Error Budget Policy: The Political Part
Defining the SLO is easy. The policy applied when the budget is exhausted is where the real value lies.
A typical tiered policy, by share of the budget consumed:
- >50%: caution, more deploy review.
- >75%: reduce non-essential changes.
- >100%: feature freeze, only fixes.
- >150%: escalate to management, audit causes.
The key is that the policy is respected. If product overrides the freeze when triggered, the SLO does not exist in practice.
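The tiers above can be sketched as Prometheus rules that track the share of the 30-day budget already spent. This assumes a 99.9% SLO, an http_requests_total counter with service and status labels, and at least 30 days of metric retention; the rule names, alert names, and the policy_tier label are hypothetical.

```yaml
groups:
  - name: api-error-budget-policy   # hypothetical group name
    rules:
      # Fraction of the 30-day budget consumed: 1.0 means 100% spent.
      - record: service:error_budget:consumed_ratio_30d
        expr: |
          (
            sum(rate(http_requests_total{service="api", status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total{service="api"}[30d]))
          ) / 0.001

      - alert: ErrorBudgetHalfConsumed
        expr: service:error_budget:consumed_ratio_30d > 0.5
        labels:
          policy_tier: caution
      - alert: ErrorBudgetMostlyConsumed
        expr: service:error_budget:consumed_ratio_30d > 0.75
        labels:
          policy_tier: reduce-changes
      - alert: ErrorBudgetExhausted
        expr: service:error_budget:consumed_ratio_30d > 1
        labels:
          policy_tier: feature-freeze
      - alert: ErrorBudgetOverspent
        expr: service:error_budget:consumed_ratio_30d > 1.5
        labels:
          policy_tier: escalate
```

Higher tiers fire alongside the lower ones, so routing or inhibition in Alertmanager is needed if only the highest active tier should notify. The rules only make the state visible; whether the freeze is honoured remains a people decision.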
SLI Design: The Real Work
Choosing good SLIs is the hardest part. Red flags: purely infrastructure SLIs (CPU, memory), SLIs that do not correlate with an annoyed user, service-level aggregates that mix critical and non-critical endpoints, and SLIs without a clear time window.
Good SLIs measure user-visible metrics: public endpoint latency, errors the user sees, correct data.
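One way to avoid the mixed-aggregate red flag is to record the SLI per critical user journey instead of per service. The sketch below assumes a histogram named http_request_duration_seconds with service, route, and status labels; the /checkout route, the journey label, and the 500 ms threshold are placeholders.

```yaml
groups:
  - name: api-sli-by-journey   # hypothetical group name
    rules:
      # Share of checkout requests that are not 5xx and finish within 500 ms.
      - record: route:sli:success_fast:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="api", route="/checkout", status!~"5..", le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="api", route="/checkout"}[5m]))
        labels:
          journey: checkout
```

A slow health-check or admin endpoint no longer moves this number, which is the point: the SLI drops only when a user in that journey would notice.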
Error Budget as Conversation Tool
The biggest value of SLOs and error budgets is aligning conversations: product understands that going faster has a quantifiable cost; engineering has a clear threshold for requesting stabilisation time without having to “sell” the argument; management sees metrics that correlate with customer satisfaction.
Tools
Typical stacks: Prometheus + Grafana + recording rules (DIY, flexible); Sloth[1] (sweet spot for small teams); Pyrra[2] (SLO as code + native UI); Datadog SLOs (integrated, easy, vendor lock-in).
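For a sense of what “SLO as code” looks like, here is a minimal Sloth spec sketch. The layout follows the prometheus/v1 format as shown in the Sloth README, so check the current documentation before relying on it. The service name, owner label, queries, alert name, and severity values are all hypothetical.

```yaml
version: "prometheus/v1"
service: "api"                  # hypothetical service name
labels:
  owner: "platform-team"        # hypothetical owner label
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of API requests succeed over the SLO window."
    sli:
      events:
        # {{.window}} is filled in by Sloth for each window it generates.
        error_query: sum(rate(http_requests_total{service="api", status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="api"}[{{.window}}]))
    alerting:
      name: ApiHighErrorRate
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
```

From a spec like this Sloth generates the recording rules and multi-window burn-rate alerts, which removes most of the hand-written PromQL a DIY setup needs.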
Anti-Patterns
Things that break SLOs in practice:
- Aspirational SLOs without realism: 99.99% when the service is really at 99%. Budget always consumed, policy always ignored.
- Not respecting the freeze when the budget is exhausted: immediately destroys the mechanism's credibility.
- SLOs without a clear owner: nobody maintains them, they go stale.
- Too many SLOs: 20 SLOs per team = none gets real attention.
- Manipulable SLIs: gaming destroys the meaning of the system.
Conclusion
SLOs and error budgets work when applied rigorously, not as ornamental documentation. The test is simple: do decisions change based on the budget? If yes, the system works. If not, it is theatre. Start with 2-3 well-chosen SLIs, a clear freeze policy, and simple tools (Prometheus + Sloth). Sophistication comes later; first, respect the basics.