SLOs (Service Level Objectives) and error budgets are classic SRE concepts popularised by Google. Most mid-size teams know them, many “have” them, few genuinely manage them. The difference is whether the error budget informs decisions: whether a feature freeze triggers when the budget is exhausted, whether deploy velocity adjusts when it is being consumed fast. This article is about making them actually work, not just documenting them.
The Basic Concept
- SLI (Service Level Indicator): metric measuring something user-relevant. E.g. “requests completing in <500ms”, “availability” (uptime).
- SLO (Service Level Objective): target for the SLI. E.g. “99.9% of requests in <500ms over 30 days”.
- Error budget: the difference between 100% and the SLO. If the SLO is 99.9%, the budget is 0.1%, about 43 minutes of bad time per 30 days.
The powerful part: if the budget is exhausted, you stop deploying new features and focus on stability.
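The budget arithmetic is worth checking once by hand. A minimal sketch in Python (the function name is mine, purely illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of 'bad' time an SLO allows over a rolling window."""
    budget_fraction = 1.0 - slo            # e.g. 1 - 0.999 = 0.001
    return budget_fraction * window_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 99.9% over 30 days -> 43.2
print(round(error_budget_minutes(0.99), 1))    # 99%   over 30 days -> 432.0
```

Note how each extra nine shrinks the budget by 10x: that is why 99.99% is a very different operational commitment than 99.9%.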
Starting Without Ceremony
You don’t need an SLO committee. For any service:
- Pick 2-3 relevant SLIs:
- p99 latency of critical endpoints.
- Availability (successful / total requests).
- Freshness (data updated within X time) if applicable.
- Define the SLO by discussing it with product (“what latency would degrade the experience?”). Target 99-99.9% for critical services, 95-99% for less critical ones.
- Period: 30 days rolling as default.
No more is needed to start. Sophistication comes later if you need it.
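Written down, that starting point can be as small as a list of plain records. A sketch; the field names are mine, not any tool’s schema:

```python
# Ceremony-free SLO inventory: one entry per SLI worth tracking.
slos = [
    {"service": "api", "sli": "p99 latency < 500ms on critical endpoints",
     "target": 0.999, "window_days": 30},
    {"service": "api", "sli": "availability (successful / total requests)",
     "target": 0.999, "window_days": 30},
    {"service": "etl", "sli": "data freshness < 15 min",
     "target": 0.99, "window_days": 30},
]

# Guard against aspirational targets from day one.
for slo in slos:
    assert 0.95 <= slo["target"] <= 0.999, "keep initial targets realistic"
```

A file like this in the service repo, reviewed like code, is already more than most teams have.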
Prometheus Implementation
Practical API example:
```yaml
# SLI: fraction of requests answered successfully in <500ms.
# Assumes a Prometheus histogram http_request_duration_seconds with a
# status label; adjust metric and label names to your instrumentation.
- record: service:sli:success_fast:ratio_rate5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{service="api", status!~"5..", le="0.5"}[5m]))
    /
    sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

# Error budget burn rate over the last hour
# (1.0 = spending the budget exactly at the pace the SLO allows)
- record: service:error_budget:burn_rate_1h
  expr: |
    (1 - avg_over_time(service:sli:success_fast:ratio_rate5m[1h])) / 0.001  # 0.001 = 1 - 99.9% SLO

# Alert if the burn rate is high
- alert: ErrorBudgetBurnFast
  expr: service:error_budget:burn_rate_1h > 14.4  # spends 2% of a 30-day budget per hour
  for: 2m
  annotations:
    summary: "Consuming 2% of the monthly error budget per hour"
```
Multi-window, multi-burn-rate alerts (from the Google SRE Workbook) are the standard: a high burn rate over a short window triggers an urgent alert; a sustained moderate burn triggers a lower-severity one.
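Where does a threshold like 14.4 come from? A burn rate of 1 spends the budget exactly over the full window, so spending 2% of a 30-day budget in a single hour requires a much faster burn. A sketch of the derivation (function name is mine):

```python
def burn_rate_threshold(budget_spent: float, in_hours: float,
                        window_days: int = 30) -> float:
    """Burn rate that spends `budget_spent` of the budget in `in_hours`."""
    window_hours = window_days * 24
    return budget_spent * window_hours / in_hours

print(round(burn_rate_threshold(0.02, 1), 1))    # 2% in 1h  -> 14.4
print(round(burn_rate_threshold(0.05, 6), 1))    # 5% in 6h  -> 6.0
print(round(burn_rate_threshold(0.10, 72), 1))   # 10% in 3d -> 1.0
```

These three values (14.4, 6, 1) are exactly the classic page/page/ticket ladder from the SRE Workbook.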
Error Budget Policy: The Political Part
Defining the SLO is easy. The policy applied when the budget is exhausted is where the real value lies.
Typical policy:
- >50% budget consumed: caution, more deploy review.
- >75% consumed: careful, reduce non-essential changes.
- >100% consumed: feature freeze, only fixes. Invest in stability.
- >150% consumed: escalate to management, audit causes.
The key is that the policy is actually respected. If product overrides the decision every time it triggers, the SLO doesn’t exist.
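The escalation ladder above can even live as executable policy, so there is no ambiguity about what each consumption level means. A sketch; the action strings mirror this article, not any real tool:

```python
def budget_policy(consumed: float) -> str:
    """Map fraction of error budget consumed to a policy action."""
    if consumed > 1.50:
        return "escalate: management review, audit causes"
    if consumed > 1.00:
        return "feature freeze: fixes and stability work only"
    if consumed > 0.75:
        return "reduce non-essential changes"
    if consumed > 0.50:
        return "caution: extra deploy review"
    return "normal operations"

print(budget_policy(0.30))   # -> normal operations
print(budget_policy(1.10))   # -> feature freeze: fixes and stability work only
```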
SLI Design: The Real Work
Choosing good SLIs is the hardest part. Red flags:
- Pure infra SLIs (CPU, memory): don’t measure user experience.
- SLIs that don’t correlate with an annoyed user: “200 responses” may include empty responses.
- Service-wide aggregated SLIs for APIs that mix critical and non-critical endpoints: the aggregate hides the critical path.
- SLIs without clear time window.
Better:
- User-visible metrics: public endpoint latency, user-seen errors, correct data.
- SLIs by dimension: per endpoint, per tenant, per region — if they matter.
- Consistent window: rolling 30 days is practical.
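The per-dimension point is easy to see with numbers. In this sketch (made-up request counts), the high-volume health check masks a badly degraded checkout endpoint in the service-wide SLI:

```python
requests = [
    # (endpoint, total requests, errors)
    ("/checkout", 10_000, 120),
    ("/search",  500_000, 500),
    ("/health",  900_000, 0),
]

# Per-endpoint availability SLI
for endpoint, total, errors in requests:
    print(f"{endpoint}: {(total - errors) / total:.4f}")
# /checkout: 0.9880 -- well below a 99.9% SLO

# Service-wide aggregate: /health's volume hides the problem
total = sum(t for _, t, _ in requests)
errors = sum(e for _, _, e in requests)
print(f"service-wide: {(total - errors) / total:.4f}")  # -> 0.9996
```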
Multi-SLO: When There Are Multiple Services
Mid-size orgs easily run 20+ services, and multiplying formal SLOs per service gets chaotic. Useful patterns:
- SLOs per “user journey” instead of per service. E.g. “signup flow SLO” includes backend, frontend, email delivery.
- SLOs aggregated per tenant — experience received by customers that matter most.
- Tiered SLOs: “north star SLO” (customer-facing) + technical “component SLOs”.
Not every service needs formal SLO; only customer-facing or those with critical dependencies.
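A journey SLO also makes a useful point about composition: if signup touches backend, frontend, and email delivery serially, the journey can do no better than the product of their availabilities. A sketch with illustrative numbers:

```python
# Components a signup request passes through, with their availabilities.
components = {"frontend": 0.999, "backend": 0.999, "email": 0.995}

journey = 1.0
for availability in components.values():
    journey *= availability

print(f"signup journey ceiling: {journey:.4f}")  # -> 0.9930
```

Three components at “three nines or better” still cap the journey below 99.5%, which is why journey SLOs are usually looser than component SLOs.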
Error Budget as Conversation Tool
The biggest value of SLOs/budgets is aligning conversations:
- Product understands “going faster” has quantifiable cost (more budget consumption).
- Engineering has a clear threshold for requesting stabilisation time.
- Management sees metrics correlating with customer satisfaction.
Without this, decisions on “features or stability?” are political. With budgets, they’re data-driven.
Alert Fatigue and SLOs
Common mistake: making the burn-rate alert a SEV-1 by default. Result: on-call engineers woken repeatedly by noise.
Better pattern:
- Multi-window multi-burn-rate: alerts matter only if burn is persistent.
- SEV levels: extreme burn rate = SEV-1, moderate burn = next-day ticket.
- Ticketing instead of pager for non-critical cases.
The goal is to alert when real action is needed, not on every spike.
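Put together, the severity routing looks something like this. The thresholds follow the common SRE Workbook ladder; the function itself is a sketch, not a real alertmanager config:

```python
def classify(burn_5m: float, burn_1h: float,
             burn_30m: float, burn_6h: float) -> str:
    """Route a burn-rate reading to a response, multi-window style."""
    if burn_1h > 14.4 and burn_5m > 14.4:
        return "page"        # ~2% of budget per hour, and still burning now
    if burn_6h > 6.0 and burn_30m > 6.0:
        return "page"        # slower but sustained burn
    if burn_6h > 1.0:
        return "ticket"      # investigate next business day
    return "ok"

print(classify(20, 16, 2, 1.5))      # fast burn, confirmed -> page
print(classify(0.5, 0.4, 0.8, 1.2))  # slow sustained burn -> ticket
```

The short window in each pair confirms the problem is still happening, which is what keeps a recovered spike from paging anyone.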
Tools
Typical SLO stacks:
- Prometheus + Grafana + recording rules: DIY, flexible, requires work.
- Sloth: recording rule and alert generator from a simple YAML.
- Pyrra: SLO as code + native UI.
- Datadog SLOs: integrated, easy, but vendor lock-in.
- Google Cloud Service Monitoring: for GCP-native.
- OpenSLO: a proposed vendor-neutral spec; adoption is growing.
For small teams, Sloth + Prometheus is the sweet spot.
Anti-Patterns
Things I’ve seen break SLOs:
- Aspirational SLOs without realism: 99.99% when service is really at 99%. Budget always consumed, policy ignored.
- Not respecting freeze when exhausted: destroys mechanism credibility.
- SLOs without clear owner: nobody maintains, they go stale.
- Too many SLOs: 20 SLOs per team = none gets attention.
- Manipulable SLIs: “500s don’t count if from that endpoint” — gaming destroys meaning.
Quarterly Review
SLOs aren’t static:
- Each quarter, review SLOs with quarter data.
- SLO too lax? (budget always available): tighten.
- SLO too strict? (always exhausted): relax or invest in architecture.
- SLI still represents experience? If product changed, SLI may have stopped correlating.
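The first pass of that review can be mechanical: flag SLOs that were always comfortable or always blown over the quarter. A sketch; the 25% cutoff is a judgment call of mine, not a standard:

```python
def review(monthly_budget_consumed: list[float]) -> str:
    """Input: fraction of error budget consumed in each month of the quarter."""
    if all(m < 0.25 for m in monthly_budget_consumed):
        return "too lax: consider tightening the SLO"
    if all(m > 1.0 for m in monthly_budget_consumed):
        return "too strict: relax, or invest in architecture"
    return "keep: re-check that the SLI still tracks user experience"

print(review([0.10, 0.05, 0.20]))  # -> too lax: consider tightening the SLO
print(review([1.30, 1.10, 2.00]))  # -> too strict: relax, or invest in architecture
```

The human part of the review (does the SLI still correlate with an annoyed user?) stays manual.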
Conclusion
SLOs and error budgets work when applied rigorously, not as ornamental documentation. The test is simple: do decisions change based on the budget? If yes, the system works. If not, it’s theatre. Start with 2-3 well-chosen SLIs, a clear freeze policy, and simple tooling (Prometheus + Sloth). Sophistication comes later; first, respect the basics.
Follow us on jacar.es for more on SRE, observability, and service reliability.