The Site Reliability Workbook: patterns we still use
Updated: 2026-05-03
The Site Reliability Workbook turns seven and remains the second book I recommend when someone joins an operations team. The first is the original SRE book (the blue one), but the Workbook is the one people open the next day, when they have an incident and need a template. After seven years of using it in production, I’ve seen which chapters hold up under real small-team use and which only make sense at Mountain View campus scale.
Key takeaways
- Honest SLOs (slightly below what you actually achieve) are more useful than aspirational ones: they give real error budget to spend and negotiate with.
- A rolling 28 or 30-day window keeps the last day of the month from becoming an odd day on which no one wants to deploy.
- A blameless postmortem doesn’t mean no accountability; it means not punishing the person who ran the command. Without accountability there is no follow-up.
- Error budget works as negotiating currency with product only when it’s written as a three-level policy agreed in advance.
- In teams of two or three people, complex on-call is over-engineering; the simple policy of “whoever deployed on Friday answers pages over the weekend” works better.
Honest SLOs over aspirational SLOs
The SLO implementation chapter remains the best I know. Its core idea, measuring what the user feels rather than what the server reports, is still not internalised widely enough. An SLO computed from backend response codes, measured before the CDN, the user’s network and the browser come into play, lies with striking confidence.
In practice I’ve ended up doing the opposite of what many tutorials recommend. Instead of defining a sky-high aspirational SLO and missing it every month, I prefer defining an honest SLO, slightly below what I actually achieve, and having real error budget to spend.
- A 99.99 % SLO you miss 40 % of the time gives you no room for anything.
- A 99.5 % one you almost always meet gives you a measurable 3.6 hours of budget per month, and that becomes the currency you negotiate with product.
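The arithmetic behind those two bullets is easy to check. A minimal sketch, assuming a 30-day window; the function name `error_budget_hours` is mine, not something the book defines:

```python
def error_budget_hours(slo_percent: float, window_days: int = 30) -> float:
    """Permitted unavailability, in hours, for an SLO over a window of days."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    window_hours = window_days * 24
    # The budget is the share of the window the SLO allows us to fail.
    return window_hours * (100 - slo_percent) / 100


for slo in (99.5, 99.99):
    hours = error_budget_hours(slo)
    print(f"{slo} % over 30 days: {hours:.2f} h ({hours * 60:.1f} min)")
```

At 99.99 % the same window leaves a little over four minutes, which is why an SLO at that level leaves no room for anything.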
The one concrete technique from the book that I follow to the letter is the 28 or 30-day rolling window instead of the calendar month. Because the budget never resets on a fixed date, the last day of the month stops being an odd day on which no one wants to deploy.
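A rolling window can be sketched as a fixed-length queue of daily counts. This is an illustration under my own assumptions (one `(good, total)` pair of request counts per day, and a hypothetical `RollingSLI` class), not the book’s implementation:

```python
from collections import deque


class RollingSLI:
    """Availability over the last `window_days` days of request counts."""

    def __init__(self, window_days: int = 28) -> None:
        # With maxlen set, appending a new day drops the oldest one.
        self.days: deque[tuple[int, int]] = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def availability(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        # No traffic recorded yet: treat the SLI as fully met.
        return good / total if total else 1.0
```

There is no reset date anywhere in this structure: a bad day weighs on the number for exactly `window_days` days and then falls out on its own.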
Blameless postmortems, but with action
The postmortem chapter has aged well. The idea of a blameless postmortem isn’t to avoid naming who was responsible; it’s to avoid punishing the person who ran the command. The nuance matters because without accountability there is no follow-up.
What we’ve changed in practice is the format. The canonical template has timeline, summary, root cause, actions and learnings. In small teams it ends up being too much process for minor incidents. We’ve shortened it to four fields:
- What happened in plain language.
- Which metric detected it or why we didn’t detect it sooner.
- What we’ll do differently.
- By when.
Three paragraphs and two actions with dates are more useful than six pages no one rereads.
The blameless golden rule: if one person could break it, we’ll break it again. The action isn’t to fire that person; it’s to turn the operation into something a tired human cannot break by accident: pre-validation, confirmation for destructive operations, dry-run by default.
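Those three guards fit in a few lines. A minimal sketch with hypothetical names (`purge`, the `delete` callback and the confirmation phrase are all assumptions for illustration):

```python
from typing import Callable, Iterable


def purge(
    targets: Iterable[str],
    delete: Callable[[str], None],
    *,
    execute: bool = False,
    confirm: str = "",
) -> list[str]:
    """Delete targets, but only when explicitly asked and confirmed."""
    items = list(targets)
    # Pre-validation: an empty list usually means a broken filter upstream.
    if not items:
        raise ValueError("refusing to purge an empty target list")
    # Dry-run by default: report what would happen and touch nothing.
    if not execute:
        return [f"would delete {t}" for t in items]
    # Confirmation: the caller must state how many items are about to go.
    expected = f"delete {len(items)}"
    if confirm != expected:
        raise PermissionError(f"pass confirm={expected!r} to proceed")
    for t in items:
        delete(t)
    return [f"deleted {t}" for t in items]
```

Calling `purge(["a", "b"], delete_fn)` only lists what would be removed; the destructive path needs both `execute=True` and `confirm="delete 2"`.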
Error budget as real currency
The error budget is the pattern I use every week. With a 99.5 % SLO over a 30-day window, I have 3.6 hours of permitted unavailability (0.5 % of 720 hours). If an incident has already consumed 3 of those hours, I have 36 minutes left, which is little margin for deploying a risky version.
The part the book describes well, and that is hard to apply, is using the budget as a signal for product, not only for operations. When the budget runs out, there must be a policy known to everyone that stops functional deploys. If that isn’t agreed with product beforehand, the conversation becomes impossible in the middle of an incident.
In my experience this agreement works better written as three levels:
- Green: free deploy.
- Amber: deploy only with review and small changes.
- Red: functional freeze until the next period.
Two thresholds, three states, zero ambiguity.
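As an illustration, the three levels reduce to one small function. The red threshold at 100 % follows from “when the budget runs out”; the amber threshold at 50 % of budget consumed is my assumption here, and the real value is whatever you agree with product:

```python
from enum import Enum


class DeployPolicy(Enum):
    GREEN = "free deploy"
    AMBER = "deploy only with review and small changes"
    RED = "functional freeze until the next period"


def deploy_policy(
    consumed_hours: float,
    budget_hours: float,
    amber_at: float = 0.5,  # assumed threshold: half the budget spent
    red_at: float = 1.0,    # budget exhausted
) -> DeployPolicy:
    """Map error budget consumption to one of the three agreed states."""
    if budget_hours <= 0:
        raise ValueError("budget_hours must be positive")
    used = consumed_hours / budget_hours
    if used >= red_at:
        return DeployPolicy.RED
    if used >= amber_at:
        return DeployPolicy.AMBER
    return DeployPolicy.GREEN


# 3 hours consumed out of a 3.6-hour budget
print(deploy_policy(3.0, 3.6).value)
```

With 3 of 3.6 hours consumed, this sketch lands in amber: deploys continue, but only small and reviewed ones.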
Load management and graceful degradation
The load management chapter is where I deviate most from the book. The patterns it describes (priority queues, global rate limiters, shedding techniques) are correct, but they are designed for high-traffic, massively distributed systems.
What does work for small teams is the idea of graceful degradation: when a subsystem starts failing, you need to have decided ahead of time what the acceptable degraded state is.
- If the secondary database stops responding, the application continues in read-only mode.
- If the recommendation service goes down, the page shows popular content instead of breaking.
These decisions are cheap to implement when made ahead of time, and very expensive to make in the middle of an incident. The same principle applies to FinOps for AI: having thresholds and responses defined before excessive spend happens.
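The recommendation example can be sketched as a fallback decided in advance. The names here (`recommendations`, `fetch`, `popular`) are hypothetical, and a real version would catch narrower exceptions and log the failure:

```python
from typing import Callable


def recommendations(
    user_id: str,
    fetch: Callable[[str], list[str]],
    popular: list[str],
) -> tuple[list[str], bool]:
    """Return (items, degraded): personalised items, or popular content."""
    try:
        return fetch(user_id), False
    except Exception:
        # The degraded state was chosen beforehand: show popular content
        # instead of breaking the page.
        return popular, True
```

Returning the `degraded` flag alongside the items lets the caller count how often the fallback is served, which is itself a useful signal.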
What we haven’t adopted
Not everything in the Workbook has served us:
- The on-call management chapter, with its complex rotations, compensation and strict primary-secondary split, assumes a team of at least six with a budget for shift compensation. In teams of two or three people, the simple policy that whoever deployed on Friday afternoon is the first to be paged works better. There is no formal compensation, but there are also no 18:00 Friday deploys.
- The quarterly full disaster recovery exercise is too ambitious for us to sustain. What we do instead is a monthly half-hour tabletop exercise: one team member raises a scenario and the others say what they’d do. Nothing actually falls over, but it forces us to review runbooks and find documentation gaps.
When it pays to return to the book
The Workbook stays on the shelf because there are three moments when going back to its table of contents is worth it:
- When someone new joins the team and needs to understand why we measure what we measure.
- When an incident goes badly and the retrospective needs a shared frame.
- When someone from product asks why we can’t deploy faster, and a common language is needed to discuss reliability and velocity.
What the book does better than any other reference is provide vocabulary. Error budget, SLI, SLO, interactive versus asynchronous workload, partial failure. Without that shared vocabulary, reliability meetings collapse into personal impressions.
My recommendation: don’t read it cover to cover. Start with the SLO, postmortem and error budget chapters, apply them for three months, and come back to the book afterwards for the more advanced patterns. Follow its table of contents with respect, but not devotion.