Applying Google’s SRE Book Without Being Google
Table of contents
- Key Takeaways
- Principles That Travel Well
- Where Adaptation Is Needed
- Strict error budget policy
- Dedicated roles
- Tool scale
- Formal capacity planning
- A Realistic Starting Roadmap
- Phase 1 (1–2 months): Basic SLOs + postmortem
- Phase 2 (3–6 months): Toil reduction + on-call structure
- Phase 3 (6–12 months): Mature alerting + culture
- The Book’s Role in Team Formation
- Conclusion
Updated: 2026-05-03
Google’s SRE book[1], published in 2016, has become the de facto reference for reliability engineering. But it’s written for Google: thousands of engineers, in-house datacenters, internal tools. Applying it literally in a 10-person team running everything on AWS produces friction, sometimes frustration. The good news is 80% of the value comes from portable principles, not Google-specific infrastructure.
Key Takeaways
- Five principles from the book translate to almost any context: SLOs, error budgets, blameless postmortems, toil management, and humane on-call.
- Areas that need adapting for small teams: binary error budget policy, dedicated roles, internal tools, and formal capacity planning.
- A practical three-phase roadmap covers 80% of the value in 6–12 months.
- Most relevant chapters outside Google: 3, 4, 5, 11, 13, 15, 16, and 17.
- The SRE Workbook complements the book with more accessible exercises for small teams.
Principles That Travel Well
Five ideas from the book that work in almost any context:
- SLO as a user contract. Define measurable service-level objectives (99.9% successful requests, p95 latency < 500ms). It’s the difference between “the system is doing well” and “the system meets the promise made to the customer”.
- Error budget as permission to take risks. If your SLO is 99.9% and the month is at 99.95%, you have budget for experiments. If you’re near 99.9%, stop deploying and stabilise. It’s a dialogue mechanism between product and operations.
- Blameless postmortem. After an incident, analyse what failed in processes and systems, not in people. This culture is built with explicit practices — shared template, judgement-free meeting — not with good intentions alone.
- Toil management. Toil = repetitive manual work, automatable, with no lasting value. The book proposes a ceiling: <50% of team time. Measuring and systematically attacking it is perhaps the most valuable cultural change.
- Humane on-call. Reasonable rotations (no more than 1 week in every 6), explicit compensation, and recovery time after incidents. It’s a labour right, not a rite of passage.
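The error-budget arithmetic in the list above can be made concrete with a minimal sketch. The function name `error_budget_report` and the 30-day window are assumptions for illustration, not something the book prescribes; the 99.9% and 99.95% figures are the ones used in the example above.

```python
def error_budget_report(slo: float, observed: float, window_days: int = 30) -> dict:
    """Summarise error-budget use for an availability SLO.

    slo and observed are fractions (0.999 means 99.9%).
    window_days is an assumed evaluation window, not a value from the book.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    budget = 1.0 - slo                  # failure fraction the SLO allows
    spent = max(0.0, 1.0 - observed)    # failure fraction observed so far
    window_minutes = window_days * 24 * 60
    return {
        # Budget expressed as minutes of full outage over the window.
        "budget_minutes": budget * window_minutes,
        "spent_fraction": spent / budget,
        "remaining_fraction": 1.0 - spent / budget,
    }


report = error_budget_report(slo=0.999, observed=0.9995)
print(f"budget: {report['budget_minutes']:.1f} min, "
      f"spent: {report['spent_fraction']:.0%}")
```

With a 99.9% SLO and 99.95% observed, about half the budget is still available, which is the "room for experiments" the principle describes.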
Where Adaptation Is Needed
Areas where the book’s literality clashes with small teams:
Strict error budget policy
At Google, exhausting the budget stops all feature deployment and only reliability releases happen until recovery. In a 10-person team this can be too binary. A more practical policy:
- At 50% budget spent: increase review rigor.
- At 80%: block big features.
- At 100%: only critical bugfixes.
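The graduated policy above is simple enough to encode and display next to the dashboard. This is a sketch only: the function name `deployment_policy` and the "normal deployment" label for the range below 50% are assumptions, while the three thresholds come from the list above.

```python
def deployment_policy(budget_spent: float) -> str:
    """Map the fraction of error budget spent (0.0 to 1.0+) to a policy tier."""
    if budget_spent < 0:
        raise ValueError("budget_spent cannot be negative")
    if budget_spent >= 1.0:
        return "only critical bugfixes"
    if budget_spent >= 0.8:
        return "block big features"
    if budget_spent >= 0.5:
        return "increase review rigor"
    return "normal deployment"  # assumed label; the list above does not name this tier


for spent in (0.3, 0.5, 0.85, 1.2):
    print(f"{spent:.0%} spent -> {deployment_policy(spent)}")
```

Checking the highest threshold first keeps each tier a single comparison and makes it easy to add or move a threshold later.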
Dedicated roles
Google has dedicated SRE teams separate from dev teams. In small teams, both are the same people. The principles work equally well in a mixed team with explicit roles and rotations, without splitting people into separate groups.
Tool scale
Borgmon, Monarch, Dapper — Google’s internal tools have decent open-source equivalents (Prometheus[2], OpenTelemetry[3], Grafana[4]), but maturity and ergonomics aren’t equivalent. Adapt ideas without trying to clone tools, and frustration drops.
Formal capacity planning
The book dedicates chapters to proactive capacity planning across dozens of datacenters. In AWS with autoscaling and serverless, the provider solves most of the problem. Reading those chapters helps understand the phenomenon; applying them verbatim isn’t necessary.
A Realistic Starting Roadmap
Three phases that work in 5–30-person teams:
Phase 1 (1–2 months): Basic SLOs + postmortem
Choose 2–3 critical services. Define a simple SLO per service (availability + latency). Build Grafana dashboards that show error-budget burn rate. Introduce a blameless postmortem template. After the first incident handled with that process, review how it went.
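For the burn-rate panel, it helps to agree on the definition first. A common one is the observed error ratio divided by the error ratio the SLO allows, so 1.0 means the budget runs out exactly at the end of the window. The sketch below assumes that definition, a 30-day window, and a full budget at the start; the function names and the request counts are hypothetical.

```python
import math


def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days a full budget would last at a constant burn rate (assumed window)."""
    return math.inf if rate <= 0 else window_days / rate


# Hypothetical counts: 30 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(failed=30, total=10_000, slo=0.999)
print(f"burn rate {rate:.1f}, full budget lasts {days_to_exhaustion(rate):.0f} days")
```

In practice the counts would come from whatever metrics backend feeds the dashboard; the arithmetic stays the same.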
Phase 2 (3–6 months): Toil reduction + on-call structure
Measure toil (repeated Slack questions, manual deployments, copy-paste between panels). Attack the top three with automation: not everything, just the items where the time saved clearly repays the effort. Formalise the on-call rotation with compensation and rules.
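One way to pick the top three is to rank toil items by payback period: estimated automation effort divided by the hours the task costs each month. The sketch below assumes that metric; the `ToilItem` type, the task names, and every number in it are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    hours_per_month: float   # time the team currently spends on the task
    automation_hours: float  # rough one-off estimate to automate it

    @property
    def payback_months(self) -> float:
        """Months until the automation effort is repaid by time saved."""
        return self.automation_hours / self.hours_per_month


def top_candidates(items: list[ToilItem], limit: int = 3) -> list[ToilItem]:
    """Return the items with the shortest payback period."""
    recurring = [item for item in items if item.hours_per_month > 0]
    return sorted(recurring, key=lambda item: item.payback_months)[:limit]


# Hypothetical measurements from one month of tracking.
backlog = [
    ToilItem("manual deployments", hours_per_month=20, automation_hours=40),
    ToilItem("repeated Slack questions", hours_per_month=10, automation_hours=8),
    ToilItem("copy-paste between panels", hours_per_month=6, automation_hours=30),
    ToilItem("certificate renewal", hours_per_month=2, automation_hours=16),
]

for item in top_candidates(backlog):
    print(f"{item.name}: pays back in {item.payback_months:.1f} months")
```

The estimates will be rough, which is fine: the goal is an ordering the team can argue about, not a precise forecast.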
Phase 3 (6–12 months): Mature alerting + culture
Review all alerts against symptom-oriented alerting principles: page on what users experience, not on every internal cause. Keep postmortems as living documents, not buried PDFs. Use on-call retros to improve documentation and runbooks. For organisations in scope of the EU’s NIS2 directive, which imposes incident-reporting obligations, this level of incident documentation covers much of what is needed as a by-product of the SRE process.
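A symptom-oriented page can be as small as a check on the error ratio users actually see. The sketch below uses a two-window burn-rate check, a technique described in the SRE Workbook: page only when both a long and a short window burn budget fast, so a problem that has already stopped does not wake anyone. The function names, the window counts, and the default threshold of 14.4 (2% of a 30-day budget consumed in one hour) are assumptions to tune, not fixed rules.

```python
def error_ratio(failed: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total > 0 else 0.0


def should_page(long_window: tuple[int, int],
                short_window: tuple[int, int],
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    Each window is a (failed, total) pair of request counts, for example
    the last hour and the last five minutes (assumed window sizes).
    """
    allowed = 1.0 - slo
    long_burn = error_ratio(*long_window) / allowed
    short_burn = error_ratio(*short_window) / allowed
    return long_burn >= threshold and short_burn >= threshold


# Hypothetical counts: errors are high in both windows, so this pages.
print(should_page(long_window=(20, 1000), short_window=(3, 100), slo=0.999))
# Errors stopped in the short window, so this does not page.
print(should_page(long_window=(20, 1000), short_window=(0, 100), slo=0.999))
```

Alerts that cannot be expressed this way, as a user-visible symptom with a threshold, are good candidates for demotion to a dashboard or a ticket during the review.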
The Book’s Role in Team Formation
A common practice: shared reading of the book in book clubs of 1–2 chapters per week. Highest-value chapters outside Google:
- Chapters 3 (Embracing Risk), 4 (Service Level Objectives), 5 (Eliminating Toil).
- Chapters 11 (Being On-Call), 13 (Emergency Response), 15 (Postmortem Culture).
- Chapters 16 (Tracking Outages), 17 (Testing for Reliability).
Chapters 18–20 (Software Engineering in SRE, Load Balancing at the Frontend, Load Balancing in the Datacenter) are academically interesting but hard to apply without a dedicated infrastructure army.
Complementary to the book: the SRE Workbook[5] has practical exercises — more accessible for small teams.
Conclusion
SRE as a discipline contributes solid principles: measure, define user contracts, analyse failures without blame, automate the repetitive, treat on-call with respect. Those principles work in any team. What doesn’t translate is Google’s infrastructure — and that’s fine, because it’s not needed. Adapt ideas, not implementations.