Applying Google’s SRE Book Without Being Google

July 28, 2023

Updated: 2026-05-03

Google’s SRE book^[1], published in 2016, has become the de facto reference for reliability engineering. But it was written about Google: thousands of engineers, in-house datacenters, internal tools. Applying it literally in a 10-person team running everything on AWS produces friction and sometimes frustration. The good news is that most of the value, roughly 80% as a rule of thumb, comes from portable principles rather than Google-specific infrastructure.

Key Takeaways

Five principles from the book translate to almost any context: SLOs, error budgets, blameless postmortems, toil management, and humane on-call.
Four areas don’t scale down to small teams without adaptation: binary error budget policy, dedicated roles, internal tools, and formal capacity planning.
A practical three-phase roadmap covers 80% of the value in 6–12 months.
Most relevant chapters outside Google: 3, 4, 5, 11, 13, 15, 16, and 17.
The SRE Workbook complements the book with more accessible exercises for small teams.

Principles That Travel Well

Five ideas from the book that work in almost any context:

SLO as a user contract. Define measurable service-level objectives (99.9% successful requests, p95 latency < 500ms). It’s the difference between “the system is doing well” and “the system meets the promise made to the customer”.
Error budget as permission to take risks. If your SLO is 99.9% and the month is running at 99.95%, you have spent about half of the allowed failures and still have budget for experiments. If you’re close to 99.9%, stop deploying and stabilise. It’s a dialogue mechanism between product and operations.
Blameless postmortem. After an incident, analyse what failed in processes and systems, not in people. This culture is built with explicit practices — shared template, judgement-free meeting — not with good intentions alone.
Toil management. Toil is repetitive, manual, automatable work with no lasting value. The book proposes a ceiling: toil should take less than 50% of team time. Measuring it and attacking it systematically is perhaps the most valuable cultural change.
Humane on-call. Reasonable rotations (no more than 1 week in every 6), explicit compensation, and recovery time after incidents. It’s a labour right, not a rite of passage.
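The error-budget arithmetic behind the 99.9% / 99.95% example can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `budget_spent` and the request counts are assumptions made up for the example.

```python
def budget_spent(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget consumed (0.5 = half, 1.0 = exhausted)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0  # no traffic in the period, so nothing has been spent
    # The budget is the number of failures the SLO tolerates over the period.
    allowed_failures = (1 - slo_target) * total_requests
    return failed_requests / allowed_failures


# Hypothetical month: 99.9% SLO, 1,000,000 requests, 500 failures (99.95% success).
spent = budget_spent(0.999, 1_000_000, 500)
print(f"budget spent: {spent:.0%}")
```

With these numbers the month has used about half of its budget, which is the situation where there is still room for experiments. A result of 1.0 or more means the SLO has been missed for the period.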

Figure: continuous delivery process, a core process in SRE teams that practise frequent deployment and error budgets.

Where Adaptation Is Needed

Areas where a literal reading of the book clashes with small teams:

Strict error budget policy

At Google, exhausting the budget stops all feature deployment and only reliability releases happen until recovery. In a 10-person team this can be too binary. A more practical policy:

At 50% budget spent: increase review rigor.
At 80%: block big features.
At 100%: only critical bugfixes.
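The graduated policy above maps directly onto a small lookup. The sketch below uses the three thresholds from the list; the function name `deployment_policy` and the label for the state below 50% are assumptions added for illustration.

```python
def deployment_policy(budget_spent: float) -> str:
    """Map the fraction of error budget spent (1.0 = 100%) to a deployment rule."""
    if budget_spent >= 1.0:
        return "only critical bugfixes"
    if budget_spent >= 0.8:
        return "block big features"
    if budget_spent >= 0.5:
        return "increase review rigor"
    # Below 50% the list above sets no restriction; this label is an assumption.
    return "normal delivery"


for spent in (0.3, 0.5, 0.85, 1.1):
    print(f"{spent:.0%} spent -> {deployment_policy(spent)}")
```

Keeping the rule in code or configuration makes the policy something the team can display on a dashboard or check in a deployment pipeline, instead of a decision renegotiated during each incident.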

Dedicated roles

Google has dedicated SRE teams separate from dev teams. In small teams, both are the same people. The principles work just as well in a mixed team with explicit roles and rotations, without splitting people into separate groups.

Tool scale

Google’s internal tools (Borgmon, Monarch, Dapper) have decent open-source counterparts (Prometheus^[2], OpenTelemetry^[3], Grafana^[4]), but their maturity and ergonomics differ. Adapt the ideas rather than trying to clone the tools, and frustration drops.

Formal capacity planning

The book dedicates chapters to proactive capacity planning across dozens of datacenters. On AWS with autoscaling and serverless, the provider solves most of the problem. Reading those chapters helps you understand the problem; applying them verbatim isn’t necessary.

A Realistic Starting Roadmap

Three phases that work in 5–30-person teams:

Phase 1 (1–2 months): Basic SLOs + postmortem

Choose 2–3 critical services. Define a simple SLO per service (availability + latency). Build Grafana dashboards that show the error-budget burn rate. Introduce a blameless postmortem template. After the first incident handled with that process, review how it went.
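Burn rate is the ratio between the observed error rate and the error rate the SLO allows: a value of 1 spends the budget exactly over the SLO window, and a value of 2 spends it in half that time. A minimal sketch of the calculation a dashboard would plot, assuming a 30-day window and hypothetical names (`Slo`, `burn_rate`, `days_until_exhausted`, the `checkout` service):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float          # e.g. 0.999 for 99.9% successful requests
    window_days: int = 30  # assumed SLO window


def burn_rate(slo: Slo, errors: int, requests: int) -> float:
    """Observed error rate divided by the error rate the SLO tolerates."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1 - slo.target)


def days_until_exhausted(slo: Slo, rate: float) -> float:
    """Days a full budget would last if the current burn rate continued."""
    return float("inf") if rate <= 0 else slo.window_days / rate


checkout = Slo(name="checkout", target=0.999)
rate = burn_rate(checkout, errors=20, requests=10_000)  # 0.2% errors vs 0.1% allowed
print(f"burn rate {rate:.1f}, full budget lasts {days_until_exhausted(checkout, rate):.0f} days")
```

In practice the error and request counts would come from the metrics backend over a recent window. The projection assumes the budget starts full, so it is an upper bound once part of it has been spent.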

Phase 2 (3–6 months): Toil reduction + on-call structure

Measure toil (repeated Slack questions, manual deployment procedures, copy-paste between panels). Attack the top three with automation: not everything, only the items where the time saved is at least double the cost of automating. Formalise the on-call rotation with compensation and rules.
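One way to choose which toil to automate first is to rank each item by a rough return on investment: hours saved over a fixed horizon divided by hours needed to automate it. The sketch below assumes a 6-month horizon; the `ToilItem` fields, the sample tasks, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilItem:
    name: str
    minutes_per_occurrence: float
    occurrences_per_month: float
    automation_hours: float  # estimated one-off cost of automating the task

    def roi(self, horizon_months: int = 6) -> float:
        """Hours saved over the horizon per hour invested in automation."""
        saved_hours = (
            self.minutes_per_occurrence * self.occurrences_per_month * horizon_months / 60
        )
        return saved_hours / self.automation_hours


def top_candidates(items: list[ToilItem], limit: int = 3) -> list[ToilItem]:
    """Return the items with the highest ROI, best first."""
    return sorted(items, key=lambda item: item.roi(), reverse=True)[:limit]


backlog = [
    ToilItem("manual deploy checklist", 30, 20, 10),
    ToilItem("answering 'is it down?' in Slack", 10, 40, 40),
    ToilItem("monthly certificate renewal", 60, 4, 8),
    ToilItem("copying metrics between panels", 5, 100, 100),
]
for item in top_candidates(backlog):
    print(f"{item.name}: ROI {item.roi():.1f}")
```

Estimates like these are coarse, but they turn the choice of what to automate into a comparison the team can discuss rather than a matter of who complains loudest.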

Phase 3 (6–12 months): Mature alerting + culture

Review every alert against symptom-oriented alerting principles: alert on what users experience, not on internal causes. Treat postmortems as living documents, not buried PDFs. Use on-call retros to improve documentation and runbooks. The NIS2 directive requires this level of incident documentation, and the SRE process covers it naturally.
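One common symptom-oriented pattern is to alert on how fast the error budget is burning, checked over a long and a short window at once, so that a page fires only when the problem is both significant and still happening. The sketch below is one possible shape for that rule; the function name and the default thresholds are illustrative assumptions, to be tuned per service.

```python
def alert_severity(
    long_window_burn: float,
    short_window_burn: float,
    page_threshold: float = 14.4,   # assumed: 1 hour at this rate spends 2% of a 30-day budget
    ticket_threshold: float = 1.0,  # assumed: budget would run out within the SLO window
) -> str:
    """Classify an alert from burn rates measured over two windows."""
    # Requiring both windows avoids paging for a spike that has already ended.
    sustained = min(long_window_burn, short_window_burn)
    if sustained >= page_threshold:
        return "page"
    if sustained >= ticket_threshold:
        return "ticket"
    return "none"


print(alert_severity(long_window_burn=16.0, short_window_burn=15.0))  # both high
print(alert_severity(long_window_burn=16.0, short_window_burn=0.5))   # spike already over
print(alert_severity(long_window_burn=2.0, short_window_burn=3.0))    # slow, steady burn
```

The 14.4 figure follows from arithmetic: a 30-day window has 720 hours, so one hour at a burn rate of 14.4 consumes 14.4 / 720 = 2% of the budget. Alerts built this way track user-visible failure directly, unlike thresholds on CPU or memory.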

The Book’s Role in Team Formation

A common practice is to read the book together in a book club, at 1–2 chapters per week. The highest-value chapters outside Google:

Chapters 3 (Embracing Risk), 4 (Service Level Objectives), 5 (Eliminating Toil).
Chapters 11 (Being On-Call), 13 (Emergency Response), 15 (Postmortem Culture).
Chapters 16 (Tracking Outages), 17 (Testing for Reliability).

Chapters 18–20 (Software Engineering in SRE, Load Balancing at the Frontend, Load Balancing in the Datacenter) are academically interesting but hard to apply without a dedicated infrastructure army.

Complementary to the book: the SRE Workbook^[5] has practical exercises — more accessible for small teams.

Conclusion

SRE as a discipline contributes solid principles: measure, define user contracts, analyse failures without blame, automate the repetitive, treat on-call with respect. Those principles work in any team. What doesn’t translate is Google’s infrastructure — and that’s fine, because it’s not needed. Adapt ideas, not implementations.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.