Applying Google’s SRE Book Without Being Google


Google’s SRE book, published in 2016, has become the de facto reference for reliability engineering. But it was written for Google: thousands of engineers, in-house datacenters, internal tools. Applying it literally in a 10-person team running everything on AWS produces friction and sometimes frustration. The good news is that most of the value (roughly 80%) comes from portable principles, not from Google-specific infrastructure.

Principles That Travel Well

Five ideas from the book that work in almost any context:

  • SLO as a user contract. Define measurable service-level objectives (99.9% successful requests, p95 latency < 500ms). It’s the difference between “the system is doing well” and “the system meets the promise made to the customer”.
  • Error budget as permission to take risks. If your SLO is 99.9% and the month is running at 99.95%, you have used about half of the allowed failures and still have budget for experiments. If you’re close to 99.9%, stop deploying and stabilise. It’s a dialogue mechanism between product and operations.
  • Blameless postmortem. After an incident, analyse what failed in processes and systems, not in people. This culture is built with explicit practices (a shared template, a judgement-free meeting), not with good intentions alone.
  • Toil management. Toil = repetitive manual work, automatable, with no lasting value. The book proposes a ceiling: <50% of team time. Measuring and systematically attacking it is perhaps the most valuable cultural change.
  • Humane on-call. Reasonable rotations (no more than 1 week in every 6), explicit compensation, and recovery time after an incident. It’s a labour right, not a rite of passage.
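To make the SLO and error-budget arithmetic concrete, here is a minimal Python sketch. It assumes a request-based availability SLO; the class name `ErrorBudget` and the request counts are invented for illustration and do not come from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""

    slo_target: float      # e.g. 0.999 for a 99.9% SLO
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def availability(self) -> float:
        """Fraction of successful requests in the window."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """How many failures the SLO tolerates for this traffic volume."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_spent(self) -> float:
        """Fraction of the budget consumed (1.0 means exhausted)."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / allowed


# Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
month = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=500)
print(f"availability: {month.availability:.4%}")
print(f"budget spent: {month.budget_spent:.0%}")
```

With these made-up numbers the month sits at 99.95% availability and has used about half of its budget, which matches the example in the error-budget bullet above.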

Where Adaptation Is Needed

These are the areas where a literal reading of the book clashes with small teams:

Strict error budget policy

At Google, exhausting the budget stops all feature deployment and only reliability releases happen until recovery. In a 10-person team this can be too binary. A more practical policy: at 50% budget spent, increase review rigor; at 80%, block big features; at 100%, only critical bugfixes.
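The graduated policy above can be written down as a small function so that nobody has to argue about thresholds during an incident. This is a sketch only: the function name `deployment_policy` and the returned labels are assumptions, and the thresholds are the 50%, 80% and 100% figures from the paragraph.

```python
def deployment_policy(budget_spent: float) -> str:
    """Map the fraction of error budget spent to a release rule.

    budget_spent is a fraction: 0.5 means half of the budget is gone,
    1.0 or more means the budget is exhausted.
    """
    if budget_spent < 0:
        raise ValueError("budget_spent cannot be negative")
    if budget_spent >= 1.0:
        return "critical bugfixes only"
    if budget_spent >= 0.8:
        return "block big features"
    if budget_spent >= 0.5:
        return "increase review rigor"
    return "normal deployment"


for spent in (0.2, 0.5, 0.85, 1.1):
    print(f"{spent:.0%} spent -> {deployment_policy(spent)}")
```

Keeping the rule in code (or in a short written policy) matters more than the exact thresholds, which each team should tune to its own release cadence.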

Dedicated roles

Google has dedicated SRE teams separate from dev teams. In small teams, both are the same people. The book assumes a separation that rarely exists outside big companies. The principles work equally well in a mixed team with explicit roles and rotations, without splitting people into separate groups.

Tool scale

Google’s internal tools (Borgmon, Monarch, Dapper) have decent open-source equivalents (Prometheus, OpenTelemetry, Grafana), but maturity and ergonomics aren’t equivalent. Adapt the ideas without trying to clone the tools, and frustration drops.

Formal capacity planning

The book dedicates chapters to proactive capacity planning across dozens of datacenters. In AWS with autoscaling and serverless, the provider solves most of the problem. Reading those chapters helps understand the phenomenon, not necessarily apply it verbatim.

A Realistic Starting Roadmap

Three phases that work in 5-30-person teams:

Phase 1 (1-2 months): Basic SLOs + postmortem

Choose 2-3 critical services. Define a simple SLO per service (availability + latency). Build Grafana dashboards that show error-budget burn rate. Introduce a blameless postmortem template. After the first incident handled with that process, review how it went.
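As a rough illustration of what a burn-rate panel computes, the sketch below uses the common definition of burn rate as the observed error rate divided by the error rate the SLO allows. The function names, the 30-day window and the sample error rate are assumptions for the example, not values from the book.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the rate the SLO allows.

    A value of 1.0 means the budget would last exactly one SLO window;
    2.0 means it would be gone in half that time.
    """
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed_error_rate


def hours_until_exhausted(rate: float, window_hours: float = 30 * 24) -> float:
    """Time a full budget would last if the current burn rate held steady."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


# Hypothetical reading: 0.2% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.002, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"a full 30-day budget would last about {hours_until_exhausted(rate):.0f} hours")
```

In practice the error rate would come from the monitoring stack (for example a Prometheus query feeding Grafana); the arithmetic is the same regardless of where the number comes from.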

Phase 2 (3-6 months): Toil reduction + on-call structure

Measure toil (repeated Slack questions, manual deployments, copy-paste between dashboards). Automate the top three sources rather than everything: pick the ones where the time saved clearly outweighs the cost of building the automation. Formalise the on-call rotation with compensation and rules.
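One simple way to pick the top three is to estimate hours lost per month for each toil source and sort. The sketch below does exactly that; the `ToilItem` class, the task names and every number in it are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilItem:
    name: str
    minutes_per_occurrence: float
    occurrences_per_month: int

    @property
    def hours_per_month(self) -> float:
        """Estimated team time this task consumes each month."""
        return self.minutes_per_occurrence * self.occurrences_per_month / 60


def top_toil(items: list[ToilItem], n: int = 3) -> list[ToilItem]:
    """Return the n most expensive toil sources, largest first."""
    return sorted(items, key=lambda item: item.hours_per_month, reverse=True)[:n]


# Hypothetical inventory gathered over one month.
inventory = [
    ToilItem("answering repeated Slack questions", 10, 60),
    ToilItem("manual deployment checklist", 45, 12),
    ToilItem("copy-paste between dashboards", 5, 40),
    ToilItem("rotating credentials by hand", 30, 2),
]

for item in top_toil(inventory):
    print(f"{item.hours_per_month:5.1f} h/month  {item.name}")
```

Hours per month is only a proxy for return on investment. A fuller estimate would also weigh the effort of building each automation, which this sketch deliberately leaves out.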

Phase 3 (6-12 months): Mature alerting + culture

Review all alerts against the principles covered in “writing alerts that won’t be ignored”. Treat postmortems as living documents, not buried PDFs. Use on-call retros to improve documentation and runbooks.

The Book’s Role in Team Formation

A common practice is to read the book together in a book club, 1-2 chapters per week. Not every chapter is equally relevant. The ones with the highest value outside Google are:

  • Chapters 3 (Embracing Risk), 4 (Service Level Objectives), 5 (Eliminating Toil).
  • Chapters 11 (Being On-Call), 13 (Emergency Response), 15 (Postmortem Culture).
  • Chapters 16 (Tracking Outages), 17 (Testing for Reliability).

Chapters 18-20 (Cluster Management, Storage, Network) are academically interesting but hard to apply without a dedicated infrastructure army.

As a complement to the book, the SRE Workbook offers practical exercises and is more accessible for small teams.

Also see agile methodologies and their evolution to understand how SRE fits with continuous-delivery practices.

Conclusion

SRE as a discipline contributes solid principles: measure, define user contracts, analyse failures without blame, automate the repetitive, treat on-call with respect. Those principles work in any team. What doesn’t translate is Google’s infrastructure, and that’s fine, because it isn’t needed. Adapt ideas, not implementations.

