
Real-World SRE
By :

I worked as an SRE at Google for four years, and that is where I started specializing, moving away from being a full stack engineer, and instead considering myself an SRE. Google had lots of internal education courses, and when I left, I found it difficult to continue my education. I also quickly discovered that SRE at Google is a very different beast than SRE at much smaller organizations. I decided to write this book for people interested in starting with SRE or applying it to organizations that are much smaller than Google.
To do this, the book is broken up into two parts. The first eight chapters walk through the hierarchy of reliability. This hierarchy was originally designed by Mikey Dickerson of the United States Digital Service (and– surprise, surprise –Google). The hierarchy says that as you are trying to add reliability to a system, you need to walk through each level before you get to the next one.
The following diagram shows a slightly modified version of Mikey's original pyramid. I have updated it to include the all-encompassing aspect of communication:
Figure 2: This seven-layer pyramid is encircled with communication. Each layer builds upon and needs the previous layer. It is surrounded by communication because each layer needs communication to succeed.
Let us walk through the layers as a preview of what you can expect in each chapter.
Some engineers will have had bad experiences and will not think monitoring is worth the investment, whereas others will have religious zealotry toward certain tools, and some will just ignore you. This chapter will help you to navigate all of these competing opinions and find and create the implementation that is best for your project and team.
Individuals should not be afraid of incidents, but rather feel confident that if an incident happens, the team will respond and improve the system for the future, instead of focusing on the shame and anger that can come with failure. Incidents are things to learn from, not things to be afraid and ashamed of!
Nori Heikkinen, an SRE at Google with many years of experience, adds that "the hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs in the hierarchy must be addressed within an organization before prevention can be examined." (https://www.infoq.com/news/2015/06/too-big-to-fail)
The last two chapters of this book are a cheat section and introduction to common useful topics.
Change the font size
Change margin width
Change background colour