On-Call In Action
Site Reliability Engineering Best Practices for Building Resilient Systems
About the Book
In today's "always-on" world, downtime is not an option. Your users expect seamless service, 24/7. Your business depends on it. But how do you guarantee that reliability when complex systems inevitably encounter turbulence? The answer lies in a world-class on-call capability.
"On-Call In Action" is your practical playbook for building just that. This isn't just another theoretical tome; it's a hands-on guide to navigating the high-stakes reality of modern on-call. We'll equip you with the SRE principles, incident management lifecycles, and effective alerting strategies (leveraging the Versus Incident project as our real-world example) that form the backbone of resilient operations.
In short, "On-Call In Action" is a friendly guide to making on-call work better. We'll show you:
- Why being on-call is so important.
- What to do when a problem (we call it an "incident") happens.
- How to set up good alerts so you only get called for big problems.
- How to check if your services are running well (using simple goals).
- How to learn from mistakes without blaming anyone, so things get better.
- How to make good on-call schedules so people don't get too tired.
- How to create a supportive team for on-call work.
Stop just reacting to problems and start engineering reliability. Whether you're a tech person who is on-call, a manager, or just curious, this book will give you clear advice and real examples. We want to help you build an on-call system that keeps your services running and your team feeling good.
This book contains 11 chapters:
- Chapter 1: Foundations: Why On-Call Matters & SRE Principles
- Chapter 2: Anatomy of an Incident: The Management Lifecycle
- Chapter 3: Effective Alerting: Strategy and Routing Using Versus Incident
- Chapter 4: Integrating Monitoring Sources and Escalation Policies: A Case Study
- Chapter 5: Measuring Reliability: SLIs, SLOs, and Error Budgets
- Chapter 6: Putting It All Together: Practical Examples of Unified Alerting & Templating
- Chapter 7: Learning from Failure: Blameless Postmortems
- Chapter 8: Sustainable On-Call: Scheduling and Managing Burnout
- Chapter 9: Effective Incident Communication
- Chapter 10: The On-Call Ecosystem: Tooling and Future Trends
- Chapter 11: On-Call in Action: Digital Customer Onboarding in Banking
Table of Contents
- Forewords
1. Foundations: Why On-Call Matters & SRE Principles
- 1.1 The "Always-On" Expectation and the Role of On-Call
- 1.2 Site Reliability Engineering (SRE) Philosophy: A Foundation for Sustainable On-Call
- 1.3 Book Scope and Structure
2. Anatomy of an Incident: The Management Lifecycle
- 2.1 Defining an Incident vs. a Service Request
- 2.2 The Incident Management Lifecycle Phases
- 2.3 The Dynamic Nature of Incident Management and the Critical Role of Alerting
3. Effective Alerting: Strategy and Routing
- 3.1 Introduction: Alerting as the SRE Frontline
- 3.2 Elaborate on Core Alerting Concepts
- 3.3 Explain Centralized Alert Routing
- 3.4 Introduce Versus Incident as an Example
- 3.5 Practical Example: Basic Alert Routing with Versus Incident
- 3.6 From Strategy to Implementation
4. Integrating Monitoring Sources and Escalation Policies
- 4.1 The Necessity of Unified Alerting in Complex Systems
- 4.2 Case Study Deep Dive: The Book Info System on AWS
- 4.3 Defining Service Level Objectives for EKS Microservices
- 4.4 Practical Alerting Strategies for EKS Microservices
- 4.5 Defining Service Level Objectives for the Data Pipeline
- 4.6 Practical Alerting Strategies for the Data Pipeline
- 4.7 Crafting Actionable Templates
- 4.8 Designing Effective Escalation Policies
- 4.9 Bridging Practice and Principle
5. Measuring Reliability: SLIs, SLOs, and Error Budgets
- 5.1 Service Level Indicators (SLIs): Quantifying User Happiness
- 5.2 Service Level Objectives (SLOs): Setting Reliability Targets
- 5.3 SLOs vs. SLAs: Internal Goals vs. External Promises
- 5.4 Error Budgets: The Currency of Reliability
- 5.5 Managing Error Budget Consumption: Budget Burn Rate
- 5.6 Practical Implementation: Choosing SLIs and Setting SLOs
- 5.7 SLOs and Error Budgets as Active Control Mechanisms
- 5.8 Bridging Reliability to Alerting
6. Practical Examples of Unified Alerting & Templating
- 6.1 Integrating Diverse Monitoring Sources
- 6.2 Crafting a Unified Notification with Go Templates
- 6.3 Bridging to Blameless Postmortems
7. Learning from Failure: Blameless Postmortems
- 7.1 The Philosophy of Blamelessness: Shifting Focus from People to Systems
- 7.2 The Value Proposition: Benefits of a Blameless Approach
- 7.3 Implementing Blameless Postmortems: A Practical Guide
- 7.4 Root Cause Analysis (RCA) Techniques
- 7.5 Sustaining the Blameless Culture: Overcoming Challenges
- 7.6 Example Postmortem Report: Orders Service Outage
- 7.7 Learning from the Field: Case Studies
- 7.8 Transitioning from Blamelessness to Sustainable Operations
8. Sustainable On-Call: Scheduling and Managing Burnout
- 8.1 The Toll of Unsustainable On-Call
- 8.2 Sustainable Schedules: Fairness, Flexibility, and Recovery
- 8.3 Beyond Rotations: Safety & Support
- 8.4 Leadership: Championing Well-being
- 8.5 Measuring On-Call Health
- 8.6 Continuous On-Call Improvement
- 8.7 The Virtuous Cycle: On-Call & Innovation
- 8.8 A Case Study
- 8.9 From Sustainable Practices to Effective Incident Response
9. Effective Incident Communication
- 9.1 Understanding the Basics: Key Parts of Communication
- 9.2 Strategy and Planning: Getting Ready Before Problems Hit
- 9.3 Communicating Well When Things Go Wrong
- 9.4 Real-World Example: Book Info System Database Down
- 9.5 Communication is Your On-Call Superpower
10. The On-Call Ecosystem
- 10.1 Understanding the On-Call Ecosystem
- 10.2 The Tooling Landscape
- 10.3 Future Trends Shaping the On-Call Ecosystem
- 10.4 Conclusion
11. On-Call in Action: Digital Customer Onboarding in Banking
- 11.1 The Digital Customer Onboarding Journey
- 11.2 Our Example: VKBank's System for DCO
- 11.3 Defining Meaningful Reliability: SLIs, SLOs, and Error Budgets
- 11.4 Alerting Strategy
- 11.5 Responding to Problems: Using AWS Incident Manager for On-Call
- 11.6 Integrating Versus Incident with AWS Incident Manager
- 11.7 Sustainable On-Call Practices for the DCO Team
- 11.8 Effective Incident Communication
- 11.9 The Journey to Operational Maturity
- Acknowledgments