Email the Author
You can use this page to email Quan Huynh about On-Call In Action.
About the Book
In today's "always-on" world, downtime is not an option. Your users expect seamless service, 24/7. Your business depends on it. But how do you guarantee that reliability when complex systems inevitably encounter turbulence? The answer lies in a world-class on-call capability.
"On-Call In Action" is your practical playbook for building just that. This isn't just another theoretical tome; it's a hands-on guide to navigating the high-stakes reality of modern on-call. We'll equip you with the SRE principles, incident management lifecycles, and effective alerting strategies (leveraging the Versus Incident project as our real-world example) that form the backbone of resilient operations.
It is also a friendly guide to making on-call work better. We'll show you:
- Why being on-call is so important.
- What to do when a problem (we call it an "incident") happens.
- How to set up good alerts so you only get called for big problems.
- How to check if your services are running well (using simple goals).
- How to learn from mistakes without blaming anyone, so things get better.
- How to make good on-call schedules so people don't get too tired.
- How to create a supportive team for on-call work.
Stop just reacting to problems and start engineering reliability. Whether you're a tech person who is on-call, a manager, or just curious, this book will give you clear advice and real examples. We want to help you build an on-call system that keeps your services running and your team feeling good.
This book contains 11 chapters:
- Chapter 1: Foundations: Why On-Call Matters & SRE Principles
- Chapter 2: Anatomy of an Incident: The Management Lifecycle
- Chapter 3: Effective Alerting: Strategy and Routing Using Versus Incident
- Chapter 4: Integrating Monitoring Sources and Escalation Policies: A Case Study
- Chapter 5: Measuring Reliability: SLIs, SLOs, and Error Budgets
- Chapter 6: Putting It All Together: Practical Examples of Unified Alerting & Templating
- Chapter 7: Learning from Failure: Blameless Postmortems
- Chapter 8: Sustainable On-Call: Scheduling and Managing Burnout
- Chapter 9: Effective Incident
- Chapter 10: The On-Call Ecosystem: Tooling and Future Trends
- Chapter 11: On-Call in Action: Digital Customer Onboarding in Banking
About the Author
I am a DevOps Lead at Vikki Digital Bank with extensive experience in designing, building, and managing mission-critical infrastructure for digital banking products on Amazon Web Services (AWS). My professional journey has been deeply rooted in the financial sector, where security, resilience, and scalability are paramount.
As a founder of DevOpsVN, my goal is to empower the tech community by simplifying complex concepts and offering actionable, real-world solutions. Through my writing, I strive to bridge the gap between theory and practice, helping others navigate the ever-evolving landscape of modern technology.