Site Reliability Engineering Tidbits
Learn SRE Principles & Techniques for Observability, Monitoring, SLOs, Resilience and Debugging.
About the Book
Site Reliability Engineering is a relatively young discipline focused on treating operations as a software problem. Because it is so young, the SRE knowledge base is still growing. The goal is to make this book short, light and fun, but most importantly relevant.
Each chapter describes one Site Reliability Engineering concept in a short, easily digestible way, and aims to give every software engineer information they can use to increase the reliability of the systems they work on.
Topics include: observability, monitoring, Service Level Objectives (SLOs), alerting, resilience and debugging.
These concepts have been at the core of my personal SRE journey, and my hope is that you will find them valuable too!
Table of Contents
- Copyright
- About the Author
- Introduction
- 15 months of 24x7 Primary On-Call — Here’s How I Survived
- Background
- Surface Actionable Metrics
- Alert on Symptoms Not Causes
- Ratios Rule - But Be Careful
- Emulate The Customer Experience: Probes Probes Probes
- Give Yourself Room to Fail - SLO Based Alerts
- Conclusion
- Debugging Memory Leaks Using Go
- What is a Memory Leak?
- Debug Process
- Identification
- Root Cause Analysis / Source Analysis
- How Probes Partition the Debug Space
- Probes
- Output
- Debugging Using Probes
- Observability Metric Namespaces and Structures
- Metric Spaces
- Metric Trees
- Defining a Metric in Terms of its Children
- Increasingly Specific — Subsets of Data
- Ratios Rule
- It’s All in the Questions
- Generic Metrics Enriched With Tags
- Conclusion
- Debugging: Getting To Impact Through SLOs
- Phrasing the Impact in Terms of Client Impact
- Guiding With SLOs
- Deploying SLOs Across An Organization
- What is an SLO?
- Principles
- Representative of the Client Experience
- Actionable
- Minimal Investment / Low Technical Overhead
- Low Number of False Positives
- Rollout Strategy
- No Friction Application Observability Using Envoy
- Problem
- Envoy
- Example
- Conclusion
- Alerting on SLOs
- Terminology Refresher
- Client Experience
- Objective Quantities
- Call to Action
- Generic Tooling
- Conclusion
- Debugging Fundamentals: Profiling
- What is Profiling?
- Why Profile? - Risks of Not Profiling
- How to Profile
- Profile Profiles “drilling-down”
- Conclusion
- Performance Analysis: Tuning Methodology Using a Simple HTTP Webserver
- Strategy
- Simple HTTP Server Architecture
- Determine Goals (Dimensions)
- Setup the Test Harness
- Observe
- Execute/Observe/Analyze
- Profile
- Analysis - Hypothesis
- Tune the Application - Experiment
- Execute/Observe/Analyze
- 2000 Requests / Second
- 3000 Requests / Second
- Analysis - Hypothesis
- Tune the Application - Experiment
- Execute/Observe/Analyze
- Conclusion
- Distributed Tracing: Impact on Engineering Organizations
- Onboarding
- Development
- Operations
- Conclusion
- Dashboard Patterns: Aggregate View
- Why Views?
- So What’s an Aggregate View?
- Throughput
- Availability
- Latency
- Conclusion
- Dashboard Patterns: Component Views
- Purpose
- Feedback Loops
- In Practice
- Approach
- Conclusion
- Why Capacity Planning Needs Queueing Theory (Without the Hard Math)
- Problem
- Capacity Planning Organizational Systems
- Conclusion
- Debugging Lambda File Descriptor Exhaustion
- Background
- A Strange Occurrence
- AWS Support
- Ensuring the Rollup Script Worked
- Moving Forward
- Starting to Debug
- Back to Basics
- Verifying Hypothesis
- Bounding Resource Usage
- Error Free!
- Debugging Heuristics: Drivers of Increased Latency
- Increase in the Amount of Work Being Done
- Increase in the Type of Work Being Done
- Change in the Amount of Work Performed in Each Transaction
- Conclusion
- Knowledge Graphs: Increased Context in Human Involved Incident Response
- An Example
- So What is an Incident Response (IR) Knowledge Graph?
- Components
- IR Knowledge Graphs In Practice
- The Incident
- Conclusion
- Bolt on Rate Limiting
- Protecting Resources
- What is Envoy?
- Solving Rate Limiting Using Envoy
- Conclusion
- Debugging Strategies: Triangulation
- What is Triangulation?
- Example Scenario
- Heuristics
- Conclusion
- Debugging SQL Performance Using the “EXPLAIN” Statement
- Methodology
- Determine the Table Schema
- Determine the Table Index
- EXPLAIN the Query
- Leveraging the Index
- Predicate Query Missing Sortkey
- Results
- Stay on Top of Your ETL Pipelines With Table Freshness Checks
- Detecting Resource Leaks With Baseline Tests
- Data Operational Maturity
- Maturity Model
- Level 1 - Mechanism
- Level 2 - Consistency
- Level 3 - Accuracy
- Conclusion
- Bulkheads in Action — Partitioning to Minimize Failure Impact
- What are Bulkheads?
- Why Use Bulkheads
- How?
- When to Use?
- Retries in Action: Availability in Exchange for Latency
- What are Retries?
- Why Use Retries?
- How?
- When to Use?
- Caveats
- Probing 101
- Uptime Probes
- What Probes Don’t Do
- Purpose of Probes
- How to Start Probing
- Uses
- Conclusion
- Using Views for Backwards Compatible Data Migrations
- Common Database Clients
- Leveraging Views
- Example of a View Based Migration
- Conclusion