Site Reliability Engineering Tidbits

Learn SRE Principles & Techniques for Observability, Monitoring, SLOs, Resilience and Debugging.

About the Book

Site Reliability Engineering is a relatively young discipline focused on treating operations as a software problem. Because it is so young, the SRE knowledge base is still growing. The goal is to make this book short, light and fun, but most importantly relevant.

Each chapter describes a Site Reliability Engineering concept in a short, easily digestible way. The aim is to give every software engineer information they can use to increase the reliability of the systems they work on.

Topics include: observability, monitoring, Service Level Objectives (SLOs), alerting, resilience and debugging.

These concepts have been at the core of my personal SRE journey, and my hope is that you will find them valuable too!

About the Author

Danny Mican

Hello! I’m Danny.

I have over 11 years of experience working with software. I’m a top-ranked contributor on Stack Overflow in Python, Django, JavaScript, Go and unit testing. During this time I've developed and led the development of dozens of successful software projects.

A couple of years ago I discovered technical writing. I regularly write tech blogs on Medium and my personal blog. I’ve also ghost-written content for some of the largest vendors in tech.

Thank you.

Table of Contents

  • Copyright
  • About the Author
  • Introduction
  • 15 months of 24x7 Primary On-Call — Here’s How I Survived
    • Background
    • Surface Actionable Metrics
    • Alert on Symptoms Not Causes
    • Ratios Rule - But Be Careful
    • Emulate The Customer Experience: Probes Probes Probes
    • Give Yourself Room to Fail - SLO Based Alerts
    • Conclusion
  • Debugging Memory Leaks Using Go
    • What is a Memory Leak?
    • Debug Process
    • Identification
    • Root Cause Analysis / Source Analysis
  • How Probes Partition the Debug Space
    • Probes
    • Output
    • Debugging Using Probes
  • Observability Metric Namespaces and Structures
    • Metric Spaces
    • Metric Trees
    • Defining a Metric in Terms of its Children
    • Increasingly Specific — Subsets of Data
    • Ratios Rule
    • It’s All in the Questions
    • Generic Metrics Enriched With Tags
    • Conclusion
  • Debugging: Getting To Impact Through SLOs
    • Phrasing the Impact in Terms of Client Impact
    • Guiding With SLOs
  • Deploying SLOs Across An Organization
    • What is an SLO?
    • Principles
    • Representative of the Client Experience
    • Actionable
    • Minimal Investment / Low Technical Overhead
    • Low Number of False Positives
    • Rollout Strategy
  • No Friction Application Observability Using Envoy
    • Problem
    • Envoy
    • Example
    • Conclusion
  • Alerting on SLOs
    • Terminology Refresher
    • Client Experience
    • Objective Quantities
    • Call to Action
    • Generic Tooling
    • Conclusion
  • Debugging Fundamentals: Profiling
    • What is Profiling?
    • Why Profile? - Risks of Not Profiling
    • How to Profile
    • Profile Profiles “drilling-down”
    • Conclusion
  • Performance Analysis: Tuning Methodology Using a Simple HTTP Webserver
    • Strategy
    • Simple HTTP Server Architecture
    • Determine Goals (Dimensions)
    • Setup the Test Harness
    • Observe
    • Execute/Observe/Analyze
    • Profile
    • Analysis - Hypothesis
    • Tune the Application - Experiment
    • Execute/Observe/Analyze
    • 2000 Requests / Second
    • 3000 Requests / second
    • Analysis - Hypothesis
    • Tune the application - Experiment
    • Execute/Observe/Analyze
    • Conclusion
  • Distributed Tracing: Impact on Engineering Organizations
    • Onboarding
    • Development
    • Operations
    • Conclusion
  • Dashboard Patterns: Aggregate View
    • Why Views?
    • So What’s an Aggregate View?
    • Throughput
    • Availability
    • Latency
    • Conclusion
  • Dashboard Patterns: Component Views
    • Purpose
    • Feedback Loops
    • In Practice
    • Approach
    • Conclusion
  • Why Capacity Planning Needs Queueing Theory (Without the Hard Math)
    • Problem
    • Capacity Planning Organizational Systems
    • Conclusion
  • Debugging Lambda File Descriptor Exhaustion
    • Background
    • A Strange Occurrence
    • AWS Support
    • Ensuring the Rollup Script Worked
    • Moving Forward
    • Starting to Debug
    • Back to Basics
    • Verifying Hypothesis
    • Bounding Resource Usage
    • Error Free!
  • Debugging Heuristics: Drivers of Increased Latency
    • Increase in the Amount of Work Being Done
    • Increase in the Type of Work Being Done
    • Change in the Amount of Work Performed in Each Transaction
    • Conclusion
  • Knowledge Graphs: Increased Context in Human Involved Incident Response
    • An Example
    • So What is an Incident Response (IR) Knowledge Graph?
    • Components
    • IR Knowledge Graphs In Practice
    • The Incident
    • Conclusion
  • Bolt on Rate Limiting
    • Protecting Resources
    • What is Envoy?
    • Solving Rate Limiting Using Envoy
    • Conclusion
  • Debugging Strategies: Triangulation
    • What is Triangulation?
    • Example Scenario
    • Heuristics
    • Conclusion
  • Debugging SQL Performance Using the “EXPLAIN” Statement
    • Methodology
    • Determine the Table Schema
    • Determine the Table Index
    • EXPLAIN the Query
    • Leveraging the Index
    • Predicate Query Missing Sortkey
    • Results
  • Stay on Top of Your ETL Pipelines With Table Freshness Checks
  • Detecting Resource Leaks With Baseline Tests
  • Data Operational Maturity
    • Maturity Model
    • Level 1 - Mechanism
    • Level 2 - Consistency
    • Level 3 – Accuracy
    • Conclusion
  • Bulkheads in Action — Partitioning to Minimize Failure Impact
    • What are Bulkheads?
    • Why Use Bulkheads
    • How?
    • When to Use?
  • Retries in Action: Availability in Exchange for Latency
    • What are Retries?
    • Why Use Retries?
    • How?
    • When to Use?
    • Caveats
  • Probing 101
    • Uptime Probes
    • What Probes Don’t Do
    • Purpose of Probes
    • How to start probing
    • Uses
    • Conclusion
  • Using Views for Backwards Compatible Data Migrations
    • Common Database Clients
    • Leveraging Views
    • Example of a View Based Migration
    • Conclusion
