Introduction

What is software operability and why should we care?

Software operability is the measure of how well a software system works when operating ‘live’ in production, whether that is the public cloud, a co-located datacentre, an embedded system, or a remote sensor forming part of an Internet of Things (IoT) network. We say that a software system with good operability works well and is operable. A highly operable software system is one that minimizes the time and effort needed for unplanned interventions (whether manual or automated) in order to keep the system running.

All too often teams and individuals ignore or downplay aspects of the live/production environment and operating procedures when building software, only for those aspects to cause problems when the software goes live. A focus on operability helps address these operational concerns, both preventing problems from happening in the first place and making our software systems more resilient to unexpected operating conditions.

“The most amazing abstractions, cleanest code, or beautiful algorithms are meaningless if your code doesn’t run well on production.” David Copeland (@davetron5000) [Copeland2013]

Software with a high level of operability is easy to deploy, test, and interrogate in Production. It provides us with the right amount of good-quality information about the state of the service being provided and exhibits predictable and non-catastrophic failure modes when under high load or abnormal conditions, even if those conditions were never foreseen. Systems with good software operability also lend themselves to rapid diagnosis and transparent recovery following a problem, because they have been built with operational criteria as first-class concerns.

Electric trains in Mallorca, Spain, built in 1929 by Siemens and still running in 2014: an example of a physical system with good operability - Photo Copyright (c) Matthew Skelton 2012

Software systems which follow operability good practice will tend to be simpler to operate and maintain and will make anticipation and diagnosis of errors straightforward [Crowley2012]. Good operability leads to a reduced lifetime cost of ownership and fewer operational problems compared to those software systems whose owners have prioritised functionality heavily over operational criteria.

Where can operability techniques be used?

The techniques and approaches we explore in this book work for many kinds of software systems: cloud-native, traditional on-premises enterprise IT, high-frequency/low-latency, mobile apps, desktop client-server, embedded systems, IoT devices, wearables, and industrial/medical applications. We’re fairly certain that the techniques also apply to nuclear power stations and space rockets, but these sectors are out of our experience! The important thing here is that these techniques are not specific to any particular set of technologies but work in many different contexts (sometimes requiring some adjustments, sometimes not).

How to use this book

Chapter 1 provides an overview of software operability: what it is, why we need to focus on operability, and what we can expect to gain from it.

Chapters 2-7 provide specific tried-and-tested techniques and practices for enhancing operability for teams building and running software systems, including:

  1. Add ‘hooks’ into software components for operational checks
  2. Have software development teams write a draft runbook
  3. Avoid expensive Production-only tooling
  4. Ensure that failure responses are gradual, graceful, and graph-able
  5. Treat logging as a first-class concern and a means of communication
  6. Have the product owner and developers on call for Production incidents
  7. Make operations activities more visible, for example using Kanban boards, ChatOps and graphing/alerting on operations (such as server restarts, load-balancer workarounds, etc.)
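
To make the first practice in the list above more concrete, here is a minimal sketch of what an operational ‘hook’ might look like. It is an illustration under stated assumptions rather than the approach described in the later chapters: the `OperationalChecks` registry, the `CheckResult` type, and the two example check functions are all hypothetical names invented for this sketch, and the queue depth is a hard-coded placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    """Outcome of one operational check (hypothetical type for this sketch)."""
    name: str
    healthy: bool
    detail: str


class OperationalChecks:
    """A small registry of named check functions.

    Hypothetical helper: each check returns a short detail string when the
    component is healthy and raises an exception when it is not.
    """

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], str]] = {}

    def register(self, name: str, check: Callable[[], str]) -> None:
        self._checks[name] = check

    def run_all(self) -> List[CheckResult]:
        results = []
        for name, check in self._checks.items():
            try:
                results.append(CheckResult(name, True, check()))
            except Exception as exc:
                # A failing check is reported, not allowed to crash the caller.
                detail = f"{type(exc).__name__}: {exc}"
                results.append(CheckResult(name, False, detail))
        return results

    def summary(self) -> dict:
        """Return a plain dict that could be logged or served by a status endpoint."""
        results = self.run_all()
        return {
            "status": "ok" if all(r.healthy for r in results) else "degraded",
            "checks": {
                r.name: {"healthy": r.healthy, "detail": r.detail} for r in results
            },
        }


def check_queue_depth() -> str:
    """Example check (assumed): a real one would query the actual queue."""
    depth = 3  # placeholder value for illustration only
    if depth > 100:
        raise RuntimeError(f"queue depth {depth} exceeds limit")
    return f"queue depth {depth}"


def check_database() -> str:
    """Example check (assumed) that always fails, to show the failure path."""
    raise ConnectionError("database unreachable")
```

Registering both example checks and calling `summary()` gives an overall status of "degraded", with a per-check entry that names the failing component and the reason. The design choice in this sketch is that a failing check is captured and reported rather than raised, so the hook itself stays usable while the system is unhealthy.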

Each chapter can be read independently and contains enough detail to be understood and acted on by itself, although reading the full book gives a more comprehensive understanding of the concepts and practices and how they relate to one another.

We suggest that Product Managers read the chapters in this order, starting with the ‘why’ of operability:

  1. Chapter 6 on operability for product management
  2. Chapter 1 on good operability
  3. Chapter 2 on core practices
  4. then other chapters as needed

Chapter 7 on Team Topologies could be useful in the context of a move towards a more flow-based organisational operating model.

Software Architects and System Architects can nudge teams in healthy directions by demonstrating and advocating for the use of good operability techniques. We recommend that architects read the chapters in this order:

  1. Chapter 1 on good operability
  2. Chapter 2 on core practices
  3. Chapter 4 on logging techniques
  4. Chapter 3 on run book collaboration
  5. then other chapters as needed

Note: it is becoming good practice for architects to act as an Enabling team rather than a separate ‘ivory tower’ silo. See Chapter 7 on Team Topologies for details.

Team Leads and engineers are likely aware of many of the practices in this book already, but the way in which the tools and techniques are used may be new. We recommend that Team Leads read the chapters in this order to internalise the nuance of the operability techniques:

  1. Chapter 4 on logging techniques
  2. Chapter 3 on run book collaboration
  3. Chapter 5 on operational checks
  4. Chapter 1 on good operability
  5. Chapter 2 on core practices
  6. then other chapters as needed

Remember that almost all the operability techniques in this book are focused on building shared awareness within and across teams (a social practice), not just the use of a tool (a technical practice).

For people within Enabling teams, we recommend reading the chapters in this order:

  1. Chapter 7 on Team Topologies
  2. Chapter 1 on good operability
  3. Chapter 2 on core practices
  4. then other chapters as needed

The focus of an Enabling team is to help other teams adopt and understand better approaches, so it’s vital that people in Enabling teams internalize the intent of the practices in this book: reduce operational overhead, improve outcomes for users, reduce rework, and increase the viability of each separate service or application.

Terminology:

Terminology may seem a lesser concern, but prolonged use of incorrect terms can lead to the wrong values and assumptions setting in.

We recommend:

  • Avoid the term ‘non-functional requirements’; use ‘operational features’ instead
  • Avoid a ‘production-ization’ or ‘hardening’ development phase; build in operational excellence from the beginning instead

‘Slow-burner’ organisational changes:

While we’ve tried to provide very practical approaches to improve operability in this book, we also acknowledge the need to change organizational culture, in particular:

  • Recognise that today’s distributed, multi-component, multi-dependency software systems demand team members with a deep skillset that takes time and dedication to either find or develop in-house
  • Treat software operations as a high-skill, value-add activity, not support tasks

What is covered in this book

We have not tried to cover every possible detail related to software operability; for a comprehensive guide to software operability, we recommend reading the books Patterns for Performance and Operability: Building and Testing Enterprise Software by Ford et al. [Ford2008], Release It by M. Nygard [Nygard2007], and Continuous Delivery by Jez Humble and Dave Farley [HumbleFarley2010].

What this guide does provide is a set of hands-on practices based on real-world, tried-and-tested experience across multiple organizations for teams to adopt (and adapt) in order to promote and enhance software operability.

In short, if you build or run software systems and care about how well they work, then this book is for you!

Why we wrote this book

Over the course of several years, working in both permanent roles and across numerous engagements with clients as consultants, it became increasingly obvious to us that many organisations were missing some almost intangible but fundamental understanding of what it takes to enhance the performance, deployability, reliability and all-round “working well”-ability of their software systems.

Many discussions and debates about the nuances of these ideas and concepts ensued, and shaped the content of this book over the course of several years. Some of these themes were reflected in the emerging ‘DevOps’ movement, but others are more related to critical thinking, engineering practice, and understanding human factors.

Approaching the question from our separate perspectives of Software Development (Matthew), DevOps (Alex), and IT Operations (Rob) enabled us to consider a wide range of team drivers, concerns and needs, leading us (we hope) to a greater understanding of the symbiosis of development and operations teams, and towards developing methods and techniques for enhancing this relationship.

In the end, we turned a lot of that discussion and thinking into this book and other articles in an effort to share some of the insights gained from our years of experience and to offer some help and hopefully practical advice to developers, operations staff, product managers, owners and others who were interested in getting the most usable software platforms they could.

Feedback and suggestions

We’d welcome feedback and suggestions for changes.

Please contact us at info@operabilitybook.com or on the Leanpub discussion at https://leanpub.com/SoftwareOperability/feedback

Matthew Skelton, Alex Moore, and Rob Thatcher - December 2019