References and further reading

Introduction

Ford2008 - C. Ford, I. Gileadi, S. Purba, and M. Moerman, Patterns for Performance and Operability: Building and Testing Enterprise Software. Boca Raton: Auerbach Publications, 2008.

HumbleFarley2010 - J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, 1st edition. Upper Saddle River, NJ: Addison-Wesley, 2010.

Nygard2007 - M. T. Nygard, Release It!: Design and Deploy Production-Ready Software, 1st edition. Raleigh, N.C.: Pragmatic Bookshelf, 2007.

Chapter 1 - What does good operability look like?

Beyer2016 - Beyer, Betsy, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. Site Reliability Engineering. 1st edition. Beijing; Boston: O’Reilly, 2016.

Black2012 - Black, Edwin. “IBM’s Role in the Holocaust – What the New Documents Reveal.” Huffington Post (blog), February 27, 2012. https://www.huffingtonpost.com/edwin-black/ibm-holocaust_b_1301691.html

Ford2008 - Ford, Chris, Ido Gileadi, Sanjiv Purba, and Mike Moerman. Patterns for Performance and Operability: Building and Testing Enterprise Software. 1st edition. Boca Raton: Auerbach Publications, 2008.

Gawande2011 - A. Gawande, The Checklist Manifesto: How to Get Things Right, Reprint edition. New York: Picador, 2011.

Leveson2017b - Leveson, Nancy G. 2017. Engineering a Safer World: Systems Thinking Applied to Safety. Reprint edition. Cambridge, Massachusetts; London, England: MIT Press.

Nygard2018 - Nygard, Michael T. Release It! Design and Deploy Production-Ready Software. 2nd edition. Raleigh, North Carolina: O’Reilly, 2018.

Chapter 2 - Core Operability Practices

Allspaw2010 - Allspaw, John, and Jesse Robbins. 2010. Web Operations. O’Reilly Media. http://shop.oreilly.com/product/0636920000136.do

Binette2018 - Binette, Elisa. 2018. “Your Guide to Setting SLOs and SLIs.” New Relic Blog (blog). October 31, 2018. https://blog.newrelic.com/engineering/best-practices-for-setting-slos-and-slis-for-modern-complex-systems/

Cohn2008 - M. Cohn, “Non-functional Requirements as User Stories”, 21-Nov-2008. [Online]. Available: http://www.mountaingoatsoftware.com/blog/non-functional-requirements-as-user-stories [Accessed: 12-May-2014].

Cohn2016 - Cohn, Mike. “What Are Story Points?” Mountain Goat Software. August 23, 2016. https://www.mountaingoatsoftware.com/blog/what-are-story-points

Davies2010 - R. Davies, ‘Non-Functional Requirements: Do User Stories Really Help?’, 2010. [Online]. Available: http://www.methodsandtools.com/archive/archive.php?id=113 [Accessed: 12-May-2014]

Geckoboard2020 - Kukula, Laura. 2020. “9 dashboard design principles: see them in action with real examples.” Geckoboard. February 20, 2020. https://www.geckoboard.com/blog/9-dashboard-design-principles-see-them-in-action-with-real-examples/

Gilb2009 - T. Gilb, “Are non-functional requirements functional?”, Tom Gilb and Kai Gilb’s blog, 18-Jan-2009. [Online]. Available: http://www.gilb.com/blogpost70-Are-non-functional-requirements-functional [Accessed: 12-May-2014].

Humble2010b - Humble, Jez, and David Farley. 2010. “Continuous Delivery: Anatomy of the Deployment Pipeline.” http://www.informit.com/articles/article.aspx?p=1621865

Kubernetes2018 - Kubernetes. 2018. “Configure Liveness and Readiness Probes.” July 2018. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

LonWorks2009 - Echelon Corporation. 2009. “Introduction to the LonWorks® Platform.” http://downloads.echelon.com/support/documentation/manuals/general/078-0183-01B_Intro_to_LonWorks_Rev_2.pdf

Mitchell2017 - Ian Mitchell, “Walking Through a Definition of Done.” Scrum.Org. May 17, 2017. http://www.scrum.org/resources/blog/walking-through-definition-done

Murphy2016 - Murphy, Niall, Betsy Beyer, Chris Jones, and Jennifer Petoff. 2016. Site Reliability Engineering. O’Reilly Media. http://shop.oreilly.com/product/0636920041528.do

SAFe2014 - ‘Scaled Agile Framework’. [Online]. Available: http://scaledagileframework.com/ [Accessed: 12-May-2014].

SkeltonPaisLoggin2016 - Matthew Skelton and Manuel Pais. 2016. ‘Why and How to Test Logging’. InfoQ. https://www.infoq.com/articles/why-test-logging/

Chapter 3 - Use Run Book collaboration to increase operability and prevent operational issues

Gawande2011b - A. Gawande, The Checklist Manifesto: How to Get Things Right, Reprint edition. New York: Picador, 2011.

Goldschrafe2011 - J. Goldschrafe, 2011 ‘Runbooks are stupid and you’re doing them wrong’ 19-08-2011. [Online] http://holyhandgrenade.org/blog/2011/08/runbooks-are-stupid-and-youre-doing-them-wrong/ [Accessed 27-Oct-2016]

Kelly2016 - A. Kelly, 2016 ‘Dialogue Sheets’. [Online] https://www.softwarestrategy.co.uk/dialogue-sheets/ [Accessed 24-Oct-2016]

Chapter 4 - Use modern log aggregation for deep operational insights

Black2016 - David Black, 2016 - “A ‘Log Reflector’ for AWS Lambda” http://blog.davidablack.net/2016/08/17/a-log-reflector-for-aws-lambda/

Boswell2017 - Drew Boswell, 2017 - ‘A Million Metrics per Second’ [Online] https://medium.com/swissquote-engineering/a-million-metrics-per-second-17a4c7274062 [Accessed 28-Jan-2018]

Bourgon2017 - Peter Bourgon, 2017 - ‘Metrics, tracing, and logging’ http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Cheney2015 - Dave Cheney, 2015 - “Let’s Talk About Logging” https://dave.cheney.net/2015/11/05/lets-talk-about-logging

Cui2017 - Yan Cui, 2017 - “Yubl’s Road to Serverless, Part 3 - Ops” https://hackernoon.com/yubls-road-to-serverless-part-3-ops-6c82139bb7ee

DeBortoli2017 - Alberto De Bortoli, 2017 - “A better local and remote logging on iOS with JustLog” https://tech.just-eat.com/2017/01/18/a-better-local-and-remote-logging-on-ios-with-justlog/

Degioanni2015 - Degioanni, Loris. 2015. “How to Collect StatsD Metrics in Containers.” Sysdig. June 3. https://sysdig.com/blog/how-to-collect-statsd-metrics-in-containers/

GrahamCumming2017 - John Graham-Cumming, 2017 ‘Incident report on memory leak caused by Cloudflare parser bug’ https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/

Golubenco2016 - Tudor Golubenco, 2016 “Structured logging with Filebeat” https://www.elastic.co/blog/structured-logging-filebeat

Kreps2013 - Jay Kreps, 2013 ‘The Log: What every software engineer should know about real-time data’s unifying abstraction’ https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Majors2017 - Charity Majors, 2017. “Lies My Parents Told Me (About Logs)” [Online] https://honeycomb.io/blog/2017/04/lies-my-parents-told-me-about-logs/ [Accessed 28-Jan-2018]

OpenTracing2017 - OpenTracing. 2017. “Introduction · Opentracing.” 2017. http://opentracing.io/documentation/#what-is-a-trace

Paraschiv2017 - Eugen Paraschiv, 2017 ‘9 Logging Sins in Your Java Applications’. [Online] https://dzone.com/articles/9-logging-sins-in-your-java-applications [Accessed 28-Jan-2018]

Pivotal2017 - Pivotal, 2017 “Monitoring and Troubleshooting Apps with PCF Metrics | Pivotal Web Services Docs.” Pivotal Web Services Documentation. http://docs.run.pivotal.io/metrics/using.html#trace.

Rapid72016 - Rapid7, 2016 “Logging Mosquitto Server logs (from Raspberry Pi) to Logentries” https://blog.rapid7.com/2016/10/07/logging-mosquitto-server-logs-from-raspberry-pi-to-logentries/

Reselman2016 - Bob Reselman, 2016 ‘The Value of Correlation IDs’ https://blog.logentries.com/2016/12/the-value-of-correlation-ids/

Skelton2012 - Matthew Skelton, 2012 “Tune Logging Levels In Production Without Recompiling Code” https://blog.matthewskelton.net/2012/12/05/tune-logging-levels-in-production-without-recompiling-code/

Sridharan2017 - Cindy Sridharan, 2017 - “Logs and Metrics” https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38

Swan2017 - Chris Swan, 2017 “Operational Considerations for Containers”. [Online] https://www.infoq.com/presentations/containers-operations [Accessed 28-Jan-2018]

Turnbull2015 - James Turnbull, 2015 ‘Structured Logging’. [Online] https://kartar.net/2015/12/structured-logging/ [Accessed 28-Jan-2018]

Chapter 5 - Use well-defined readiness checks to increase operational confidence

TBC

Chapter 6 - Use operability as a differentiating aspect of your software

KeepItUsable - “Personas: Why is it important to understand your users?”. https://www.keepitusable.com/blog/personas-why-is-it-important-to-understand-your-users/

Chapter 7 - Use patterns and principles from Team Topologies to inform the approach to operability

TBC

Appendix

AgileManifesto2001 - Manifesto for Agile Software Development. 2001. http://agilemanifesto.org/

Leveson2017a - Leveson, Nancy G. 2017. Engineering a Safer World: Systems Thinking Applied to Safety. Reprint edition. Cambridge, Massachusetts; London, England: MIT Press.

Mogul2006 - Mogul, Jeffrey C. 2006. “Emergent (Mis)Behavior vs. Complex Software Systems.” In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, 293–304. EuroSys ’06. New York, NY, USA: ACM. https://doi.org/10.1145/1217935.1217964