Process automation

Computers excel at doing things fast, but people tend to trust them too much to do their work well. Small errors can pass undetected for a long time, accumulating into a time bomb. Perhaps even worse, pointing a computer in the wrong direction and letting it run off can quickly escalate small oversights into a major blunder.

Lots of things can cause bad automation, even with the best intentions of the people building the software. Third-party systems can send invalid, unexpected data. Migrating a legacy database may uncover lots of unforeseen edge cases. One part of the system can decide to go rogue and disrupt everything around it.

Short of a crystal ball, the best way to stop bad automation is to create an automated system of oversight. Build monitoring and alerting mechanisms that can spot when something out of the ordinary is happening, and get people in to investigate before it’s too late.

Here are some ideas on keeping automation in check:

Testing with small samples often doesn’t uncover all the data-driven issues of large legacy databases. If you’re converting a legacy database, run some basic characterisation statistics on the converted data and check with the domain experts whether things look all right. Remember the Grand Rapids hospital update that declared 5% of the population dead overnight.
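As a sketch of this idea, the following computes a few basic characterisation statistics for one field of a converted data set. The record shape and field names are assumptions for illustration; in practice you would run the same statistics on the legacy source and the converted copy, and show both to a domain expert.

```python
from collections import Counter

def characterise(records, field):
    """Basic characterisation statistics for one field of converted records.

    A sketch: `records` is a list of dicts and `field` a column name.
    Run the same statistics on the legacy source and compare the results.
    """
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(3),
    }

# A sudden jump in "missing", or an implausible dominant value (say, 5%
# of patients suddenly "deceased"), is worth showing to a domain expert
# before the converted data goes live.
```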

If your system is automatically processing financial transactions, put monitors in place to check for trends. Good candidates are the expected volume of fraud or number of purchases per hour. If things fall too far outside the expected range, alert a person – even if things look as if they’re in your favour. Remember the 610,000 Japanese yen fat-finger error and MiDAS fiasco.
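A minimal version of such a trend monitor might compare the current hour's count with a historical baseline; the tolerance here is illustrative, and a production system would use something more robust, such as a seasonality-aware baseline.

```python
def outside_expected_range(observed, history, tolerance=0.5):
    """Alert when an hourly metric drifts too far from its historical mean.

    A sketch: `history` is a list of past hourly counts and `tolerance`
    the allowed relative deviation (both values are illustrative).
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    mean = sum(history) / len(history)
    return abs(observed - mean) > tolerance * mean

# Alert even when the deviation looks favourable: fraud "dropping" to zero
# may just mean the fraud detector itself has stopped working.
```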

If your system is automatically changing some data, such as prices, put monitors in place to check that automated changes are inside a valid range. For example, alert a person if the price goes too low or too high. This will help you avoid cases such as the 28,639.14 Uber ride, or Repricer Express selling everything at $0.01.
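A range guard of this kind can be a few lines; the bounds below are placeholders for whatever the business rules say, and the key design choice is that an out-of-range change stops and goes to a person instead of being applied automatically.

```python
def guard_price_change(new_price, floor, ceiling):
    """Apply a price change only if it is inside the valid range.

    A sketch: `floor` and `ceiling` come from business rules. An
    out-of-range value raises, so the change is routed to a human
    instead of silently going live.
    """
    if not (floor <= new_price <= ceiling):
        raise ValueError(
            f"price {new_price} outside [{floor}, {ceiling}] - needs human review")
    return new_price
```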

Put monitors in place to check whether one of your systems is behaving significantly differently from the rest. For example, if a single trading processor is running 90% of the volume, get someone to investigate why before it’s too late. Remember how one of eight Knight Capital SMARS systems ran a previous version of the software and it almost bankrupted the company.
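One simple way to spot a rogue instance is to compare each system's share of the total workload; the 90% threshold below mirrors the example in the text and is purely illustrative.

```python
def dominant_systems(volumes_by_system, threshold=0.9):
    """Flag any system handling more than `threshold` of the total volume.

    A sketch: `volumes_by_system` maps a system id to its processed
    volume; the 0.9 threshold is illustrative.
    """
    total = sum(volumes_by_system.values())
    if total == 0:
        return []
    return [name for name, vol in volumes_by_system.items()
            if vol / total > threshold]
```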

Consider that speeding up a single part of a process might create problems downstream. For example, increasing the capacity to send out customer notifications can overload your call centre and create more problems than it solves, such as in the Centrelink robo-debt fiasco.

If you’re generating random outcomes and they need to fall within some expected business rules, make sure to check those rules before you publish the results. Random things are just that – random – and, in some cases, might be surprising. It’s potentially better to alert a person, or even to crash the system, than to directly use such unexpected values. Remember the Pepsi 349 lottery.
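A sketch of that check, with a hypothetical business rule: the drawn number must not create more winners than the prize budget allows. If the rule fails, the draw is not published and a person gets involved.

```python
import random

def draw_winning_number(tickets_by_number, max_winners, rng=random):
    """Draw a result, then check the business rule before publishing it.

    A sketch: `tickets_by_number` maps each possible number to how many
    tickets carry it, and `max_winners` is a hypothetical prize-budget
    cap. An over-budget draw raises instead of being published.
    """
    number = rng.randint(0, 999)
    winners = tickets_by_number.get(number, 0)
    if winners > max_winners:
        raise RuntimeError(
            f"number {number} would pay {winners} winners - escalate, do not publish")
    return number
```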

If you ever use sample data to validate or monitor your software, make sure that your tests are clearly identified, isolated and don’t end up matching any real-world cases. Remember Jeff Sample and the 50 police raids on Walter Martin’s house.
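One way to keep sample data isolated is a reserved marker that real records can never carry, plus a filter applied before anything real happens. The `ZZ-TEST` prefix here is a hypothetical convention.

```python
TEST_MARKER = "ZZ-TEST"  # hypothetical reserved prefix real ids never use

def is_test_record(record_id):
    """Sample records carry an explicit marker so they can never be
    mistaken for real-world cases."""
    return record_id.startswith(TEST_MARKER)

def records_for_processing(record_ids):
    """Filter test records out before any real-world action (dispatching,
    billing, or sending the police to an address) is triggered."""
    return [r for r in record_ids if not is_test_record(r)]
```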

Biometric matching isn’t magic, and biometric features aren’t necessarily unique. Unrelated people can look alike. Twins can trick smart photo algorithms or leave similar voice signatures; remember the Kennedy sisters.

Monitor whether third-party systems are sending you strange data. For example, check whether some values appear far more frequently than others. This will help to identify special markers for missing or invalid records, in particular where blank values aren’t allowed (remember the NO PLATE parking tickets). Check whether third-party data contains more than one entity where you expect only one (remember concurrent criminal sentences). Check whether data is out of the usual range (for example, a payment request for $23,148,855,308,184,500).
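The frequency check can be sketched as follows: in a field that should be nearly unique (such as licence plates), a sentinel value like NO PLATE will dominate. The 5% share threshold is illustrative.

```python
from collections import Counter

def suspicious_values(values, max_share=0.05):
    """Flag values appearing far more often than plausible for unique-ish data.

    A sketch: `max_share` (5% here, purely illustrative) is the largest
    share any single value should hold in a field expected to be nearly
    unique. Sentinel values like 'NO PLATE' stand out immediately.
    """
    counts = Counter(values)
    total = len(values)
    return [v for v, c in counts.items() if c / total > max_share]
```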

If you’re sending important messages through a third-party system, don’t just trust that the notifications are dispatched. Build a mechanism requiring recipients to confirm that the messages have actually been delivered. Not everyone will confirm, of course, but you will at least be able to monitor trends and see if something unexpected happens, such as 50,000 people mistakenly dropping off the system. Remember the Queensland OneSchool police e-mails.
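Since not everyone confirms, the useful signal is the trend in the confirmation rate rather than individual messages. A sketch, with an illustrative baseline and tolerance:

```python
def delivery_rate_dropped(sent, confirmed, baseline_rate, tolerance=0.2):
    """Compare this period's confirmation rate against a historical baseline.

    A sketch: `baseline_rate` is the fraction of recipients who usually
    confirm, and `tolerance` the allowed relative drop (both illustrative).
    A sharp drop suggests messages are silently going undelivered.
    """
    if sent == 0:
        return False
    rate = confirmed / sent
    return rate < baseline_rate * (1 - tolerance)
```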

If you’re working on a system that’s supposed to work unattended and autonomously, leave it running for a long period of time and check whether gremlins appear. And do consider shutting the whole system down if it is mission critical and loses the ability to control itself.

Whenever you’re using a slowly depleting but limited resource, make sure to build in monitoring, and send alerts when it starts getting dangerously low. For example, if you’re using a count-down timer, notify someone to restart it before it gets to zero. Don’t just rely on a published procedure for people to follow, because they might forget or have higher priorities at the time when things become critical.
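A resource monitor of this kind needs only a threshold and an alert path; the 20% warning level below is a placeholder, and the real value should leave enough time for a person to act before the resource hits zero.

```python
def check_resource(remaining, capacity, warn_fraction=0.2):
    """Return an alert message when a slowly depleting resource runs low.

    A sketch: `warn_fraction` (20% here, illustrative) should be chosen so
    someone has time to restart or refill before the count-down reaches zero.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if remaining / capacity <= warn_fraction:
        return f"ALERT: only {remaining}/{capacity} left - act now"
    return None
```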

If you’re using any kind of hard-coded accounts for development and testing, make sure they don’t somehow find their way into production software. Remember five blanks granting Xbox access.

As Porky Pig would say, ‘That’s all folks.’ I hope these examples tickled your imagination, and that they’ll inspire you to improve how you design, test and build software systems. If you’d like to dive further into any of the stories mentioned in this book, check out the articles and references in the following appendix.