Mr Test
An ominous preview of Knight Capital’s meltdown happened almost a year earlier, illustrating another key risk for automated decision systems. One of Knight’s primary sources of revenue was market making for many smaller exchange-traded funds. Market makers play a crucial role in financial exchanges by guaranteeing to buy or sell at a certain price, ensuring that there is always a counterparty willing to match an offer, even for less popular financial instruments. Knight Capital was a key player in one of the biggest stock exchanges in the world, so it had to ensure that its systems could keep working even in the event of data centre problems. In October 2011, Knight decided to test its disaster recovery plans. The test happened over a weekend, outside normal working hours, and was a success. What followed was anything but.
Market making is a high-volume business, mostly automated, so the engineers used a large set of test data to simulate a realistic flow of trading requests. Everything worked well, and people went home knowing that their recovery procedures could survive even a small disaster. Inadvertently, however, they had set a disaster in motion for the next trading day. Someone had forgotten to remove the test data after the experiment. When trading resumed on Monday morning, Knight’s computers continued to use the test data to match offers from the exchange. As a result, Knight lost more than US$7 million before someone spotted what was going on.
Although leaving test data in the live system was a mistake on Knight’s part, software is often deliberately built to support running tests alongside real work. The more complex a system, the more likely it is to break at the seams. Having a way to place a test order or book a test trade is a cheap and effective way to check that everything is working, effectively putting some much-needed automated oversight around algorithmic decision making. To make that idea work, however, it’s critical that the system can actually recognise the test cases.
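One common way to do that is to mark test records explicitly rather than trying to infer them from the data itself. The sketch below is only illustrative (the `Order` structure and field names are hypothetical, not taken from any of the systems in these stories), but it shows the basic idea of an explicit test flag that every downstream step can check.

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    instrument: str
    quantity: int
    is_test: bool = False  # explicit flag, never inferred from names or codes

def execute(order: Order) -> None:
    if order.is_test:
        # Test orders exercise the whole pipeline but never reach the exchange.
        print(f"[TEST] simulated fill for {order.order_id}")
        return
    print(f"routing {order.quantity} x {order.instrument} ({order.order_id}) to the exchange")

# A test order used to verify that the system is alive end to end...
execute(Order("T-001", "ACME", 100, is_test=True))
# ...and a real order, which must never be mistaken for a test.
execute(Order("R-001", "ACME", 100))
```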
Computers at Hartsfield–Jackson International Airport in Atlanta failed to spot such a test on 19 April 2006, causing travel chaos far beyond Atlanta. To prove that the security systems and staff are not asleep, the luggage X-ray machines at the airport occasionally show images of suspicious devices. Normally, the computer identifies the suspect device and, a few moments later, warns that the alarm is part of a test. That Wednesday, however, the computer failed to flag the image as a test. The Transportation Security Administration agent screening luggage noticed something that looked like a bomb, but couldn’t find a bag that matched the image. He alerted a supervisor, and the two of them went through all the luggage on the conveyor belt again. The test bag invented by the computer wasn’t there, of course. This was too strange to ignore, so they escalated the problem to the security director, who decided to call the Atlanta police bomb squad. Passengers had to evacuate the terminal, and all flights were grounded for two hours. Hartsfield–Jackson is the busiest airport in the world, so the delayed flights caused a knock-on effect that disrupted travel around the world.
Test data problems can stay under the radar for a long time. The US Securities and Exchange Commission fined Citigroup more than US$7 million in 2016 because of a software glitch that had caused the bank’s Global Markets division to report regulatory data incorrectly for 15 years. Citigroup Global Markets assigned test trades to special bank branch codes, ranging from 089 to 100. In 1999, the bank changed from purely numeric to alphanumeric branch codes. Some real branches ended up with codes starting with the number 10 followed by a letter, and the regulatory reporting software incorrectly classified them as test branches, silently excluding their trades from the ‘blue sheet’ reports sent to regulators.
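The SEC order does not publish the faulty code, so the following sketch is only a guess at the shape of the bug (the function names and the prefix shortcut are assumptions): inferring ‘this is a test’ from the content of a branch code is fragile, because any later change to the coding scheme can silently widen the match, whereas checking membership in an explicit list of test branches cannot.

```python
# Test trades were booked to branch codes 089-100 (from the story above).
TEST_BRANCH_CODES = {f"{n:03d}" for n in range(89, 101)}  # '089' ... '100'

def is_test_branch_buggy(branch_code: str) -> bool:
    # Hypothetical shortcut: only the leading characters are inspected, so a
    # real alphanumeric branch such as '10B' looks just like the test branch
    # '100', and its trades vanish from the regulatory report.
    return branch_code[:2] in {"08", "09", "10"}

def is_test_branch(branch_code: str) -> bool:
    # Exact membership in a closed list: new alphanumeric codes can never be
    # swept up by accident.
    return branch_code in TEST_BRANCH_CODES

for code in ("095", "100", "10B", "205"):
    print(code, "buggy filter:", is_test_branch_buggy(code),
          "exact filter:", is_test_branch(code))
```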
People sometimes invent test cases that they believe could never occur in real life, only to discover that their assumptions about the world were wrong. James Bach got a parking ticket from the city of Everett on 16 December 2010, even though he’d never parked there. A county clerk confirmed that the ticket was in the system, but was confused by the case number. All tickets in Everett start with the number 10, but this one was 111111111. It turned out that the city of Everett had started using a new automated ticketing system just a few days before the alleged violation. Someone had evidently tried it out by issuing a made-up ticket that was easy to type in, which is why the case number was all 1s. To make sure the ticket was clearly flagged as a test case, the tester issued it for the licence plate TESTER. Bach, a well-known software testing consultant and author, happens to have a custom licence plate reading exactly that. Luckily, the clerk quickly recognised the error, and a judge dismissed the case.
The way to avoid such tunnel vision caused by idealised test data is to test software upgrades using real-world examples. However, this can create huge problems if the test cases are not clearly identified. On 16 March 2010, New York police raided a house in Marine Park in Brooklyn. The house had been raided more than 50 times in the previous eight years, so New York Police Department officers were prepared for heavy resistance. Instead, they found only Rose and Walter Martin, both over 80 years old.
The Martins had got used to the police banging on their doors, sometimes up to three times a week. On paper, the address looked like a hotbed of crime, but in fact this was all caused by a software test gone wild. In 2002, the police had used the Martins’ address as part of a random data sample to test a new software system, but forgot to remove the test records afterwards. As a result, officers from all over New York started showing up in Marine Park looking for suspects.
In 2007, Rose wrote about the harassment to the Police Commissioner, Ray Kelly, warning that her husband’s blood pressure problems could lead to a heart attack if the house were raided again. Commissioner Kelly ordered investigators to remove the Martins’ address from their systems, but this turned out to be more difficult than expected. By that time, the records had already been exchanged with many other police systems and copied into lots of different places. After the raid in 2010, Commissioner Kelly visited the Martins personally to apologise. When NBC TV picked up the story, even the New York City Mayor, Michael Bloomberg, publicly acknowledged the problem. Instead of trying to clean up the test data further, police officials flagged the address with an alert, so that officers would have to double-check any future visit with their superiors. In the end, it was easier to change the police process than to fix a test data problem in software.
The problem with test cases co-existing with real data gets even weirder when several systems need to talk to each other, because tests in one system are not recognised in another. This was the case of James Test, whose flight booking with American Airlines kept disappearing into a void. ‘The booking would last only long enough to process my credit card, then fade to just a test’, complained Mr Test to The Wall Street Journal. Jeff Sample ran into a similar problem caused by disagreements between the computer systems of his travel agent, an airline in Argentina, and a bank. The airline processed his flight booking from Buenos Aires to Patagonia, and took the payment from his credit card, but another system then falsely flagged it as a test case and deleted the ticket. Even worse, the flight booking system no longer recognised the card charge, so Sample had problems getting a refund.
Sometimes, the only way to inspect a complex set of computer systems is to allow special test cases to exist alongside real data. But this approach can backfire badly if the tests end up matching real-world usage. It is particularly dangerous if test data can also be used for authentication, as the next story shows.