October 27, 2013

Empathy

Tags: Memories, Technical, Finance

Last year Knight Capital released a bug into one of their automated equity trading systems, resulting in the company losing $460 million in under an hour. As someone who has written automated trading systems and had them used in anger, I definitely took notice at the time, although without further information it was easy to dismiss the incident's relevance. Now, as the result of an SEC investigation, details of the incident have been published online
(with the most salient details extracted here).

I was always incredibly nervous when any of my “dangerous” code went into production. That fear meant the code was reasonably well-tested (by me). I was always part of the team releasing it and supporting it early the next morning. Over 7 years, no one realised a loss due to a bug in my code. There may have been some lost opportunities, but no actual loss. Still, so many of the failings at Knight Capital sound familiar. It would not have been impossible for a similar situation to occur at one of the places I worked.

Below are my experiences related to the failings identified:

  • There is the impression of time pressure. The new system had to be released by a fixed date to match a change in an external market, and no doubt management were pushing staff to just get it done. In front-office teams there were usually more tasks than people available to complete them, so the most common project size was one person (meaning there was no one to check your work). Rarely was there time to go back and fix things; it was assumed that the work would be done fast and right the first time. Mistakes did occur.
  • A flag on the messaging system was repurposed. Why not just create a new one to ensure no interaction with systems using the old flag? I have seen this done many times and once did it myself. No one wanted to do it, but it saved time. Messaging infrastructure was always handled by a separate back-office team, and at one bank the message schema was only updated every few months. If a coder needed a new field before that update, they had to reuse an old one; there was little other option. A sketch of how a repurposed flag can go wrong follows this list.
  • An old system lay unused for many years. All the banks had numerous little programs listening to the message bus and working away. All the places I worked were very keen to shut down any unused systems to free up computing resources. However, it is easy to imagine one (or a bit of unused functionality in an otherwise used system) slipping through the cull. With the average team member only hanging around for a few years, knowledge of these old systems was soon lost. Worse, other teams sometimes created systems relying on our messages or databases without our knowledge (we couldn’t restrict read-only access). We only discovered them when they phoned to complain after we changed something (and we told them where to go).
  • The new system was not deployed to one of the production servers. Clearly the release process should be run from a script that checks the result is correct and consistent (a sketch of such a check also follows this list). One bank had a manual release process: developers would wait until all the traders had gone home and then manually copy and update files. At least a shared drive was used by all the servers, so only one copy was required. Still, it was error prone (releases regularly failed), and hopefully they have changed it since. Small improvements were made while I was there, but no one had the time to fix it completely.
  • When errors started occurring, 97 emails were sent detailing the problem - all of them ignored. I’d be interested to know how many other emails were sent. There were always floods of spurious error messages caused by people logging problems at the wrong level or just covering their behind (“but it sent an error email - someone should have done something”). At two banks I got over a hundred “error” messages per day from various systems. At first they really caused concern, but others said they could be safely ignored - so they were ignored. It drove support people crazy, so there were constant attempts to fix the noise, with varying degrees of success.
  • The issue was compounded by removing the new system from the production servers. This is so understandable in banking IT culture. If there is a problem just after a release, the assumption is always that the release caused it. Thus the release should be “rolled back”, returning the environment to the pre-release state; trading can then continue with the old version while IT examines the cause of the problem with less immediate concern. The pressure to roll back can be intense, even if people are certain the problem is not in the release - you need to be very sure of yourself. It is like the old saying “no one ever lost their job buying IBM”, only not quite in this case.
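
Here is a minimal Python sketch of the repurposed-flag hazard, using a made-up field name (flag_17) and hypothetical consumers rather than anything from the actual incident: an old, forgotten listener still applies the flag's original meaning, so setting it for the new system quietly reactivates the old behaviour.

    # Old consumer, never retired: still applies the flag's original meaning.
    def legacy_consumer(message: dict) -> str:
        if message.get("flag_17"):       # original meaning: run the legacy algo
            return "activate legacy_algo"
        return "ignore"

    # New consumer: the same flag has been given a new meaning.
    def new_consumer(message: dict) -> str:
        if message.get("flag_17"):       # repurposed meaning: use the new router
            return "route via new_system"
        return "ignore"

    order = {"symbol": "XYZ", "qty": 100, "flag_17": True}

    # Both consumers see the same message; the forgotten one springs back to life.
    print(legacy_consumer(order))   # -> activate legacy_algo
    print(new_consumer(order))      # -> route via new_system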
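
And here is a minimal sketch of the kind of consistency check a release script could run, assuming hypothetical host names and artifact paths: compare a checksum of the build output against the copy that ended up on each production server, and refuse to complete the release if any server is missing it or holds a stale version.

    import hashlib
    from pathlib import Path

    # Hypothetical mount points for each production server's copy of the release.
    SERVERS = {
        "prod-eq-01": Path("/mnt/prod-eq-01/trading/app.jar"),
        "prod-eq-02": Path("/mnt/prod-eq-02/trading/app.jar"),
        "prod-eq-03": Path("/mnt/prod-eq-03/trading/app.jar"),
    }

    def sha256(path: Path) -> str:
        """Checksum of a deployed file."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def verify_release(expected: str) -> bool:
        """Fail loudly if any server is missing the release or has a stale copy."""
        ok = True
        for host, path in SERVERS.items():
            if not path.exists():
                print(f"{host}: artifact missing - release incomplete")
                ok = False
            elif sha256(path) != expected:
                print(f"{host}: checksum mismatch - stale or partial copy")
                ok = False
            else:
                print(f"{host}: OK")
        return ok

    if __name__ == "__main__":
        expected = sha256(Path("build/app.jar"))  # checksum of the build output
        raise SystemExit(0 if verify_release(expected) else 1)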

I found front-office development strongly path-dependent, as we rarely had time to fix or improve old non-bug issues in code. We were always moving forward, relying on good staff (rather than process or automation) to avoid or ameliorate problems. Over time things improved - every bank had a better environment than the previous one. I never had a major issue. It would be nice to think this is the result of ability and competence, but reports on incidents like Knight Capital’s show that when the environment and process are poor, luck also plays a strong role. I have some empathy for the developers at Knight Capital.