October 27, 2013
Empathy
Tags: Memories, Technical, Finance
Last year Knight Capital released a bug into one of their automated
equity trading systems, resulting in the company losing $460 million in
under an hour. As someone who has written automated trading systems and
had them used in anger, this definitely caught my attention at the time,
although without further information it was easy to dismiss its
relevance. Now, as the result of an SEC investigation, details of the
incident have been published online (with the most salient details
extracted here).
I was always incredibly nervous when any of my “dangerous” code went
into production. This fear meant the code was reasonably well-tested (by
me). I was always part of the team releasing it and supporting it
early the next morning. Over 7 years, no one realised a loss due to a bug in
my code. There may have been some lost opportunities, but no actual
loss. Still, so many of the failings at Knight Capital sound familiar.
It wouldn’t have been impossible for this situation to occur at one of
the places I worked.
Below are my experiences related to the failings identified:
- There is a clear impression of time pressure. The new system had to
be released by a fixed date to match a change in an external market.
No doubt management were pushing staff to just get it done. In
front-office teams there were usually more tasks than people
available to complete them, so the most common project size was one
person (meaning there was no one to check your work). Rarely was there
time to go back and fix things. It was assumed that the work would be
done fast and right the first time. Mistakes did occur.
- A flag on the messaging system was repurposed. Why not just
create a new one to ensure no interaction with systems using the old
flag? I have seen this done many times and I once did it myself. No
one wanted to do it, but it saved time. Messaging infrastructure was
always handled by a separate back-office team, and at one bank they
only updated message fields every few months. If a coder needed a
new field before this update they would have to reuse an old one;
there was little other option. (There is a sketch of the hazard after
this list.)
- An old system lay unused for many years. All the banks had
numerous little programs listening to the message bus and working
away. All the places I worked were very keen to shut down any unused
systems to free up computing resources. However, it is easy to
imagine one (or a bit of unused functionality in an otherwise used
system) slipping through the cull. With the average team member only
hanging around for a few years, knowledge of these old systems was
soon lost. Worse, sometimes other teams, without our knowledge,
created systems relying on our messages or databases (we couldn’t
restrict read-only access). We only discovered them when they phoned
to complain if we changed something (and we told them where to go).
- The new system was not deployed to one of the production
servers. Clearly the release process should be run from a script
that checks the result is correct and consistent (a sketch of such a
check follows this list). One bank had a manual release process.
Developers would wait until all the traders had gone home and then
manually copy and update files. At least a shared drive was used by
all the servers, so only one copy was required. Still, it was error
prone (releases regularly failed) and hopefully they have changed it
since. Small improvements were made while I was there, but no one had
the time to completely fix it.
- When errors started occurring, 97 emails were sent detailing the
problem - all ignored. I’d be interested to know how many other
emails were sent. There were always floods of spurious error
messages caused by people logging problems at the wrong level or
just covering their behind (“but it sent an error email - someone
should have done something”). At two banks I got over a hundred
“error” messages per day from various systems. At first it really
caused concern, but others said they could be safely ignored - so
they were ignored. It drove support people crazy, so there were
constant attempts to fix it, with varying degrees of success. (See
the logging sketch after this list.)
- The issue was compounded by removing the new system from the
production servers. This is so understandable in the banking IT
culture. If there is a problem just after a release the assumption
is always that the release caused it. Thus the release should be
“rolled-back”, returning the environment to the pre-release state.
Trading can then continue with the old version while IT can examine
the cause of the problem with less immediate concern. The pressure
to roll back can be intense, even if people are certain the problem
is not in the release - you need to be very sure of yourself. It is
like the old saying “no one ever lost their job buying IBM”, only
not quite in this case.
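To make the flag-reuse hazard concrete, here is a minimal sketch.
Everything in it is hypothetical - the field number, the consumer names
and the behaviour are mine for illustration, not details from the SEC
report - but it shows how a repurposed field lets a forgotten system
fire on messages that were never meant for it.

```python
# Hypothetical sketch of a repurposed message-bus field. None of these
# names or values come from the Knight Capital report.
LEGACY_ALGO_FLAG = 9001   # old meaning, still honoured by a forgotten consumer
NEW_ROUTER_FLAG = 9001    # new meaning, deliberately reusing the same field

def old_consumer(message: dict) -> None:
    # Forgotten system: still interprets the field the old way.
    if message.get(LEGACY_ALGO_FLAG) == "Y":
        print("starting retired legacy algo")   # behaviour no one expects any more

def new_consumer(message: dict) -> None:
    # New system: same field, entirely different intent.
    if message.get(NEW_ROUTER_FLAG) == "Y":
        print("routing order via new system")

order = {NEW_ROUTER_FLAG: "Y"}   # the sender only has the new meaning in mind
new_consumer(order)              # intended effect
old_consumer(order)              # unintended effect: the old code path also fires
```

A freshly created field would have left the old consumer silent;
reusing the number quietly couples two systems that nobody thinks of as
related.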
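On the release point, something like the following sketch illustrates
the kind of scripted check I mean. The hostnames, path and use of ssh
are assumptions for illustration, not a description of any bank’s
actual setup; the idea is simply that a script, not a person, confirms
every production server is running the same build before trading
starts.

```python
#!/usr/bin/env python3
# Hypothetical post-release consistency check. Hostnames and paths are
# placeholders, not details of any real environment.
import subprocess
import sys

SERVERS = ["prod-trade-01", "prod-trade-02", "prod-trade-03"]   # assumed hosts
RELEASE_PATH = "/opt/trading/current"                           # assumed install path

def fingerprint(host: str) -> str:
    # Hash every deployed file, then hash the sorted list, so a missing or
    # stale file on any one host changes that host's fingerprint.
    cmd = f"find {RELEASE_PATH} -type f -exec md5sum {{}} + | sort | md5sum"
    return subprocess.check_output(["ssh", host, cmd], text=True).split()[0]

def main() -> int:
    sums = {host: fingerprint(host) for host in SERVERS}
    if len(set(sums.values())) != 1:
        print("RELEASE INCONSISTENT:", sums)
        return 1     # fail loudly rather than trade on a mixed estate
    print("all servers match:", next(iter(sums.values())))
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

It is not sophisticated, but even this would catch a server that was
simply missed during a manual copy.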
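And to illustrate what logging at the right level looks like, here is a
rough sketch using Python’s standard logging module: only genuine
errors reach a human’s inbox, while the “cover yourself” noise stays in
the log. The mail host and addresses are placeholders, and this is only
an illustration, not how any of those banks were configured.

```python
# Hypothetical sketch: email for ERROR and above, everything else stays
# in the log. Host and addresses are placeholders.
import logging
import logging.handlers

logger = logging.getLogger("trading.support")
logger.setLevel(logging.INFO)

console = logging.StreamHandler()       # everything still goes to the log
console.setLevel(logging.INFO)
logger.addHandler(console)

email = logging.handlers.SMTPHandler(
    mailhost="smtp.example.internal",
    fromaddr="alerts@example.internal",
    toaddrs=["support@example.internal"],
    subject="Trading system error",
)
email.setLevel(logging.ERROR)           # only real errors interrupt a person
logger.addHandler(email)

logger.warning("order book snapshot was 2 seconds late")    # logged, no email
logger.error("order router rejected every child order")     # logged and emailed
```

The hard part was never the configuration; it was getting people to
agree on what genuinely counted as an error.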
I found front-office development strongly path-dependent, as we rarely
had time to fix or improve old non-bug issues in code. We were always
moving forward, relying on good staff (rather than process or
automation) to avoid or ameliorate problems. Over time things improved -
every bank had a better environment than the previous one. I never had a
major issue. It would be nice to think this was the result of ability and
competence, although reports on incidents like Knight Capital’s show
that when the environment and process are poor, luck also plays a strong
role. I have some empathy for the developers at Knight Capital.