It’s Tuesday, June 4th, 1996, and the European Space Agency is set to launch its new Ariane 5 rocket for the first time. This is the culmination of a decade of design and testing, and a budget running into the billions of euros.
The goal of Ariane 5 is simple, but the stakes are high. It was designed to carry large, expensive payloads, both for scientific experiments and commercial purposes.
The rocket carried no astronauts. The first payload, the Cluster spacecraft, consisted of four very expensive scientific satellites weighing 2,600 lbs each, to be delivered into an elliptical orbit.
Just 40 seconds after take-off, however, huge chunks of metal and burning fragments of Ariane Flight 501 are crashing down over the launch area. A shocking disaster for the ESA and a rough setback for the mission.
The cause? A simple and very much avoidable coding bug in a piece of dead code left over from the earlier Ariane 4 program, which had started nearly a decade before. Here’s exactly what happened.
Ariane Flight 501 leaves the launchpad and accelerates smoothly along its predetermined path towards space. Inside, the guidance system is constantly tracking the rocket’s trajectory and sending data over to the main on-board computer. As part of this, the guidance system converts the velocity readings from 64-bit floating point to 16-bit signed integers.
Okay, let’s take a moment and think about what this actually means. With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you use the first bit to store a sign (positive/negative), only 15 bits are left for the actual number, and your 16-bit signed integer now covers everything from -32,768 to +32,767. Anything outside that range and you’ve run out of bits.
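To make those ranges concrete, here is a minimal Python sketch that derives them from the bit width alone. The function names are made up for this post and are not from any flight software.

```python
def unsigned_range(bits):
    """Smallest and largest value an unsigned integer of this width can hold."""
    return 0, 2**bits - 1


def signed_range(bits):
    """Same for a two's-complement signed integer: one bit goes to the sign."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


print(unsigned_range(16))  # (0, 65535)
print(signed_range(16))    # (-32768, 32767)
```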
On the other hand, floating point numbers are stored a bit differently. They were designed to cover a far wider range of numbers by trading away some precision. For example, even a 32-bit (“single precision”) float can hold magnitudes up to about 3.4e+38. Now, try storing a large one in a 16-bit signed integer, and the number is very much out of the bounds of the signed integer. Go further and try a 64-bit (“double precision”) float, which can hold magnitudes from about 2.2e-308 up to about 1.8e+308, and the situation is made worse.
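If you want to see the size of that mismatch for yourself, the short Python sketch below prints the limits of a 64-bit double and a 32-bit float next to the 16-bit signed maximum. It assumes Python’s `float` is an IEEE 754 double, which is the case for CPython on mainstream platforms; the variable names are just for illustration.

```python
import struct
import sys

INT16_MAX = 2**15 - 1  # 32767

# Limits of a 64-bit double, as reported by the interpreter.
double_max = sys.float_info.max         # about 1.8e+308
double_min_normal = sys.float_info.min  # about 2.2e-308 (smallest positive normal)

# Largest finite 32-bit float, decoded from its bit pattern 0x7F7FFFFF.
single_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]  # about 3.4e+38

print(f"int16 max:  {INT16_MAX}")
print(f"float32 max: {single_max:.3e}")
print(f"float64 max: {double_max:.3e}")
print(f"float64 smallest positive normal: {double_min_normal:.3e}")
```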
So what happened when the inevitable came true? Well, in this case, the floating point value was simply too big to fit in the 16-bit signed integer, so the conversion failed with the very familiar integer overflow. Right, so back to the rocket story.
The guidance system reads the horizontal velocity data of the rocket (a 64-bit floating point number) and tries to convert it to a 16-bit signed integer to send over to the main computer.
However, the reading is larger than the biggest possible 16-bit signed integer, so the conversion fails. Usually, a well-designed system would have a procedure built in to handle an overflow error and send a sensible message to the main computer. This, however, wasn’t one of those cases.
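What might such a procedure look like? Here is a rough Python sketch of two common options: a checked conversion that reports the overflow explicitly, and a saturating one that clamps to the nearest representable value. This is not the Ariane flight code, and the function names and the choice of policies are assumptions made purely for illustration.

```python
import math

INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1


def to_int16_checked(value):
    """Truncate a float to a 16-bit signed integer, raising OverflowError if it does not fit."""
    if math.isnan(value) or math.isinf(value):
        raise OverflowError(f"{value!r} has no 16-bit integer representation")
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} is outside [{INT16_MIN}, {INT16_MAX}]")
    return truncated


def to_int16_saturating(value):
    """Clamp a float into the 16-bit signed range instead of failing."""
    if math.isnan(value):
        raise ValueError("cannot saturate NaN")
    clamped = max(float(INT16_MIN), min(float(INT16_MAX), value))
    return int(clamped)


print(to_int16_checked(1234.9))      # 1234
print(to_int16_saturating(40000.0))  # 32767
```

Either way, the caller gets to decide what an out-of-range reading means, instead of the whole unit giving up.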
The guidance system proceeds to send an error message instead of the horizontal velocity value and shuts down immediately.
But wait a minute, there might be a saving chance! The system is designed to have a backup, standby system, which, unfortunately, runs the exact same code. It tries the same conversion, gets the same error and shuts down too. The two units fail within about 72 milliseconds of each other.
Because nothing tells it otherwise, the main computer interprets the error message as real navigation data and takes it as an indication that the rocket is wildly off-course. In an attempt to correct a nonexistent deviation, it commands full nozzle deflections of the boosters, putting immense aerodynamic strain on the rocket, which immediately starts tearing apart.
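One way to keep a consumer from mistaking a diagnostic for a measurement is to make the difference explicit in the message itself. The Python sketch below is a hypothetical illustration of that idea only: the `VelocityMessage` type, its fields and the "hold the last good value" fallback are all assumptions for this post, not a description of how the Ariane computers actually exchanged data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VelocityMessage:
    """Hypothetical message: either a valid reading or an error code, never both."""
    valid: bool
    horizontal_velocity: Optional[int] = None
    error_code: Optional[int] = None


def steering_input(msg, last_good):
    """Return the velocity to steer by, ignoring anything flagged as a diagnostic."""
    if msg.valid and msg.horizontal_velocity is not None:
        return msg.horizontal_velocity
    # Diagnostic message: do not treat its contents as flight data.
    return last_good


reading = VelocityMessage(valid=True, horizontal_velocity=1200)
failure = VelocityMessage(valid=False, error_code=0x7FFF)

print(steering_input(reading, last_good=1100))  # 1200
print(steering_input(failure, last_good=1200))  # 1200
```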
Registering that things couldn’t be worse anyway, the computer decides to trigger the self-destruct mechanism and puts on a fireworks display (worth some €500 million at the time).
So what was the ultimate cause of this very short, very expensive and catastrophic flight? A line of code converting a 64-bit floating point number to a 16-bit signed integer, which overflowed and produced an error message that was passed directly to the main computer, which interpreted it as real data.
The same software was designed and used successfully on many flights previously, on the Ariane 4 rockets, which were much smaller. But the new Ariane 5 model was designed to fly faster than the systems engineers had planned when that code was originally created.
That same higher velocity led to the overflow error, which could have been caught easily. But it wasn’t.
The worst part? The code wasn’t even necessary after takeoff. It was only part of the launch pad alignment process. But sometimes a trivial glitch might delay a launch by a few seconds and, to avoid having to reset the whole system, the original software engineers decided that the sequence of code should keep running for an extra… 40 seconds after the scheduled liftoff.
This is the first post in our series on famous bugs! At Jam, we help people catch bugs before they crash rocket ships, or worse, production. Hope you'll give Jam a try!