Reliability · From the Foundry

What aviation knows that software forgot

May 21, 2026 · 6 min read · Leo Dias


How a powerless 767 and a forgotten switch explain good engineering

My ten-year-old is obsessed with aviation. He often tells me which airlines fly which aircraft out of which airports, and describes the engine layouts for commercial jets I’ve never heard of. I say this so you’ll give me some leeway with an overused analogy (aviation reliability and parallels in other industries).

One story I keep circling back to is the Gimli Glider. Probably because it’s more than an aviation story. It’s a story about how you build anything that has to be trusted.

On July 23, 1983, an Air Canada Boeing 767 ran out of fuel at 41,000 feet over Red Lake, ON. Both engines quit. The cockpit screens went dark; most of the instruments died with the generators. What was, minutes earlier, one of the most advanced airliners in the world became a 130-tonne glider with sixty-nine people inside it.

It landed.

The captain, Bob Pearson, flew gliders as a hobby, and he brought the powerless jet down onto a decommissioned airbase at Gimli, MB — a runway his first officer, Maurice Quintal, happened to know from his Royal Canadian Air Force days, and which had since been turned into a drag strip, with cars on it that afternoon. Nobody died.

Ten minor injuries, all from the evacuation slides. The aircraft eventually flew again.

We remember it as a story about two unflappable pilots. It’s really a story about layers. Every system that was supposed to keep that plane fuelled had failed — and behind those failed layers was another one: two people trained, drilled, and equipped to fly an airliner with no engines onto a runway full of cars. Aviation builds like that on purpose. It’s unfortunate that software, mostly, does not.

Safety on purpose

Commercial aviation is arguably the safest complex system human beings operate. It did not get there by hiring better people than the software industry. It got there by assuming that people will fail (skilled, rested, well-meaning people, on an ordinary Tuesday) and building so that a single failure rarely reaches the ground. The discipline isn’t there to flatter anyone. It’s there because someone, somewhere in the system, is going to have a bad day, and the structure is built so that bad day does not become the last link in a chain.

Software keeps rediscovering these same lessons, one outage at a time, and keeps filing them under “overhead.”

The checklist

On October 30, 1935, at Wright Field in Ohio, Boeing flew the prototype of what would become the B-17 — then just the Model 299. It was the most sophisticated aircraft anyone had built. It lifted off, climbed a few hundred feet, stalled, and crashed, killing the pilot and Boeing’s chief test pilot. The cause was almost insultingly small: someone had left a control lock engaged — a pin that keeps the elevator from flapping in the wind while the plane is parked — so the aircraft couldn’t be steered once it was airborne. Two of the best test pilots in the world, defeated by a switch.

The press concluded the plane was “too much airplane for one man to fly.” The investigators concluded something more useful: it wasn’t too much to fly, it was too much to remember. The fix wasn’t a better pilot. It was a piece of paper: the pre-flight checklist. Boeing went on to build thousands of B-17s.

A checklist makes sense, but the concept is often resisted. On its face, a checklist is an insult to your competence: it asks you to confirm things you obviously know. But the checklist was never for the flight where the crew is sharp and unhurried. It’s for the flight where they’re tired, running late, distracted by something on the ground; the one where memory quietly skips item nine but is totally sure it didn’t.

Software has the same tool under different names: runbooks, deployment checklists, change controls, the boring template at the top of the incident doc. And it routinely treats them as bureaucracy — friction for people not good enough to keep it in their heads. That gets it exactly backwards. The checklist exists because good people forget, and they forget most reliably when the stakes and the pressure are highest. The discipline is boring on every day except the one where it saves you, and you don’t get to know in advance which day that is.
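For the software version of that piece of paper, here is a minimal sketch of a pre-deploy checklist expressed as code. Everything in it is hypothetical: the `ChecklistItem` type, the `run_checklist` helper, and the three items are illustrations, and the lambdas stand in for checks that would query real systems.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    description: str
    check: Callable[[], bool]  # returns True when the item is confirmed


def run_checklist(items: List[ChecklistItem]) -> List[str]:
    """Run every item in order and return the descriptions that failed."""
    failures = []
    for item in items:
        if not item.check():
            failures.append(item.description)
    return failures


# Hypothetical items; the hard-coded results stand in for real checks.
pre_deploy = [
    ChecklistItem("Rollback plan written down", lambda: True),
    ChecklistItem("Alerts for this service are unmuted", lambda: False),
    ChecklistItem("Runbook reviewed in the last 90 days", lambda: True),
]

failed = run_checklist(pre_deploy)
ok_to_deploy = not failed  # deploy only when nothing failed
```

The sketch runs every item instead of stopping at the first failure, so a tired engineer sees the whole list of what was skipped, not just the first thing.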

When the holes line up

The other idea aviation gave us is the best mental model of failure I know. The psychologist James Reason called it the “swiss cheese” model: a system’s defences are like slices of cheese stacked together. Every slice has holes — no layer is perfect. Most of the time the holes don’t line up, and a problem that slips through one is caught by the next. Disaster happens only on the rare occasion that the holes in every slice align, and something falls clean through.
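A rough way to see why stacked layers help is to treat each slice as having some chance of missing a problem and multiply. The sketch below assumes the layers fail independently, which real defences rarely do, and the 10% miss rate is an invented figure for illustration, not a measurement.

```python
from math import prod


def p_all_layers_miss(miss_probabilities):
    """Chance a problem passes every layer, assuming independent layers."""
    return prod(miss_probabilities)


# Invented figure: each layer misses 1 problem in 10.
four_layers = [0.1, 0.1, 0.1, 0.1]
three_layers = four_layers[:-1]  # the same stack with one slice removed

p_four = p_all_layers_miss(four_layers)    # about 1 in 10,000
p_three = p_all_layers_miss(three_layers)  # about 1 in 1,000
```

Under these assumptions, removing one imperfect layer makes a clean fall-through about ten times more likely, even though that layer was wrong one time in ten.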

That 767 is a textbook stack of aligned holes. The fuel gauge system was unserviceable, so the crew measured fuel by hand. The hand calculation used a conversion factor of 1.77 — pounds per litre — instead of 0.8 kilograms per litre, because Air Canada was mid-switch to metric and this was its first metric aircraft. Nobody in the chain had the experience to feel that the number was wrong. Pull any one of those slices out of the stack and it’s a forgettable flight. It took all of them, lined up, to empty the tanks at altitude.
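The arithmetic of that conversion error is worth working through. The sketch below uses the two factors quoted above; the required fuel mass and the hand-measured volume are illustrative round numbers, not the flight's recorded figures, and `litres_to_add` is a hypothetical helper.

```python
# Conversion factors quoted in the essay.
LB_PER_LITRE = 1.77  # pounds per litre (the factor that was used)
KG_PER_LITRE = 0.8   # kilograms per litre (the factor that was needed)


def litres_to_add(required_kg, measured_litres, factor):
    """Litres to load, treating `factor` as if it were kg per litre."""
    believed_on_board = measured_litres * factor
    return (required_kg - believed_on_board) / factor


# Illustrative inputs only, not the flight's recorded figures.
REQUIRED_KG = 22_000.0
MEASURED_LITRES = 7_500.0

loaded_wrong = litres_to_add(REQUIRED_KG, MEASURED_LITRES, LB_PER_LITRE)
loaded_right = litres_to_add(REQUIRED_KG, MEASURED_LITRES, KG_PER_LITRE)

# Mass actually on board after loading the wrongly computed volume.
actual_kg = (MEASURED_LITRES + loaded_wrong) * KG_PER_LITRE
fraction_of_required = actual_kg / REQUIRED_KG
```

Whatever inputs you pick, the fraction works out to 0.8 / 1.77, roughly 45%: the tanks end up holding less than half the fuel mass the paperwork says they do, and every number on the page still looks plausible.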

Software incidents work the same way. We write “root cause: the software bug” because we want a single thing to point at — one bug, one throat to choke. But it’s almost never one thing. It’s the bug, and the alert that was muted three weeks ago, and the runbook that went stale, and the one engineer who understood the system having left in March. The honest post-mortem doesn’t hunt for the villain. It maps the cheese, and asks which slices we can make whole.

This is also why aviation investigates without blame. If failure is systemic — say, if it took five aligned holes — then firing whoever happened to be holding the last slice teaches you nothing and removes a person who now understands the failure better than anyone. The point of looking is to thicken the layers, not to find someone to punish for being there when they aligned.

Where the comparison breaks

Let’s be honest about the limits of this analogy, because the lazy version of this essay is “software should just be more like aviation,” and it can’t be, entirely.

Aviation’s safety isn’t only culture. It’s structure that software doesn’t have and largely can’t manufacture. There is one investigator with statutory teeth. Incident reporting is mandatory and protected. A single authority can ground an entire fleet overnight. Airframes change over decades. Software ships to production on a Tuesday afternoon and ships again on Wednesday morning, and there is no regulator standing on the runway who can stop the deploy. “Blameless post-mortem” is printed in every engineering handbook and lived out very unevenly.

And some of the gap is legitimate. A crashed web app does not kill two hundred people; the right amount of process genuinely scales with the stakes, and a payments ledger or a hospital system earns more rigour than a marketing site. Importing aviation’s full machinery into a three-person startup would be its own kind of malpractice.

The part that’s free

But notice what the two borrowed lessons actually cost. The checklist costs you a few seconds of reading something you already know. The swiss-cheese view costs you the satisfying story with one clear villain. Neither requires a national safety board or the power to ground a fleet. They aren’t regulations. They’re a posture — and the posture is available to anyone willing to adopt it before anything goes wrong.

That posture is a decision: to treat your own competence as the single thing most likely to fail you, and to build the quiet layer behind it anyway — the one you will resent maintaining on all the days you don’t need it.

The crew of that 767 didn’t glide it onto a drag strip because they were heroes. They did it because someone, decades before, decided that “we’ll remember” wasn’t good enough — and then kept deciding it, layer after layer, all the way down, until the afternoon every other layer failed and the last one held. Good software, in the places where it actually matters, is the same bet: that the boring layer you will never need is the one that saves you on the day you finally do.