The 2001 MIT paper that explains why your SRE team is drowning

What happened

A PDF from 2001 just clocked 356 points on Hacker News. The paper — Nelson Repenning and John Sterman's *Nobody Ever Gets Credit for Fixing Problems That Never Happened*, published in the California Management Review — is a system dynamics study of why well-meaning process improvement programs collapse inside well-run organizations. It is, at its surface, a paper about manufacturing. Underneath, it is the most accurate description of modern software operations anyone has ever written.

The paper introduces the capability trap: a self-reinforcing loop where short-term pressure to ship causes a cut in 'work to improve the work,' which causes more defects, which causes more firefighting, which leaves even less time to improve the work. Repenning and Sterman modeled it with differential equations against two real industrial case studies. The math is unkind. Once an organization enters the trap, the rational individual decision at every step — push harder, ship faster, defer the cleanup — drives the system deeper in.
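The loop can be sketched as a toy weekly simulation. This is not the paper's model: the function name, the 40-hour capacity, the shipping target, and every rate constant below are illustrative assumptions chosen only to make the feedback visible.

```python
def simulate_trap(weeks=40, capacity=40.0, target=30.0):
    """Toy weekly model of the capability trap. All constants are assumptions."""
    defect_load = 0.20        # share of capacity lost to firefighting
    improvement_share = 0.25  # share of free time spent on "work to improve the work"
    rows = []
    for week in range(weeks):
        firefighting = capacity * defect_load
        free = capacity - firefighting
        improvement = free * improvement_share
        shipped = free - improvement
        rows.append((week, shipped, improvement, firefighting))
        # Short-term pressure: missing the shipping target cuts improvement time.
        if shipped < target:
            improvement_share = max(0.0, improvement_share - 0.05)
        # The process erodes by default; improvement hours push the defect load down.
        defect_load = min(0.9, max(0.05, defect_load + 0.01 - 0.001 * improvement))
    return rows


if __name__ == "__main__":
    for week, shipped, improvement, firefighting in simulate_trap()[::5]:
        print(f"week {week:2d}  shipped {shipped:5.1f}  "
              f"improvement {improvement:4.1f}  firefighting {firefighting:5.1f}")
```

Under these assumed constants, each missed target removes improvement time, shipped hours briefly clear the target once improvement has been cut to zero, and firefighting then grows every week until output ends below where it started.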

The HN thread is a parade of engineers recognizing their own org charts. The top comments aren't about manufacturing. They're about on-call rotations, security teams whose budgets get cut because there hasn't been a breach, the SRE who got passed over while the engineer who 'saved the launch' got promoted, the platform team disbanded for not shipping customer-visible features.

Why it matters

The paper's central insight is an attribution problem dressed as a management one. Prevention is invisible by construction: the incident that didn't happen has no Jira ticket, no postmortem, no Slack thread, no exec-visible save. The engineer who spent three weeks on a deploy pipeline that eliminated a whole class of outage looks, on the quarterly review, less productive than the one who heroically debugged a prod fire at 3am that the pipeline would have prevented.

Repenning and Sterman call this worse-before-better: real improvement work makes things *look* worse before they get better, because you're spending capacity on infrastructure instead of output. Shortcuts have the opposite signature, better-before-worse, which is why they keep winning the argument. Most management interventions die in that dip. The team that finally gets headcount to pay down test debt watches throughput drop for a quarter, gets reorged, and the debt comes back compounded.

The paper's mathematical result is that there is no stable equilibrium between firefighting and improvement — the system either tips into a virtuous cycle of compounding capability, or a vicious one of compounding fragility. Middle ground is unstable. This matches what every senior engineer has watched happen to teams over a 5-year window. There's no such thing as a team that's 'managing tech debt at a steady state.' It's either getting better or it's quietly getting worse, and the latter looks identical to the former on a sprint burndown until the day it doesn't.
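The unstable middle ground can be illustrated with a small sketch, assuming a team that spends a fixed share of its free time on improvement. The function name and all constants are hypothetical. They are picked so that a firefighting load of exactly 20% is the balance point, which lets the two runs start just either side of it.

```python
def final_defect_load(start, share=0.3125, weeks=260, capacity=40.0):
    """Toy model: defect load after `weeks`, given a fixed improvement share.

    All constants are assumptions. With these values the balance point is a
    defect load of 0.20, where erosion (0.02 per week) equals the reduction
    bought by improvement hours (0.002 * 40 * 0.8 * 0.3125 = 0.02).
    """
    d = start
    for _ in range(weeks):
        improvement = capacity * (1.0 - d) * share
        # Erosion pushes the load up; improvement hours push it down.
        d = min(0.9, max(0.05, d + 0.02 - 0.002 * improvement))
    return d


if __name__ == "__main__":
    for start in (0.18, 0.22):
        print(f"start {start:.2f} -> after 5 years {final_defect_load(start):.2f}")
```

In this sketch the two teams differ by four percentage points of firefighting load at the start and follow the same policy, yet one settles at the assumed floor and the other at the assumed ceiling. Any small deviation from the balance point grows each week, so the balance point itself is never held.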

The 2026 mapping is too tight to ignore. The dominant incentive structures in software still reward visible saves over invisible prevention. Promotion packets get written around incident response, not incident absence. AI coding tools have made the asymmetry worse: a developer using Cursor or Claude Code to ship features can outproduce, by an order of magnitude, a developer maintaining the test suite, monitoring, and deployment infrastructure that make shipping safe. The capability trap accelerates when raw output gets cheaper and the boring work that makes output safe doesn't.

The security parallel is sharpest. Security teams are routinely funded as if the absence of breaches were evidence the team isn't needed, rather than evidence the team is working. Repenning and Sterman wrote the model for this exact dynamic 25 years ago. The model predicts the same outcome every time: cut prevention, congratulate the cut as a productivity win, eat the breach 18 months later, blame the security team for not seeing it coming, hire a CISO, repeat.

What this means for your stack

Three concrete moves come out of the paper, all of them uncomfortable.

First, measure the invisible. Repenning and Sterman are emphatic that the capability trap is sustained by an attribution failure, not a knowledge failure. People know prevention matters. They just can't see it in the metrics. Teams that escape the trap make prevention legible: 'incidents prevented by this control' as a tracked number, error budgets that get spent (an SLO sitting at 100% attainment suggests you're over-investing in reliability), counterfactual postmortems that explicitly name the work that *would have* prevented the outage. None of this is new (Google's SRE book reinvented half of it), but the source paper makes the *why* mathematical, not cultural.
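The error-budget arithmetic is simple enough to sketch. The function names and the 30-day window are assumptions made for illustration, and downtime in minutes stands in for whatever unit your SLO actually counts.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Unspent error budget in minutes; a negative value means the SLO was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 1.0):
        print(f"SLO {slo:.3%}: {error_budget_minutes(slo):.1f} minutes per 30 days")
```

A 99.9% target leaves roughly 43 minutes per 30 days to spend on risky deploys and migrations. A target of 100% leaves zero, which is why a budget that never gets spent is itself a signal worth tracking.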

Second, defend the dip. The paper shows that improvement programs predictably make things look worse for one to three reporting periods before compounding. If your team can't survive a quarter of visibly reduced output, you cannot escape the capability trap, period. This is mostly an executive-air-cover problem, not an engineering problem. Get the dip funded, in writing, before the work starts. The engineers who succeed at platform migrations, test infrastructure rewrites, and deprecations are almost never the technically best ones — they're the ones with a VP who didn't blink during the trough.
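One way to get the dip funded in writing is to state its break-even point up front. The sketch below is a deliberate simplification, not the paper's model: it assumes a flat loss of output during the dip and a flat gain afterwards, with no compounding, and the function name and all figures are hypothetical.

```python
def break_even_period(dip, dip_periods, gain, horizon=12):
    """First period (1-based) where cumulative output catches up with doing nothing.

    dip:         output lost per period while the improvement work runs
    dip_periods: number of periods the dip lasts
    gain:        extra output per period once the work lands (assumed constant)
    Returns None if the program has not paid back within `horizon` periods.
    """
    cumulative_gap = 0
    for period in range(1, horizon + 1):
        cumulative_gap += -dip if period <= dip_periods else gain
        if cumulative_gap >= 0:
            return period
    return None


if __name__ == "__main__":
    # Hypothetical: lose 20 story points a quarter for 2 quarters, then gain 10.
    print(break_even_period(dip=20, dip_periods=2, gain=10))
```

With those made-up numbers the program pays back in the sixth quarter, so air cover that expires after two quarters ends the work at the point of maximum visible loss.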

Third, reward absence. The hardest move. Promote the engineer whose service had zero incidents this quarter. Give the bonus to the team that retired a system. Name the deprecation in the all-hands. This is unnatural — you are rewarding people for something that didn't happen — but the paper's math is unambiguous about what happens when you don't. The capability trap is a closed-loop control system. If the gain on prevention is zero, the system has a single attractor, and it is on fire.

Looking ahead

The reason this 2001 paper is back on the front page in 2026 is that the software industry is currently running the largest natural experiment in the capability trap that has ever existed. AI-assisted output is up, prevention work is flat or down, attribution is more lopsided than ever (the AI 'wrote' the feature; the human still has to maintain it). The paper's model predicts what happens next, and it doesn't require a forecast — just patience. Read it. Then go look at your last six promotion packets and count how many were heroics versus how many were quiet, boring, structural work that made heroics unnecessary. That ratio is your team's trajectory.
