Colin Percival's 20 Years on AWS: Own Everything, Blame Nothing

What happened

Colin Percival published a retrospective marking his twentieth year running infrastructure on Amazon Web Services. Percival is not a typical AWS customer. He built Tarsnap — an encrypted, deduplicated online backup service — on top of S3 and EC2 starting around 2006, when AWS was a sideshow that most enterprises dismissed as a toy for startups. He is also the creator of the scrypt key derivation function, a former FreeBSD Security Officer, and one of the more quietly influential systems engineers of the last two decades.

The phrase in the title, "Never Not My Job," captures the operating philosophy that has kept Tarsnap running on AWS for 20 years as essentially a one-person operation. When something breaks in the infrastructure underneath you, treating it as someone else's problem is a luxury solo operators cannot afford. Percival's track record backs this up: over the years he has discovered and publicly documented EC2 performance regressions, S3 consistency anomalies, and pricing quirks that larger organizations either missed entirely or took months to notice.

Why it matters

The cloud industry has spent the last decade selling abstraction. Serverless, managed services, platform engineering teams — the entire trajectory has been toward insulating application developers from infrastructure concerns. Percival's two-decade experience runs directly counter to that narrative.

His argument isn't that managed services are bad. It's that the mental model of "not my problem" creates a fragility that compounds over time. When you don't understand the layer beneath your abstraction, you can't diagnose intermittent failures, you can't optimize costs at the margin, and you can't make informed architectural decisions about what to build versus what to rent. Percival has been making these observations since before "cloud-native" was a term, and the industry keeps rediscovering them.

Consider the practical implications. Percival has historically found and reported AWS bugs that affected his service — then built workarounds before AWS acknowledged the issue. This isn't heroic; it's table stakes for anyone running production infrastructure with an SLA to uphold. But the industry has largely moved in the opposite direction, toward operational models where the first response to an infrastructure issue is to file a support ticket and wait.

The Hacker News discussion (215 points as of publication) reflects a community that recognizes the tension. Senior engineers who have sat through multi-hour cloud outages, watching dashboards turn red while waiting for a provider's status page to update, viscerally understand Percival's position. Twenty years of keeping a service running on a hyperscaler platform, maintained by one person, is a stronger argument for deep ownership than any conference talk about DevOps culture.

There's also the economics angle that Percival has written about extensively over the years. Running on AWS since 2006 means he's navigated every pricing model change, every instance generation transition, every storage class introduction. The institutional knowledge required to keep a service cost-efficient across two decades of AWS evolution is non-trivial. Most organizations rely on FinOps teams or third-party cost optimization tools for this. Percival does it by understanding the platform deeply enough to make architectural decisions that account for pricing.

What this means for your stack

The practical takeaway is uncomfortable for anyone building on cloud platforms today: your cloud provider's reliability is your problem, not theirs. Their SLA gives you credits. Your users expect uptime. The gap between those two things is entirely your responsibility.
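
To make the gap between an SLA and user expectations concrete, it helps to convert an availability percentage into the downtime it permits. The sketch below is plain arithmetic; the percentages are illustrative round numbers, not the terms of any specific AWS SLA, and the function name is made up for this example.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime a service may have in `days` days
    while still meeting the given availability percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Illustrative targets only; check your provider's actual SLA terms.
for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% over 30 days permits about {minutes:.1f} minutes of downtime")
```

At 99.9 percent, a provider can be down for roughly 43 minutes in a 30-day month without owing anything, and whatever credit follows a larger breach does not restore your users' lost time.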

For solo developers and small teams, Percival's approach offers a template. You don't need a platform engineering team to run reliable infrastructure. You need deep understanding of the specific services you depend on, monitoring that catches anomalies before they become outages, and the willingness to dig into failures that "shouldn't" be your concern. This is the opposite of the move-fast-and-break-things ethos — it's move-carefully-and-understand-everything.
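
As one possible shape for "monitoring that catches anomalies before they become outages," here is a minimal sketch of a rolling-baseline check. It is not Percival's tooling; the class name, window size, and thresholds are assumptions chosen for illustration, and the measurement being fed in (say, request latency in milliseconds) is up to you.

```python
from collections import deque
from statistics import mean, stdev


class BaselineMonitor:
    """Flags measurements that sit far above a rolling baseline."""

    def __init__(self, window: int = 50, threshold_sigmas: float = 3.0,
                 min_samples: int = 10):
        self.samples = deque(maxlen=window)  # recent "normal" measurements
        self.threshold_sigmas = threshold_sigmas
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record a measurement; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            # Floor the spread at 5% of the mean so a very steady
            # baseline does not flag ordinary jitter.
            limit = mu + self.threshold_sigmas * max(sigma, 0.05 * mu)
            anomalous = value > limit
        if not anomalous:
            # Keep outliers out of the baseline so a sustained
            # regression keeps alerting instead of becoming "normal".
            self.samples.append(value)
        return anomalous


monitor = BaselineMonitor()
for latency_ms in [100, 102] * 10:   # steady traffic builds the baseline
    monitor.observe(latency_ms)
print(monitor.observe(101))   # within the baseline
print(monitor.observe(250))   # well above it
```

Excluding outliers from the baseline is a deliberate choice: it means a permanent shift will alert until someone looks at it and resets the monitor, which fits the idea of investigating rather than adapting silently.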

For larger organizations, the lesson is about what you lose when you abstract too aggressively. Every layer of indirection between your engineers and the infrastructure is a layer of diagnostic capability you've surrendered. That tradeoff is sometimes worth making, but it should be a conscious decision, not a default. If your team's incident response begins with "let's check the cloud provider's status page," you've already lost precious minutes.
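
An alternative to starting with the status page is to probe each layer yourself, from the bottom up, and see where things first break. The sketch below assumes a simple design in which each probe is a callable that raises on failure; the helper names are hypothetical, and the two example probes use only the standard `socket` module.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Probe = Callable[[], None]  # a probe returns normally on success, raises on failure


def dns_probe(host: str) -> Probe:
    """Probe that checks the hostname resolves."""
    def probe() -> None:
        socket.getaddrinfo(host, 443)
    return probe


def tcp_probe(host: str, port: int = 443, timeout: float = 3.0) -> Probe:
    """Probe that checks a TCP connection can be opened."""
    def probe() -> None:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    return probe


def first_failing_layer(
    probes: Sequence[Tuple[str, Probe]]
) -> Optional[Tuple[str, Exception]]:
    """Run probes in order (lowest layer first); return the first failure."""
    for name, probe in probes:
        try:
            probe()
        except Exception as exc:
            return name, exc
    return None


# Example wiring (the hostname is a placeholder, not a real dependency):
#   result = first_failing_layer([
#       ("dns", dns_probe("storage.example.com")),
#       ("tcp", tcp_probe("storage.example.com")),
#   ])
#   print(result or "all probed layers healthy")
```

The value is less in these particular probes than in having your own evidence about which layer failed before the provider says anything.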

The specific techniques matter less than the mindset: instrument everything, maintain the ability to diagnose at every layer, and never assume that someone else's system will behave the way the documentation promises. This is not cynicism — it's engineering.
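
A small illustration of that mindset, under stated assumptions: wrap calls to a dependency so every one is timed and logged, and verify a write by reading it back instead of trusting that it landed. The `InMemoryStore` class is a stand-in invented for this example, not a real storage client, and `put_and_verify` is a hypothetical helper.

```python
import functools
import hashlib
import logging
import time

log = logging.getLogger("infra")


def instrumented(fn):
    """Log the duration and outcome of every call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            log.exception("%s failed after %.3fs",
                          fn.__name__, time.monotonic() - start)
            raise
        log.info("%s ok in %.3fs", fn.__name__, time.monotonic() - start)
        return result
    return wrapper


class InMemoryStore:
    """Hypothetical stand-in for an object store with put/get."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


@instrumented
def put_and_verify(store, key: str, data: bytes) -> str:
    """Write `data`, read it back, and compare SHA-256 digests."""
    expected = hashlib.sha256(data).hexdigest()
    store.put(key, data)
    actual = hashlib.sha256(store.get(key)).hexdigest()
    if actual != expected:
        raise IOError(f"read-back mismatch for {key!r}")
    return expected


digest = put_and_verify(InMemoryStore(), "backup/0001", b"hello")
print(digest)
```

Reading back every write costs an extra request, so in practice you might sample it; the point is that the check exists and its results are recorded.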

Looking ahead

Percival's retrospective arrives at a moment when the cloud industry is layering even more abstraction — AI-generated infrastructure, natural-language provisioning, autonomous remediation. These tools will be genuinely useful. They will also tempt organizations to understand their infrastructure even less than they do today. The engineers who thrive in that environment will be the ones who, like Percival, treat every layer of the stack as their job. Twenty years from now, that will still be the differentiator.

// read from source

20 Years on AWS and Never Not My Job
