Colin Percival's 20 Years on AWS: Own Everything, Blame Nothing

├── "Deep infrastructure ownership beats abstraction for long-running solo operations"
│  └── Colin Percival (daemonology.net) → read

Percival argues that treating infrastructure problems as 'never not my job' is what enabled Tarsnap to run reliably for 20 years as a one-person operation. His hands-on approach let him discover EC2 performance regressions, S3 consistency anomalies, and pricing quirks that larger organizations missed entirely or took months to notice.

├── "The 'not my problem' mindset toward cloud infrastructure creates compounding fragility"
│  ├── Colin Percival (daemonology.net) → read

Percival contends that when developers don't understand the layer beneath their abstraction, they can't diagnose intermittent failures, optimize costs at the margin, or make informed build-vs-rent decisions. The industry's push toward serverless and managed services sells insulation from infrastructure concerns, but this creates brittleness that compounds over time.

│  └── top10.dev editorial (top10.dev) → read below

The editorial notes that the cloud industry has spent a decade selling abstraction through serverless and platform engineering teams, yet Percival's two-decade experience runs directly counter to that narrative. The industry keeps rediscovering the lessons Percival has been articulating since before 'cloud-native' was a term.

└── "Solo operators who monitor their own infrastructure catch problems that large organizations miss"
  └── Colin Percival (daemonology.net) → read

Over 20 years, Percival has publicly documented EC2 performance regressions, S3 consistency anomalies, and pricing quirks before larger organizations noticed them. His direct, unmediated relationship with the infrastructure — without layers of platform teams or abstraction — gives him visibility that enterprises with dedicated SRE teams lack.

What happened

Colin Percival published a retrospective marking his twentieth year running infrastructure on Amazon Web Services. Percival is not a typical AWS customer. He built Tarsnap — an encrypted, deduplicated online backup service — on top of S3 and EC2 starting around 2006, when AWS was a sideshow that most enterprises dismissed as a toy for startups. He is also the creator of the scrypt key derivation function, a former FreeBSD Security Officer, and one of the more quietly influential systems engineers of the last two decades.

The title, "20 Years on AWS and Never Not My Job," captures the operating philosophy that has kept Tarsnap running on AWS for 20 years as essentially a one-person operation. When something breaks in the infrastructure underneath you, treating it as someone else's problem is a luxury solo operators cannot afford. Percival's track record backs this up: over the years he has discovered and publicly documented EC2 performance regressions, S3 consistency anomalies, and pricing quirks that larger organizations either missed entirely or took months to notice.

Why it matters

The cloud industry has spent the last decade selling abstraction. Serverless, managed services, platform engineering teams — the entire trajectory has been toward insulating application developers from infrastructure concerns. Percival's two-decade experience runs directly counter to that narrative.

His argument isn't that managed services are bad. It's that the mental model of "not my problem" creates a fragility that compounds over time. When you don't understand the layer beneath your abstraction, you can't diagnose intermittent failures, you can't optimize costs at the margin, and you can't make informed architectural decisions about what to build versus what to rent. Percival has been making these observations since before "cloud-native" was a term, and the industry keeps rediscovering them.

Consider the practical implications. Percival has historically found and reported AWS bugs that affected his service — then built workarounds before AWS acknowledged the issue. This isn't heroic; it's table stakes for anyone running production infrastructure with an SLA to uphold. But the industry has largely moved in the opposite direction, toward operational models where the first response to an infrastructure issue is to file a support ticket and wait.

The Hacker News discussion (215 points as of publication) reflects a community that recognizes the tension. Senior engineers who've been through multi-hour cloud outages — watching dashboards turn red while waiting on a provider's status page to update — viscerally understand Percival's position. Twenty years of uptime on a hyperscaler platform, maintained by one person, is a stronger argument for deep ownership than any conference talk about DevOps culture.

There's also the economics angle that Percival has written about extensively over the years. Running on AWS since 2006 means he's navigated every pricing model change, every instance generation transition, every storage class introduction. The institutional knowledge required to keep a service cost-efficient across two decades of AWS evolution is non-trivial. Most organizations rely on FinOps teams or third-party cost optimization tools for this. Percival does it by understanding the platform deeply enough to make architectural decisions that account for pricing.

What this means for your stack

The practical takeaway is uncomfortable for anyone building on cloud platforms today: your cloud provider's reliability is your problem, not theirs. Their SLA gives you credits. Your users expect uptime. The gap between those two things is entirely your responsibility.
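To put rough numbers on that gap, here is a back-of-the-envelope sketch in Python. None of the figures come from Percival's post or from any actual AWS SLA: the availability targets, the 10% credit, the monthly bill, and the revenue-per-minute figure are all hypothetical, and the function names are invented for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime a provider can have in `days` and still meet the target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def uncovered_loss(monthly_bill: float, credit_pct: float,
                   downtime_minutes: float, revenue_per_minute: float) -> float:
    """Estimated loss from an outage minus the service credit received for it.

    All inputs are assumptions supplied by the caller, not published SLA terms.
    """
    credit = monthly_bill * credit_pct / 100
    loss = downtime_minutes * revenue_per_minute
    return loss - credit


if __name__ == "__main__":
    # Hypothetical targets: how much downtime still counts as "within SLA"?
    for target in (99.9, 99.99):
        print(f"{target}% over 30 days allows "
              f"{allowed_downtime_minutes(target):.2f} minutes of downtime")

    # Hypothetical outage: $1000 bill, 10% credit, 120 minutes down, $5/minute revenue.
    gap = uncovered_loss(monthly_bill=1000, credit_pct=10,
                         downtime_minutes=120, revenue_per_minute=5)
    print(f"Loss not covered by the credit: ${gap:.2f}")
```

Under these made-up inputs, a 99.9% monthly target still permits roughly 43 minutes of downtime, and the credit covers only a fraction of the estimated loss. The exact numbers will differ for every service; the shape of the gap is the point.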

For solo developers and small teams, Percival's approach offers a template. You don't need a platform engineering team to run reliable infrastructure. You need deep understanding of the specific services you depend on, monitoring that catches anomalies before they become outages, and the willingness to dig into failures that "shouldn't" be your concern. This is the opposite of the move-fast-and-break-things ethos — it's move-carefully-and-understand-everything.
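As one possible illustration of "monitoring that catches anomalies before they become outages," the sketch below flags latency samples that sit far above a rolling baseline. It is not Percival's tooling, which the retrospective summary here does not describe. The class name, window size, threshold, and the idea of feeding it request latencies (for example, timings of your own storage reads) are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


class LatencyMonitor:
    """Hypothetical rolling-baseline detector for slow requests."""

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 5, min_sigma_ms: float = 1.0):
        self.samples = deque(maxlen=window)   # recent "normal" latencies
        self.threshold = threshold            # standard deviations above the mean
        self.min_samples = min_samples        # baseline size needed before judging
        self.min_sigma_ms = min_sigma_ms      # floor so a flat baseline still works

    def observe(self, latency_ms: float) -> bool:
        """Record one sample; return True if it looks anomalously slow."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = list(self.samples)
            mu = mean(baseline)
            sigma = max(stdev(baseline), self.min_sigma_ms)
            anomalous = (latency_ms - mu) / sigma > self.threshold
        # Keep anomalies out of the baseline so one spike does not hide the next.
        if not anomalous:
            self.samples.append(latency_ms)
        return anomalous


if __name__ == "__main__":
    monitor = LatencyMonitor()
    # Made-up latencies in milliseconds; the 200 ms sample is the simulated spike.
    for sample in [20, 21, 19, 22, 20, 21, 200, 21]:
        if monitor.observe(sample):
            print(f"anomaly: {sample} ms is far above the recent baseline")
```

A mean-and-standard-deviation check is about the simplest detector there is, and real traffic usually calls for something sturdier, such as percentiles or separate baselines per endpoint. The design choice worth noting is that the measurements come from your own requests rather than from the provider's status page.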

For larger organizations, the lesson is about what you lose when you abstract too aggressively. Every layer of indirection between your engineers and the infrastructure is a layer of diagnostic capability you've surrendered. That tradeoff is sometimes worth making, but it should be a conscious decision, not a default. If your team's incident response begins with "let's check the cloud provider's status page," you've already lost precious minutes.

The specific techniques matter less than the mindset: instrument everything, maintain the ability to diagnose at every layer, and never assume that someone else's system will behave the way the documentation promises. This is not cynicism — it's engineering.

Looking ahead

Percival's retrospective arrives at a moment when the cloud industry is layering even more abstraction — AI-generated infrastructure, natural-language provisioning, autonomous remediation. These tools will be genuinely useful. They will also tempt organizations to understand their infrastructure even less than they do today. The engineers who thrive in that environment will be the ones who, like Percival, treat every layer of the stack as their job. Twenty years from now, that will still be the differentiator.

Hacker News 226 pts 55 comments

20 Years on AWS and Never Not My Job

gobdovan · Hacker News

The author calls it a 'joke' that Heroes are just unpaid Amazon employees, but reality doesn't become a joke just because it's funny. The asymmetry here is staggering. I find myself holding back private research because I don't want to provide free R&D for a value-extrac

MyUltiDev · Hacker News

A 20 year retrospective with no Hetzner or OVH numbers in sight is a bit of a tell. I run workloads across AWS, Hetzner, and a couple of smaller providers, and the gap is not subtle. For a small to medium web stack you are looking at roughly $350 a month on AWS versus 20 to 25 euros on Hetzner for s

CoryOndrejka · Hacker News

> in fact in one of Jeff Barr's AWS user meetups in Second Life

There's so much about that phrase that makes me smile. Easy to forget that Second Life was also one of the earliest users of AWS, S3 first. Jeff Bezos had personally invested in our 2005 round (a round that made Linden Lab a

anilgulecha · Hacker News

I understand people have a viewpoint here about not giving time to large behemoths. I'll counter with a story and perhaps a larger point.

Back in 2006/7 I had an idea for a project for which, in all enthusiasm, I setup a mailing list, but ended up never pursuing it. It's a very unique

few · Hacker News

> In April 2024 I confided in an Amazonian that I was "not really doing a good job of owning FreeBSD/EC2 right now" and asked if he could find some funding to support my work, on the theory that at a certain point time and dollars are fungible

> I received sponsorship from Amazon
