Kingsbury argues that the gap between what software claims to do and what it actually does is widening across the entire stack — from databases marketing consistency guarantees they don't deliver, to cloud SLAs that are aspirational rather than contractual, to AI coding tools generating unverifiable output. His decade of Jepsen testing revealed that vendors often knew their guarantees were broken and shipped anyway, and he sees this pattern now metastasizing industry-wide.
Surfaced on HN by pabs3, the post drew 431 points and 446 comments: a strong signal that practitioners feel this verification gap acutely. The response suggests widespread agreement that the problem is real and that Jepsen-style verification, while valuable, cannot scale to cover the expanding layers of unverified claims across infrastructure, AI tooling, and cloud services.
Kingsbury extends his critique beyond traditional database vendors to AI coding tools, arguing they produce plausible output with no mechanism for verifying correctness. Unlike databases where Jepsen could at least test marketed guarantees after the fact, AI-generated code introduces a layer where the gap between claimed and actual behavior is structurally unmeasurable at scale.
Kyle Kingsbury — better known as Aphyr, the person behind Jepsen, the most rigorous independent correctness testing suite for distributed databases — published a post titled "The Future of Everything Is Lies, I Guess." The title is weary rather than angry, which makes it hit harder. Coming from someone who has spent over a decade methodically proving that databases lose your data in ways their vendors swore they wouldn't, this isn't a hot take. It's a field report.
Kingsbury's core argument is that the software industry has moved from occasionally shipping broken things to *structurally incentivizing* dishonesty about what systems actually do. The post traces this pattern across multiple domains: database vendors who market consistency guarantees their products don't deliver, cloud providers whose SLAs are aspirational rather than contractual in any meaningful sense, and now AI coding tools that generate plausible output with no mechanism for verifying correctness.
The HN discussion — 431 points and climbing — resonated because it named something practitioners feel but rarely articulate: the gap between what software *claims* to do and what it *actually does* is widening, and the industry has decided that's fine.
Kingsbury has unique credibility here. Jepsen tests have uncovered data-loss bugs in MongoDB, CockroachDB, Redis, RabbitMQ, Elasticsearch, and dozens of other systems that marketed themselves as safe. In many cases, the bugs were in exactly the features the vendors highlighted in their sales materials. The pattern Jepsen exposed wasn't that distributed systems are hard — everyone knows that — but that vendors *knew* their guarantees were broken and shipped anyway.
What's new in Kingsbury's argument is the scope creep. It's no longer just databases playing fast and loose with consistency semantics. The entire stack is now built on layers of unverified claims:
Infrastructure layer: Cloud providers advertise "eleven nines" durability numbers that are extrapolations, not measurements. When S3 had its 2017 outage, it turned out nobody had actually tested what happens when a human typos a command that takes out a major subsystem. The SLA refund for downtime is typically a service credit — not compensation for the business damage the downtime caused.
Application layer: ORMs, frameworks, and libraries ship with documented behaviors that are really documented *intentions*. Edge cases go untested. Performance claims come from benchmarks designed to make the tool look good. And because most applications don't have property-based tests or formal verification, nobody discovers the gaps until a specific production workload hits them.
AI layer: This is where Kingsbury's argument gets its sharpest edge. AI code generation tools produce output that looks correct — syntactically valid, stylistically plausible, often functional for the happy path — but with no mechanism for the model to know or communicate what it doesn't know. An LLM that generates a database query doesn't understand transaction isolation. It pattern-matches from training data and produces something that usually works. The failure mode isn't "obviously broken code" — it's subtly wrong code that passes code review because it *looks* right.
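To make the failure mode concrete, here is a minimal, hypothetical sketch (using Python's stdlib `sqlite3` as a stand-in for any database) of code that "looks right": a read-modify-write that passes the happy-path test and code review, next to the version that pushes the arithmetic into the database so the update can't be lost. The schema and function names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

def debit_racy(conn, account_id, amount):
    # Looks correct, and it is correct single-threaded. But the read and
    # the write are separate statements: two concurrent calls can both
    # read the same balance, and one debit is silently lost (a classic
    # lost update). This is exactly the kind of code that "usually works".
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    conn.execute(
        "UPDATE accounts SET balance = ? WHERE id = ?",
        (balance - amount, account_id),
    )
    conn.commit()

def debit_atomic(conn, account_id, amount):
    # The arithmetic happens inside a single UPDATE, so the database
    # serializes it; no interleaving of concurrent debits loses money.
    conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE id = ?",
        (amount, account_id),
    )
    conn.commit()

debit_atomic(conn, 1, 30)
(balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
print(balance)  # 70
```

The two functions are indistinguishable in a demo and in most reviews; only an understanding of isolation, or a concurrency test, separates them.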
The community response split into two camps. One group — mostly people who've been burned by exactly these issues — treated the post as vindication. The other pushed back, arguing that imperfect software that ships is better than perfect software that doesn't, and that the industry has always been this way. Both sides have a point, but they're talking past each other. The question isn't whether trade-offs are necessary. It's whether the trade-offs are being *disclosed*.
If you're a senior engineer reading this, the practical implications are uncomfortable but actionable.
First, audit your trust chain. Every system in your stack makes claims — about durability, consistency, performance, security. For each critical claim, ask: has anyone actually tested this? Not "did the vendor say they tested it" — has *your team* verified the behavior under *your workload*? If you're running a database that claims linearizable reads, and you haven't tested that claim under your actual failure scenarios, you don't have linearizable reads. You have a marketing promise.
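What "verify the behavior yourself" can look like, in miniature: a harness that hammers a claimed invariant (here, read-your-writes) with generated operations and reports the first violation. The `Client` class below is a hypothetical in-memory stand-in; a real audit would swap in your actual driver and run this from multiple processes while injecting the failures you care about.

```python
import random

class Client:
    """Hypothetical stand-in for a real database client."""
    def __init__(self):
        self.store = {}
    def write(self, key, value):
        self.store[key] = value
    def read(self, key):
        return self.store.get(key)

def check_read_your_writes(client, rounds=1000):
    # Claim check: every read observes the most recent acknowledged
    # write to the same key. Returns (ok, first_violation).
    for i in range(rounds):
        key = f"k{random.randrange(8)}"
        client.write(key, i)
        observed = client.read(key)
        if observed != i:
            return False, (key, i, observed)
    return True, None

ok, violation = check_read_your_writes(Client())
print(ok)  # True for the in-memory stand-in; the point is running it on yours
```

The in-memory version trivially passes; the value of the harness is what it finds when pointed at the system whose marketing you're currently trusting.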
Second, treat AI-generated code the way you'd treat code from a confident but junior engineer who just started last week. It might be correct. It's probably mostly correct. But the confidence it projects has zero correlation with its actual correctness. Code review processes need to adapt: reviewing AI-generated code isn't about style nits, it's about whether the code handles the cases the model never considered because it doesn't actually understand your domain.
Third, invest in verification that's proportional to your risk. You don't need Jepsen-level testing for your blog's RSS feed. But if you're handling financial transactions, health data, or anything where "usually works" isn't good enough, the Aphyr principle applies: trust nothing, verify everything, and be deeply suspicious of any system that makes strong guarantees but resists being tested.
Property-based testing (Hypothesis, fast-check, QuickCheck) is the most accessible first step. Chaos engineering (killing processes, partitioning networks, corrupting data) is the next level. Formal methods remain niche but are gaining traction for critical paths. The point isn't perfection — it's knowing where your fiction begins.
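As a dependency-free sketch of the property-based idea (Hypothesis and friends do this far better, with shrinking and smarter generation), here is a hand-rolled version using only the stdlib: generate many random inputs and assert invariants that must hold for every input, not one hand-picked case. The `dedupe` function is an invented example of code under test.

```python
import random

def dedupe(xs):
    # Order-preserving de-duplication: the code under test.
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(trials=500):
    # Poor man's property-based test: random inputs, universal invariants.
    for _ in range(trials):
        xs = [random.randrange(10) for _ in range(random.randrange(20))]
        out = dedupe(xs)
        assert len(out) == len(set(out)), "duplicates survived"
        assert set(out) == set(xs), "elements lost or invented"
        # Order preservation: keep each element at its first occurrence.
        assert out == [x for i, x in enumerate(xs) if x not in xs[:i]]
    return trials

print(check_properties())  # 500
```

A single example-based test would have checked one list; this checks five hundred, and a library like Hypothesis would also minimize any counterexample it finds.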
The irony Kingsbury identifies is that AI was supposed to help us write better software, but its dominant effect so far has been to increase the volume of unverified code. More code, produced faster, with less understanding of what it does. The correction won't come from the market — vendors have no incentive to be honest about limitations when competitors aren't. It'll come from the engineers who decide that knowing what their systems actually do is part of the job, not an optional extra. Jepsen proved that one person with good tooling can hold an entire industry accountable. The question is whether the next generation of developers will see that as inspiring or quaint.