Kingsbury argues that the gap between what software claims to do and what it actually does is widening across the entire stack — from databases marketing consistency guarantees they don't deliver, to cloud SLAs that are aspirational rather than contractual, to AI coding tools generating unverifiable output. His decade of Jepsen testing revealed that vendors often knew their guarantees were broken and shipped anyway, and he sees this pattern now metastasizing industry-wide.
Surfaced on HN by pabs3, the post drew 431 points and 446 comments: a strong signal that practitioners feel this verification gap acutely. The response suggests widespread agreement that the problem is real and that Jepsen-style verification, while valuable, cannot scale to cover the expanding layers of unverified claims across infrastructure, AI tooling, and cloud services.
Kingsbury extends his critique beyond traditional database vendors to AI coding tools, arguing they produce plausible output with no mechanism for verifying correctness. Unlike databases where Jepsen could at least test marketed guarantees after the fact, AI-generated code introduces a layer where the gap between claimed and actual behavior is structurally unmeasurable at scale.
Kyle Kingsbury — better known as Aphyr, the person behind Jepsen, the most rigorous independent correctness testing suite for distributed databases — published a post titled "The Future of Everything Is Lies, I Guess." The title is weary rather than angry, which makes it hit harder. Coming from someone who has spent over a decade methodically proving that databases lose your data in ways their vendors swore they wouldn't, this isn't a hot take. It's a field report.
Kingsbury's core argument is that the software industry has moved from occasionally shipping broken things to *structurally incentivizing* dishonesty about what systems actually do. The post traces this pattern across multiple domains: database vendors who market consistency guarantees their products don't deliver, cloud providers whose SLAs are aspirational rather than contractual in any meaningful sense, and now AI coding tools that generate plausible output with no mechanism for verifying correctness.
The HN discussion — 431 points and climbing — resonated because it named something practitioners feel but rarely articulate: the gap between what software *claims* to do and what it *actually does* is widening, and the industry has decided that's fine.
Kingsbury has unique credibility here. Jepsen tests have uncovered data-loss bugs in MongoDB, CockroachDB, Redis, RabbitMQ, Elasticsearch, and dozens of other systems that marketed themselves as safe. In many cases, the bugs were in exactly the features the vendors highlighted in their sales materials. The pattern Jepsen exposed wasn't that distributed systems are hard — everyone knows that — but that vendors *knew* their guarantees were broken and shipped anyway.
What's new in Kingsbury's argument is the scope creep. It's no longer just databases playing fast and loose with consistency semantics. The entire stack is now built on layers of unverified claims:
Infrastructure layer: Cloud providers advertise "eleven nines" durability numbers that are extrapolations, not measurements. When S3 had its 2017 outage, it turned out nobody had actually tested what happens when a human typos a command that takes out a major subsystem. The SLA refund for downtime is typically a service credit — not compensation for the business damage the downtime caused.
Application layer: ORMs, frameworks, and libraries ship with documented behaviors that are really documented *intentions*. Edge cases go untested. Performance claims come from benchmarks designed to make the tool look good. And because most applications don't have property-based tests or formal verification, nobody discovers the gaps until a specific production workload hits them.
AI layer: This is where Kingsbury's argument gets its sharpest edge. AI code generation tools produce output that looks correct — syntactically valid, stylistically plausible, often functional for the happy path — but with no mechanism for the model to know or communicate what it doesn't know. An LLM that generates a database query doesn't understand transaction isolation. It pattern-matches from training data and produces something that usually works. The failure mode isn't "obviously broken code" — it's subtly wrong code that passes code review because it *looks* right.
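To make the failure mode concrete, here is a minimal, hypothetical sketch (using Python's stdlib `sqlite3` as a stand-in for any database) of code that "looks right": a read-modify-write that passes the happy-path test and code review, next to the version that pushes the arithmetic into the database so the update can't be lost. The schema and function names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

def debit_racy(conn, account_id, amount):
    # Looks correct, and it is correct single-threaded. But the read and
    # the write are separate statements: two concurrent calls can both
    # read the same balance, and one debit is silently lost (a classic
    # lost update). This is exactly the kind of code that "usually works".
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    conn.execute(
        "UPDATE accounts SET balance = ? WHERE id = ?",
        (balance - amount, account_id),
    )
    conn.commit()

def debit_atomic(conn, account_id, amount):
    # The arithmetic happens inside a single UPDATE, so the database
    # serializes it; no interleaving of concurrent debits loses money.
    conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE id = ?",
        (amount, account_id),
    )
    conn.commit()

debit_atomic(conn, 1, 30)
(balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
print(balance)  # 70
```

The two functions are indistinguishable in a demo and in most reviews; only an understanding of isolation, or a concurrency test, separates them.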
The community response split into two camps. One group — mostly people who've been burned by exactly these issues — treated the post as vindication. The other pushed back, arguing that imperfect software that ships is better than perfect software that doesn't, and that the industry has always been this way. Both sides have a point, but they're talking past each other. The question isn't whether trade-offs are necessary. It's whether the trade-offs are being *disclosed*.
If you're a senior engineer reading this, the practical implications are uncomfortable but actionable.
First, audit your trust chain. Every system in your stack makes claims — about durability, consistency, performance, security. For each critical claim, ask: has anyone actually tested this? Not "did the vendor say they tested it" — has *your team* verified the behavior under *your workload*? If you're running a database that claims linearizable reads, and you haven't tested that claim under your actual failure scenarios, you don't have linearizable reads. You have a marketing promise.
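What "verify the behavior yourself" can look like, in miniature: a harness that hammers a claimed invariant (here, read-your-writes) with generated operations and reports the first violation. The `Client` class below is a hypothetical in-memory stand-in; a real audit would swap in your actual driver and run this from multiple processes while injecting the failures you care about.

```python
import random

class Client:
    """Hypothetical stand-in for a real database client."""
    def __init__(self):
        self.store = {}
    def write(self, key, value):
        self.store[key] = value
    def read(self, key):
        return self.store.get(key)

def check_read_your_writes(client, rounds=1000):
    # Claim check: every read observes the most recent acknowledged
    # write to the same key. Returns (ok, first_violation).
    for i in range(rounds):
        key = f"k{random.randrange(8)}"
        client.write(key, i)
        observed = client.read(key)
        if observed != i:
            return False, (key, i, observed)
    return True, None

ok, violation = check_read_your_writes(Client())
print(ok)  # True for the in-memory stand-in; the point is running it on yours
```

The in-memory version trivially passes; the value of the harness is what it finds when pointed at the system whose marketing you're currently trusting.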
Second, treat AI-generated code the way you'd treat code from a confident but junior engineer who just started last week. It might be correct. It's probably mostly correct. But the confidence it projects has zero correlation with its actual correctness. Code review processes need to adapt: reviewing AI-generated code isn't about style nits, it's about whether the code handles the cases the model never considered because it doesn't actually understand your domain.
Third, invest in verification that's proportional to your risk. You don't need Jepsen-level testing for your blog's RSS feed. But if you're handling financial transactions, health data, or anything where "usually works" isn't good enough, the Aphyr principle applies: trust nothing, verify everything, and be deeply suspicious of any system that makes strong guarantees but resists being tested.
Property-based testing (Hypothesis, fast-check, QuickCheck) is the most accessible first step. Chaos engineering (killing processes, partitioning networks, corrupting data) is the next level. Formal methods remain niche but are gaining traction for critical paths. The point isn't perfection — it's knowing where your fiction begins.
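As a dependency-free sketch of the property-based idea (Hypothesis and friends do this far better, with shrinking and smarter generation), here is a hand-rolled version using only the stdlib: generate many random inputs and assert invariants that must hold for every input, not one hand-picked case. The `dedupe` function is an invented example of code under test.

```python
import random

def dedupe(xs):
    # Order-preserving de-duplication: the code under test.
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(trials=500):
    # Poor man's property-based test: random inputs, universal invariants.
    for _ in range(trials):
        xs = [random.randrange(10) for _ in range(random.randrange(20))]
        out = dedupe(xs)
        assert len(out) == len(set(out)), "duplicates survived"
        assert set(out) == set(xs), "elements lost or invented"
        # Order preservation: keep each element at its first occurrence.
        assert out == [x for i, x in enumerate(xs) if x not in xs[:i]]
    return trials

print(check_properties())  # 500
```

A single example-based test would have checked one list; this checks five hundred, and a library like Hypothesis would also minimize any counterexample it finds.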
The irony Kingsbury identifies is that AI was supposed to help us write better software, but its dominant effect so far has been to increase the volume of unverified code. More code, produced faster, with less understanding of what it does. The correction won't come from the market — vendors have no incentive to be honest about limitations when competitors aren't. It'll come from the engineers who decide that knowing what their systems actually do is part of the job, not an optional extra. Jepsen proved that one person with good tooling can hold an entire industry accountable. The question is whether the next generation of developers will see that as inspiring or quaint.