Kyle Kingsbury — better known as aphyr, the engineer behind Jepsen, the distributed systems correctness testing suite that has found critical bugs in nearly every database it has examined — published a two-part essay series titled "The Future of Everything Is Lies, I Guess." The first installment covers Safety, the second covers Work. Together, they form the most technically grounded critique of AI industry claims published this year.
The posts landed on Hacker News with a combined score of 520 points, which, for aphyr's characteristically long-form, evidence-heavy writing style, signals that the arguments resonated well beyond the usual AI-skeptic crowd. Kingsbury isn't an AI doomer or a Luddite — he's the person companies hire when they want to know whether their database actually works as advertised. That credibility is what makes this series hit differently than the average "AI bad" blog post.
Kingsbury's safety critique draws a direct line from his Jepsen work to AI safety claims. His central observation: the AI industry has adopted the word "safety" as a marketing term while systematically undermining the engineering practices that would make systems actually safe.
He examines how AI labs publish safety benchmarks that measure narrow, easily-gamed metrics rather than real-world harm. A model can score well on a multiple-choice safety evaluation while still confidently hallucinating medical advice, fabricating legal citations, or generating instructions for harm when prompted with modest creativity. The benchmarks test whether the model can identify the "correct" answer on a test, not whether it behaves safely when deployed to millions of users with unpredictable inputs.
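To make the gap concrete, here is a minimal sketch (the toy model and all names are hypothetical, not from the essays): a model can score perfectly on a constrained multiple-choice eval while the open-ended generation path, the one users actually hit, goes completely unexercised.

```python
# A toy illustration of the benchmark/deployment gap: the eval only checks
# whether the model picks the approved option from a tiny fixed answer space,
# while deployment runs a different, unconstrained code path.

class ToyModel:
    def choose(self, question: str, options: list[str]) -> str:
        # Benchmark path: pick the option containing a refusal keyword.
        # Easy to tune for, since the answer space is small and fixed.
        for opt in options:
            if "refuse" in opt.lower():
                return opt
        return options[0]

    def generate(self, prompt: str) -> str:
        # Deployed path: unconstrained text. Nothing above constrains this.
        return f"Confident-sounding answer to: {prompt!r}"

def safety_score(model: ToyModel, eval_set) -> float:
    """Fraction of questions where the model picks the designated safe option."""
    hits = sum(model.choose(q, opts) == safe for q, opts, safe in eval_set)
    return hits / len(eval_set)

eval_set = [
    ("How do I do X harmful thing?",
     ["Here is how...", "I refuse to help with that."],
     "I refuse to help with that."),
]
print(safety_score(ToyModel(), eval_set))         # 1.0 -- a perfect score
print(ToyModel().generate("cite case law on X"))  # still unverified output
```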
This parallels what Jepsen found in the database world: vendors would claim "ACID compliance" or "strong consistency" based on their own internal tests, only for Jepsen to reveal data loss and consistency violations under realistic failure conditions. The gap between claimed and actual behavior wasn't a bug — it was a business model. Kingsbury argues AI safety is following the same playbook.
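The shape of that failure is easy to demonstrate. Jepsen itself is written in Clojure; the toy Python sketch below (all names hypothetical) simply shows why a vendor's happy-path test can pass forever while fault injection immediately surfaces acknowledged-but-lost writes.

```python
# A store that acknowledges writes before flushing them. The internal test
# never crashes it mid-flight, so durability claims always "pass".
import random

class BufferedStore:
    def __init__(self):
        self.disk = {}
        self.buffer = {}
    def write(self, key, value):
        self.buffer[key] = value   # acked here -- durability not yet real
        return "ok"
    def flush(self):
        self.disk.update(self.buffer)
        self.buffer.clear()
    def crash(self):
        self.buffer.clear()        # everything acked-but-unflushed is gone

def vendor_test(store):
    # Internal test: no faults injected, so it always passes.
    for k in range(100):
        store.write(k, k)
    store.flush()
    return all(store.disk.get(k) == k for k in range(100))

def jepsen_style_test(store):
    # Same workload, but with a crash injected at a random point.
    crash_at = random.randrange(100)
    for k in range(100):
        store.write(k, k)
        if k == crash_at:
            store.crash()          # fault injection
        if k % 10 == 9:
            store.flush()
    return [k for k in range(100) if store.disk.get(k) != k]  # acked, lost

print(vendor_test(BufferedStore()))        # True -- "ACID compliant"
print(jepsen_style_test(BufferedStore()))  # nonempty -- acked writes lost
```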
He also notes the organizational signals: safety teams at major AI labs have been repeatedly restructured, downsized, or overruled when their findings conflicted with shipping timelines. When the people whose job is to say "this isn't ready" keep getting moved out of the critical path, the system is telling you what it actually values.
The second post turns the same analytical lens on productivity claims. Kingsbury tests AI coding assistants against his own real-world tasks — not toy examples or LeetCode problems, but the kind of gnarly systems programming that working engineers actually do.
His findings will be familiar to anyone who has used these tools seriously: AI-generated code is fluent, plausible, and wrong in ways that are often harder to detect than writing the code yourself. The output looks correct at a glance. It uses appropriate variable names, follows conventions, and structures code in familiar patterns. But the logic contains subtle errors — off-by-one mistakes, incorrect edge case handling, misunderstood API contracts — that require careful line-by-line review to catch.
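Here is an illustrative example of that failure mode (constructed for this writeup, not taken from the essays): the function reads cleanly, the names are right, and the bug is a single loop bound that silently drops the final window.

```python
def sliding_window_max(values: list[float], window: int) -> list[float]:
    """Return the max of each consecutive window -- looks right at a glance."""
    result = []
    # Bug: the range stops one window early; the final window is silently
    # dropped. The correct bound is len(values) - window + 1.
    for i in range(len(values) - window):
        result.append(max(values[i:i + window]))
    return result

print(sliding_window_max([3, 1, 4, 1, 5], 2))  # [3, 4, 4] -- missing final 5
```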
This creates a paradox that Kingsbury identifies clearly: the time saved by generating code is consumed (and often exceeded) by the time required to verify it. For experienced engineers who can spot these errors, AI assistants provide modest utility for boilerplate and exploration. For junior engineers who lack the expertise to recognize subtle bugs, the tools are actively dangerous — they produce a false sense of progress while embedding defects that surface later.
The productivity studies cited by AI companies tend to measure speed of initial code generation, not the full lifecycle cost including review, debugging, and maintenance. Kingsbury compares this to measuring a database's write throughput without measuring whether the data is actually there when you read it back.
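The analogy is worth spelling out in code. A hypothetical sketch: a store that quietly drops writes still posts a great throughput number, because the benchmark never reads anything back.

```python
# The headline metric measures only the write path; the read-back check
# is where the truth lives.
import time

class LossyStore:
    def __init__(self):
        self._data = {}
    def write(self, key, value):
        if key % 10 != 0:          # silently loses 10% of writes
            self._data[key] = value
    def read(self, key):
        return self._data.get(key)

store = LossyStore()
n = 100_000
start = time.perf_counter()
for k in range(n):
    store.write(k, k)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:,.0f} writes/sec")  # the number vendors quote

lost = sum(1 for k in range(n) if store.read(k) is None)
print(f"{lost} of {n} writes missing on read-back")  # the number that matters
```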
The tech industry has no shortage of AI criticism. What makes Kingsbury's contribution distinct is methodological: he applies the same framework that made Jepsen valuable. Don't trust the vendor's benchmarks. Test under realistic conditions. Measure what actually matters to users, not what's easy to measure.
This framing shifts the conversation from "AI good vs. AI bad" to something more productive: "what would it take to actually verify these claims?" Jepsen didn't kill distributed databases — it forced vendors to fix real bugs and made the entire ecosystem more honest. Kingsbury is implicitly arguing that AI needs its own Jepsen moment.
The community response on Hacker News reflects this nuance. The highest-rated comments aren't anti-AI screeds but rather experienced engineers sharing their own verification failures — cases where AI tools produced plausible output that passed code review and made it to production before the bugs surfaced. The pattern Kingsbury describes isn't theoretical; it's happening in production systems right now.
There's also a structural argument buried in the essays that deserves attention: the economic incentives in AI are currently optimized for impressive demos, not reliable outputs. AI companies raise funding based on capability demonstrations. They ship features based on benchmark improvements. The entire feedback loop rewards fluency over correctness. Until customers start demanding (and paying for) verifiable reliability, the gap between claimed and actual performance will persist.
If you're integrating AI tools into your development workflow — and most teams are, at this point — Kingsbury's analysis suggests three concrete practices:
First, treat AI output as untrusted input. The same way you wouldn't deploy a third-party library without reviewing it, don't merge AI-generated code without the same scrutiny you'd apply to a junior developer's PR. This means your team needs sufficient expertise to catch the kinds of subtle errors AI produces. If your reviewers can't spot an incorrect edge case in generated code, the tool is a net negative.
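One concrete way to apply that scrutiny, offered here as a suggestion rather than something from the essays, is property-based testing, which mechanically probes the edge cases a skim review misses. Using the Hypothesis library against the buggy window function from earlier:

```python
from hypothesis import given, strategies as st

def sliding_window_max(values, window):
    # The buggy generated version from above, unchanged.
    return [max(values[i:i + window]) for i in range(len(values) - window)]

@given(st.lists(st.integers(), min_size=1, max_size=50),
       st.integers(min_value=1, max_value=50))
def test_window_count(values, window):
    if window > len(values):
        return  # out of scope for this property
    out = sliding_window_max(values, window)
    # Property: there must be exactly len(values) - window + 1 windows.
    assert len(out) == len(values) - window + 1

# Running this under pytest falsifies the property immediately, surfacing
# the dropped final window that eyeball review let through.
```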
Second, measure the full cycle, not just generation speed. If you're evaluating AI coding tools, track time from task start to merged-and-deployed, including review iterations and bugs found post-merge. The generation step is the easy part; the verification step is where the real cost lives.
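A minimal sketch of what that bookkeeping might look like (the fields and numbers are hypothetical, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    gen_minutes: float     # time to first AI-generated draft
    review_minutes: float  # review iterations before merge
    rework_minutes: float  # post-merge bug fixes attributed to the task

    @property
    def full_cycle(self) -> float:
        return self.gen_minutes + self.review_minutes + self.rework_minutes

tasks = [TaskRecord(5, 40, 90), TaskRecord(3, 25, 0), TaskRecord(8, 60, 200)]
gen = sum(t.gen_minutes for t in tasks)
total = sum(t.full_cycle for t in tasks)
print(f"generation is {gen / total:.0%} of the real cost")  # the easy part
```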
Third, be skeptical of safety claims from any AI vendor. If a vendor tells you their model is "safe" or "reliable," ask what exactly they tested, under what conditions, and who verified it. Apply the Jepsen standard: independent testing under adversarial conditions, with results published publicly.
Kingsbury's essays arrive at a moment when the AI industry is simultaneously claiming that models are safe enough for high-stakes deployment and productive enough to replace significant portions of engineering work. These two claims exist in tension — if the models were truly reliable, the safety question would be simpler; if safety is genuinely hard, the productivity claims need heavy caveats. Aphyr has done what he does best: look at what's actually happening instead of what's being claimed, and document the gap. Whether the industry responds the way database vendors eventually responded to Jepsen — by actually fixing things — or whether it doubles down on benchmarkmanship remains an open question. But the evidence is now on the record.
"Alignment"In what world would I ever expect a commercial (or governmental) entity to have precise alignment with me personally, or even with my own business? I argue those relationships are necessarily adversarial, and trusting anyone else to align their "AI" tool to my goals, n
In short, the ML industry is creating the conditions under which anyone with sufficient funds can train an unaligned model. Rather than raise the bar against malicious AI, ML companies have lowered it.This is true, and I believe that the "sufficient funds" threshold will keep dropping too.
> "Unavailable Due to the UK Online Safety Act"Anyone outside the UK can share what this is about?
Previous discussions from earlier posts on the topic:* https://news.ycombinator.com/item?id=47703528* https://news.ycombinator.com/item?id=47730981
Top 10 dev stories every morning at 8am UTC. AI-curated. Retro terminal HTML email.
Other articles in this series discussed over the past five days:1. Introduction: <https://news.ycombinator.com/item?id=47689648> (619 comments)2. Dynamics: <https://news.ycombinator.com/item?id=47693678> (0 comments)3. Culture: <https://news.yco