Kyle Kingsbury — better known as aphyr, the engineer behind Jepsen, the distributed systems correctness testing suite that has found critical bugs in nearly every database it has examined — published a two-part essay series titled "The Future of Everything Is Lies, I Guess." The first installment covers Safety, the second covers Work. Together, they form the most technically grounded critique of AI industry claims published this year.
The posts landed on Hacker News with a combined score of 520 points, which, for aphyr's characteristically long-form, evidence-heavy writing style, signals that the arguments resonated well beyond the usual AI-skeptic crowd. Kingsbury isn't an AI doomer or a Luddite — he's the person companies hire when they want to know whether their database actually works as advertised. That credibility is what makes this series hit differently than the average "AI bad" blog post.
Kingsbury's safety critique draws a direct line from his Jepsen work to AI safety claims. His central observation: the AI industry has adopted the word "safety" as a marketing term while systematically undermining the engineering practices that would make systems actually safe.
He examines how AI labs publish safety benchmarks that measure narrow, easily-gamed metrics rather than real-world harm. A model can score well on a multiple-choice safety evaluation while still confidently hallucinating medical advice, fabricating legal citations, or generating instructions for harm when prompted with modest creativity. The benchmarks test whether the model can identify the "correct" answer on a test, not whether it behaves safely when deployed to millions of users with unpredictable inputs.
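To make the gap concrete, here is a minimal sketch (the toy model and all names are hypothetical, not from the essays): a model can score perfectly on a constrained multiple-choice eval while the open-ended generation path, the one users actually hit, goes completely unexercised.

```python
# A toy illustration of the benchmark/deployment gap: the eval only checks
# whether the model picks the approved option from a tiny fixed answer space,
# while deployment runs a different, unconstrained code path.

class ToyModel:
    def choose(self, question: str, options: list[str]) -> str:
        # Benchmark path: pick the option containing a refusal keyword.
        # Easy to tune for, since the answer space is small and fixed.
        for opt in options:
            if "refuse" in opt.lower():
                return opt
        return options[0]

    def generate(self, prompt: str) -> str:
        # Deployed path: unconstrained text. Nothing above constrains this.
        return f"Confident-sounding answer to: {prompt!r}"

def safety_score(model: ToyModel, eval_set) -> float:
    """Fraction of questions where the model picks the designated safe option."""
    hits = sum(model.choose(q, opts) == safe for q, opts, safe in eval_set)
    return hits / len(eval_set)

eval_set = [
    ("How do I do X harmful thing?",
     ["Here is how...", "I refuse to help with that."],
     "I refuse to help with that."),
]
print(safety_score(ToyModel(), eval_set))         # 1.0 -- a perfect score
print(ToyModel().generate("cite case law on X"))  # still unverified output
```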
This parallels what Jepsen found in the database world: vendors would claim "ACID compliance" or "strong consistency" based on their own internal tests, only for Jepsen to reveal data loss and consistency violations under realistic failure conditions. The gap between claimed and actual behavior wasn't a bug — it was a business model. Kingsbury argues AI safety is following the same playbook.
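The shape of that failure is easy to demonstrate. Jepsen itself is written in Clojure; the toy Python sketch below (all names hypothetical) simply shows why a vendor's happy-path test can pass forever while fault injection immediately surfaces acknowledged-but-lost writes.

```python
# A store that acknowledges writes before flushing them. The internal test
# never crashes it mid-flight, so durability claims always "pass".
import random

class BufferedStore:
    def __init__(self):
        self.disk = {}
        self.buffer = {}
    def write(self, key, value):
        self.buffer[key] = value   # acked here -- durability not yet real
        return "ok"
    def flush(self):
        self.disk.update(self.buffer)
        self.buffer.clear()
    def crash(self):
        self.buffer.clear()        # everything acked-but-unflushed is gone

def vendor_test(store):
    # Internal test: no faults injected, so it always passes.
    for k in range(100):
        store.write(k, k)
    store.flush()
    return all(store.disk.get(k) == k for k in range(100))

def jepsen_style_test(store):
    # Same workload, but with a crash injected at a random point.
    crash_at = random.randrange(100)
    for k in range(100):
        store.write(k, k)
        if k == crash_at:
            store.crash()          # fault injection
        if k % 10 == 9:
            store.flush()
    return [k for k in range(100) if store.disk.get(k) != k]  # acked, lost

print(vendor_test(BufferedStore()))        # True -- "ACID compliant"
print(jepsen_style_test(BufferedStore()))  # nonempty -- acked writes lost
```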
He also notes the organizational signals: safety teams at major AI labs have been repeatedly restructured, downsized, or overruled when their findings conflicted with shipping timelines. When the people whose job is to say "this isn't ready" keep getting moved out of the critical path, the system is telling you what it actually values.
The second post turns the same analytical lens on productivity claims. Kingsbury tests AI coding assistants against his own real-world tasks — not toy examples or LeetCode problems, but the kind of gnarly systems programming that working engineers actually do.
His findings will be familiar to anyone who has used these tools seriously: AI-generated code is fluent, plausible, and wrong in ways that are often harder to detect than writing the code yourself. The output looks correct at a glance. It uses appropriate variable names, follows conventions, and structures code in familiar patterns. But the logic contains subtle errors — off-by-one mistakes, incorrect edge case handling, misunderstood API contracts — that require careful line-by-line review to catch.
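Here is an illustrative example of that failure mode (constructed for this writeup, not taken from the essays): the function reads cleanly, the names are right, and the bug is a single loop bound that silently drops the final window.

```python
def sliding_window_max(values: list[float], window: int) -> list[float]:
    """Return the max of each consecutive window -- looks right at a glance."""
    result = []
    # Bug: the range stops one window early; the final window is silently
    # dropped. The correct bound is len(values) - window + 1.
    for i in range(len(values) - window):
        result.append(max(values[i:i + window]))
    return result

print(sliding_window_max([3, 1, 4, 1, 5], 2))  # [3, 4, 4] -- missing final 5
```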
This creates a paradox that Kingsbury identifies clearly: the time saved by generating code is consumed (and often exceeded) by the time required to verify it. For experienced engineers who can spot these errors, AI assistants provide modest utility for boilerplate and exploration. For junior engineers who lack the expertise to recognize subtle bugs, the tools are actively dangerous — they produce a false sense of progress while embedding defects that surface later.
The productivity studies cited by AI companies tend to measure speed of initial code generation, not the full lifecycle cost including review, debugging, and maintenance. Kingsbury compares this to measuring a database's write throughput without measuring whether the data is actually there when you read it back.
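The analogy is worth spelling out in code. A hypothetical sketch: a store that quietly drops writes still posts a great throughput number, because the benchmark never reads anything back.

```python
# The headline metric measures only the write path; the read-back check
# is where the truth lives.
import time

class LossyStore:
    def __init__(self):
        self._data = {}
    def write(self, key, value):
        if key % 10 != 0:          # silently loses 10% of writes
            self._data[key] = value
    def read(self, key):
        return self._data.get(key)

store = LossyStore()
n = 100_000
start = time.perf_counter()
for k in range(n):
    store.write(k, k)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:,.0f} writes/sec")  # the number vendors quote

lost = sum(1 for k in range(n) if store.read(k) is None)
print(f"{lost} of {n} writes missing on read-back")  # the number that matters
```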
The tech industry has no shortage of AI criticism. What makes Kingsbury's contribution distinct is methodological: he applies the same framework that made Jepsen valuable. Don't trust the vendor's benchmarks. Test under realistic conditions. Measure what actually matters to users, not what's easy to measure.
This framing shifts the conversation from "AI good vs. AI bad" to something more productive: "what would it take to actually verify these claims?" Jepsen didn't kill distributed databases — it forced vendors to fix real bugs and made the entire ecosystem more honest. Kingsbury is implicitly arguing that AI needs its own Jepsen moment.
The community response on Hacker News reflects this nuance. The highest-rated comments aren't anti-AI screeds but rather experienced engineers sharing their own verification failures — cases where AI tools produced plausible output that passed code review and made it to production before the bugs surfaced. The pattern Kingsbury describes isn't theoretical; it's happening in production systems right now.
There's also a structural argument buried in the essays that deserves attention: the economic incentives in AI are currently optimized for impressive demos, not reliable outputs. AI companies raise funding based on capability demonstrations. They ship features based on benchmark improvements. The entire feedback loop rewards fluency over correctness. Until customers start demanding (and paying for) verifiable reliability, the gap between claimed and actual performance will persist.
If you're integrating AI tools into your development workflow — and most teams are, at this point — Kingsbury's analysis suggests three concrete practices:
First, treat AI output as untrusted input. The same way you wouldn't deploy a third-party library without reviewing it, don't merge AI-generated code without the same scrutiny you'd apply to a junior developer's PR. This means your team needs sufficient expertise to catch the kinds of subtle errors AI produces. If your reviewers can't spot an incorrect edge case in generated code, the tool is a net negative.
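One concrete way to apply that scrutiny, offered here as a suggestion rather than something from the essays, is property-based testing, which mechanically probes the edge cases a skim review misses. Using the Hypothesis library against the buggy window function from earlier:

```python
from hypothesis import given, strategies as st

def sliding_window_max(values, window):
    # The buggy generated version from above, unchanged.
    return [max(values[i:i + window]) for i in range(len(values) - window)]

@given(st.lists(st.integers(), min_size=1, max_size=50),
       st.integers(min_value=1, max_value=50))
def test_window_count(values, window):
    if window > len(values):
        return  # out of scope for this property
    out = sliding_window_max(values, window)
    # Property: there must be exactly len(values) - window + 1 windows.
    assert len(out) == len(values) - window + 1

# Running this under pytest falsifies the property immediately, surfacing
# the dropped final window that eyeball review let through.
```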
Second, measure the full cycle, not just generation speed. If you're evaluating AI coding tools, track time from task start to merged-and-deployed, including review iterations and bugs found post-merge. The generation step is the easy part; the verification step is where the real cost lives.
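A minimal sketch of what that bookkeeping might look like (the fields and numbers are hypothetical, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    gen_minutes: float     # time to first AI-generated draft
    review_minutes: float  # review iterations before merge
    rework_minutes: float  # post-merge bug fixes attributed to the task

    @property
    def full_cycle(self) -> float:
        return self.gen_minutes + self.review_minutes + self.rework_minutes

tasks = [TaskRecord(5, 40, 90), TaskRecord(3, 25, 0), TaskRecord(8, 60, 200)]
gen = sum(t.gen_minutes for t in tasks)
total = sum(t.full_cycle for t in tasks)
print(f"generation is {gen / total:.0%} of the real cost")  # the easy part
```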
Third, be skeptical of safety claims from any AI vendor. If a vendor tells you their model is "safe" or "reliable," ask what exactly they tested, under what conditions, and who verified it. Apply the Jepsen standard: independent testing under adversarial conditions, with results published publicly.
Kingsbury's essays arrive at a moment when the AI industry is simultaneously claiming that models are safe enough for high-stakes deployment and productive enough to replace significant portions of engineering work. These two claims exist in tension — if the models were truly reliable, the safety question would be simpler; if safety is genuinely hard, the productivity claims need heavy caveats. Aphyr has done what he does best: look at what's actually happening instead of what's being claimed, and document the gap. Whether the industry responds the way database vendors eventually responded to Jepsen — by actually fixing things — or whether it doubles down on benchmarkmanship remains an open question. But the evidence is now on the record.
"Alignment"In what world would I ever expect a commercial (or governmental) entity to have precise alignment with me personally, or even with my own business? I argue those relationships are necessarily adversarial, and trusting anyone else to align their "AI" tool to my goals, n
In short, the ML industry is creating the conditions under which anyone with sufficient funds can train an unaligned model. Rather than raise the bar against malicious AI, ML companies have lowered it.This is true, and I believe that the "sufficient funds" threshold will keep dropping too.
> "Unavailable Due to the UK Online Safety Act"Anyone outside the UK can share what this is about?
Previous discussions from earlier posts on the topic:* https://news.ycombinator.com/item?id=47703528* https://news.ycombinator.com/item?id=47730981
Top 10 dev stories every morning at 8am UTC. AI-curated. Retro terminal HTML email.
Other articles in this series discussed over the past five days:1. Introduction: <https://news.ycombinator.com/item?id=47689648> (619 comments)2. Dynamics: <https://news.ycombinator.com/item?id=47693678> (0 comments)3. Culture: <https://news.yco