Kingsbury argues that LLMs are 'bullshit machines' in the philosophical sense — systems architecturally indifferent to truth. Drawing on his career testing distributed systems for correctness violations, he applies the same scrutiny to LLMs and concludes their fundamental design produces outputs uncoupled from factual grounding, making them dangerous infrastructure for society.
Kingsbury traces his concern back to 2019, when he publicly asked a hyperscaler whether making deep learning cheaper would enable new forms of spam and propaganda. He frames the current landscape as a predictable consequence of prioritizing capability over ethics, and positions his essay as filling the 'negative space' that launch keynotes deliberately omit.
The editorial argues that Kingsbury's Jepsen work — which has publicly embarrassed major databases by proving their consistency guarantees didn't hold — establishes a unique authority for this critique. When someone whose career is built on ruthlessly testing systems' truth claims says LLMs are architecturally truth-indifferent, it carries specific technical credibility that generic AI skepticism does not.
Kingsbury explicitly frames his essay as intentionally unbalanced, acknowledging it is 'neither balanced nor complete.' He argues that boosterism needs no amplification and that others have better covered ecological and IP concerns, so his contribution should focus exclusively on mapping risks and failure modes — the negative space that isn't represented in mainstream AI coverage.
Kyle Kingsbury — better known as Aphyr, the person behind Jepsen, the gold-standard distributed-systems testing framework that has found correctness bugs in virtually every major database it has examined — has released a multi-part essay titled *The Future of Everything Is Lies, I Guess*. Available as a series of blog posts plus PDF and EPUB, the piece represents years of deferred writing on the social and technical implications of large language models.
Kingsbury opens with a disarming admission: he grew up on Asimov and Clarke, dreamed of intelligent machines, and never imagined the Turing test would fall in his lifetime. He also never imagined he'd feel so disheartened when it did. The essay traces his skepticism back to 2019, when he asked a hyperscaler presenting new LLM training hardware whether what they were doing was ethical — whether making deep learning cheaper would enable new forms of spam and propaganda. Five years later, the essay finally exists, and it is, in his own words, "bullshit about bullshit machines."
The piece is deliberately one-sided. Kingsbury acknowledges that others have covered ecological and intellectual property dimensions more thoroughly, and that boosterism needs no additional amplification. His goal is to map the negative space — the risks and failure modes that don't make it into launch keynotes.
This isn't a random blog post from a concerned citizen. Kingsbury's entire career has been built on one principle: systems that claim correctness properties should be tested against those claims, ruthlessly and publicly. Jepsen has embarrassed Redis, MongoDB, Elasticsearch, CockroachDB, and dozens of other databases by demonstrating that their consistency guarantees didn't hold under real failure conditions. When that person turns their attention to LLMs and says the fundamental architecture is truth-indifferent, it carries a specific weight.
The essay's core argument — that LLMs are bullshit machines in the philosophical sense, producing outputs without regard to truth value — isn't new. Harry Frankfurt's *On Bullshit* framework has been applied to language models since GPT-3. What Kingsbury adds is the systems-thinking perspective: what happens when you deploy truth-indifferent components into truth-dependent pipelines? In distributed systems, a single node that lies about its state can corrupt an entire cluster. The analogy to LLM-generated code, documentation, legal filings, and medical advice is not subtle.
The Hacker News discussion around the essay is itself instructive. Commenter danieltanfh95 pushed back on the "LLMs can't do X so they're idiots" framing, arguing that LLMs with harnesses — tool use, retrieval augmentation, chain-of-thought scaffolding — are "clearly capable of engaging with logical problems that only need text." This is the strongest version of the counterargument: nobody serious claims raw token prediction is reasoning, but the composite systems built around LLMs may be.
The most interesting tension in the discourse isn't between "AI works" and "AI doesn't work" — it's between people building verification layers fast enough and people deploying without them. Commenter munificent drew a parallel to the Industrial Revolution: before industrialization, the natural world was nearly infinitely abundant relative to our capacity to exploit it. LLMs may have done something similar to information — made the generation of plausible-sounding text so cheap that we've overwhelmed our capacity to verify it.
Meanwhile, commenter beders highlighted the terminological problem: the phrase "AI" is so overloaded that conversations about capabilities, risks, and ethics constantly talk past each other. When a product marketer says "AI" and a machine learning researcher says "AI" and Kyle Kingsbury says "AI," they're describing different things — and the ambiguity is not accidental.
If you're using LLM-generated code in production — and at this point, most teams are — Kingsbury's essay is a useful forcing function to audit your verification pipeline. The question isn't whether Copilot or Claude or GPT wrote the code. The question is whether your review process, test coverage, and deployment safeguards were designed for a world where a substantial fraction of submitted code was generated by a system that optimizes for plausibility rather than correctness.
Concretely, this means:
Testing budgets need to account for LLM-generated code. If a substantial share of your new code is AI-assisted (GitHub has reported Copilot suggestion-acceptance rates of roughly 30%), your test suite needs to cover failure modes that human developers rarely produce but LLMs produce routinely: subtly wrong boundary conditions, hallucinated API surfaces, correct-looking code that breaks under concurrency. Property-based testing and fuzzing become more valuable, not less, in an LLM-assisted workflow.
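As a sketch of what that looks like in practice, here is a minimal property-based check in plain Python, without a Hypothesis dependency. The `chunk` helper and `check_chunk_properties` harness are invented for illustration: `chunk` is the kind of small utility an assistant often generates, and the randomized invariants (lossless reassembly, size bounds) are exactly the checks that catch the off-by-one boundary bugs mentioned above.

```python
import random

def chunk(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def check_chunk_properties(trials=500, seed=0):
    """Randomized property check for chunk(): reassembly and size invariants."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(-10, 10) for _ in range(rng.randint(0, 50))]
        size = rng.randint(1, 10)
        chunks = chunk(items, size)
        # Flattening the chunks must reproduce the input exactly.
        assert [x for c in chunks for x in c] == items
        # No chunk exceeds the requested size.
        assert all(len(c) <= size for c in chunks)
        # Every chunk except possibly the last must be full.
        assert all(len(c) == size for c in chunks[:-1])
    return True
```

A dedicated library like Hypothesis adds input shrinking and smarter generation, but even this hand-rolled loop exercises boundary cases (empty input, size 1, final partial chunk) that a handful of example-based tests typically miss.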
Code review norms need updating. The traditional code review assumes a human author who understands the code's intent and can explain their reasoning when questioned. When the author is a human who accepted a suggestion from a system that has no intent, the review dynamic changes. Some teams are experimenting with requiring reviewers to run AI-generated code locally before approving, or flagging AI-assisted PRs for additional scrutiny. Neither approach scales perfectly, but doing nothing scales worse.
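One lightweight mechanism for the flagging approach, assuming a team adopts an "Assisted-by:" commit trailer as a local convention (the trailer name is hypothetical, not a git or GitHub standard), is a CI check that routes marked commits for extra scrutiny:

```python
def needs_extra_review(commit_message: str) -> bool:
    """Return True when a git trailer marks the commit as AI-assisted.

    Assumes a team convention of adding e.g. "Assisted-by: Copilot" to
    commit messages; the trailer name is illustrative, not a standard.
    """
    for line in commit_message.splitlines():
        if line.strip().lower().startswith("assisted-by:"):
            return True
    return False
```

A CI job could call this on each commit in a PR and apply a label or require a second approver. The honor-system dependency is obvious, but it makes the policy checkable rather than purely aspirational.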
Observability matters more. Kingsbury's Jepsen work proved that distributed systems fail in ways their authors didn't anticipate. LLM-generated code fails the same way — it's syntactically valid, it passes the obvious tests, and it breaks in production under conditions the model never saw in training. If you're not already running comprehensive observability on LLM-assisted codepaths, you're flying blind in exactly the way Kingsbury has spent a decade warning database vendors about.
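A minimal sketch of that per-call visibility, using only the standard library as a stand-in for a real telemetry stack such as OpenTelemetry; the decorator and the `parse_order_total` helper it wraps are both invented for illustration:

```python
import functools
import logging
import time

log = logging.getLogger("llm_codepath")

def observed(fn):
    """Emit latency and error telemetry for every call to fn.

    A stand-in for real instrumentation: the point is that AI-assisted
    codepaths get per-call visibility in production, not just CI checks.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            log.exception("%s raised", fn.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s completed in %.2f ms", fn.__name__, elapsed_ms)
    return wrapper

@observed
def parse_order_total(line: str) -> float:
    # Hypothetical AI-assisted helper under observation.
    return float(line.rsplit(",", 1)[-1])
```

Tagging the telemetry by origin (a logger or metric label per AI-assisted codepath) is what lets you later ask whether those paths fail at a different rate than human-written ones.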
The broader point extends beyond code. If your product uses LLM outputs for customer-facing content, search results, documentation, or decision support, the verification layer is your product's integrity. The LLM is a generation engine. Verification is your job.
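A sketch of such a verification gate in Python, treating the model as an untrusted generator whose output must pass explicit shape checks before it touches the product; the field names in `REQUIRED` are invented, not from any real model API:

```python
import json

# Illustrative schema: these field names are assumptions, not a real API.
REQUIRED = {"title": str, "summary": str, "confidence": (int, float)}

def verify_llm_output(raw: str) -> dict:
    """Gate an untrusted model response behind explicit checks.

    Raises ValueError for non-JSON output, missing or mistyped fields,
    or an out-of-range confidence; only verified data passes through.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or mistyped field: {key}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```

Schema validation is the cheapest layer; it catches malformed output but not fluent falsehoods, which is why fact-level checks (retrieval cross-references, human review) still sit above it.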
Kingsbury's essay is the first installment of a series, with subsequent sections releasing over the coming days. Given his track record — Jepsen reports are exhaustive, well-sourced, and devastating to their subjects — the full work is likely to become a reference text for the LLM-skeptic position. Whether you agree with his framing or not, the systems-level question he's asking is the right one: we know how to build reliable systems from unreliable components (it's literally what distributed computing is), but only when we acknowledge the unreliability upfront rather than marketing it away. The industry's track record on that front, as Jepsen has documented across dozens of databases, is not encouraging.
> 2017’s Attention is All You Need was groundbreaking and paved the way for ChatGPT et al. Since then ML researchers have been trying to come up with new architectures, and companies have thrown gazillions of dollars at smart people to play around and see if they can make a better kind of model.
> It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpuses at the current generation of models will lead to human-equivalent capabilities. Massive increases in training costs and parameter count seem to be yielding diminishing returns. Or maybe this effect…
> I think the discussion has to be more nuanced than this. "LLMs still can't do X so it's an idiot" is a bad line of thought. LLMs with harnesses are clearly capable of engaging with logical problems that only need text. LLMs are not there yet with images, but we are improving.
> Thank you for putting it so succinctly. I keep explaining to my peers, friends, and family that what actually happens inside an LLM has nothing to do with consciousness or agency, and that the term AI is just completely overloaded right now.
> There is a whole giant essay I probably need to write at some point, but I can't help but see parallels between today and the Industrial Revolution. Prior to the Industrial Revolution, the natural world was nearly infinitely abundant. We simply weren't efficient enough to fully exploit it.