The S-curve hits: AI capability gains stall while burn rates accelerate

├── "AI capability gains are decelerating while costs explode, creating an unsustainable business model"
│  └── Ed Zitron (wheresyoured.at)

Zitron argues that frontier model releases (GPT-5, Claude Sonnet 4.5, Gemini 3, Grok 4) are delivering smaller capability deltas than prior generations, with SWE-Bench scores moving from ~65% to ~75% rather than doubling as they did between GPT-3.5 and GPT-4. He points to ~18 months of optimizing the same architecture against the same benchmarks while OpenAI burns $40B/year and hyperscalers commit $500B+ in datacenter capex, and frames this as a fundamental mismatch between flattening returns and escalating spend.

├── "The real risk is enterprise demand collapse if marginal users stop noticing upgrades"
│  └── top10.dev editorial (top10.dev)

The editorial reframes the debate away from whether progress is technically slowing toward what happens to the industry's business model when the marginal user stops perceiving improvements. This regime — where each release is real but imperceptible to the daily user — undermines the enterprise upgrade cycle that justifies the capex.

└── "Day-to-day developer experience confirms diminishing returns don't justify the energy footprint"
  └── @crescit_eundo (Hacker News, 575 pts)

The HN thread's 575 points reflect a consensus among working developers comparing their own usage: Sonnet 4.5 is genuinely better than 3.5, but the delta is incremental rather than transformative. Commenters argue the marginal improvement cannot justify the energy footprint of a small European country, treating their lived workflow experience as more credible than benchmark deltas.

What happened

Ed Zitron's June 8 post "AI Is Slowing Down" lays out a now-familiar but increasingly hard-to-rebut argument: the major frontier labs are shipping models faster than ever, and each release is delivering a smaller capability delta than the one before. The piece runs through the release timeline (GPT-5 in mid-2025, Claude Sonnet 4.5, Gemini 3, Grok 4) and notes that none produced the kind of step-change reaction that GPT-4 did in March 2023, when developers genuinely rewrote their workflows in a weekend.

The post leans heavily on benchmarks and on the labs' own framing. SWE-Bench Verified scores have moved from ~65% (Claude 3.5 Sonnet, mid-2024) to ~75% (current frontier) — a real gain, but not the doubling we got from GPT-3.5 to GPT-4. Hallucination rates on factual recall are roughly flat. Long-context retrieval is better but still unreliable past 200K tokens in adversarial tests. Zitron's read: the labs have spent ~18 months optimizing the same architecture against the same benchmarks, and the curve is bending.

Meanwhile the spending isn't bending. OpenAI's projected 2026 burn is north of $40B. Anthropic has raised at a $170B valuation against ~$5B ARR. Microsoft, Meta, Google, and Amazon are collectively committed to over $500B in datacenter capex through 2027. The HN thread (575 points, top-of-page when this was written) is mostly developers comparing their own day-to-day experience and concluding that, yes, Sonnet 4.5 is better than 3.5, but it's not better in a way that justifies the energy footprint of a small European country.

Why it matters

The interesting question isn't whether Zitron is right that progress is slowing — reasonable people can fight about that benchmark by benchmark. The interesting question is what happens to the industry's business model if the *marginal* user stops noticing the upgrades. That's the regime we may already be in.

For the first two years of the LLM boom, the pitch to enterprise was "buy now because the next model will obsolete your current integration." That fear of being left behind is what justified six- and seven-figure annual contracts for capability that, in many cases, a 70B open model running on a rented H100 could approximate. If the gap between frontier and open stops widening — and Qwen 3, DeepSeek V3, and Llama 4 suggest it's narrowing in many domains — then the premium for API access compresses fast. OpenAI's own pricing moves bear this out: GPT-5 launched cheaper per token than GPT-4 did, not more expensive.

The counter-argument, well-represented in the HN comments, is that we're measuring the wrong thing. Agentic workloads — long-horizon, tool-using, self-correcting — are where the real frontier action is, and benchmarks like SWE-Bench Verified don't capture it well. There's something to this. The jump from "single-shot code completion" to "autonomously close a GitHub issue end-to-end" is the kind of qualitative shift that doesn't show up in token-level metrics. But agentic reliability is also where the most embarrassing failures live: Devin's revenue per active customer remains unclear, Cursor's agent mode burns through tokens faster than humans can review the output, and Anthropic's own Claude Code traces show recovery loops that would horrify any SRE.

Zitron's sharpest point, and the one underplayed in the post, is that capex doesn't depreciate on the same curve as capability gains. H100s have a five-year useful life on the books and a ~two-year useful life in practice as Blackwell and Rubin land. Datacenters have 30-year amortization schedules. If the model that justifies a $10B training run in 2026 is only 8% better than the one from 2025, and the one from 2025 is already commoditized by open weights, the math gets ugly fast. This is the same dynamic that ended the dot-com fiber buildout — Level 3 and Global Crossing weren't wrong about demand, they were wrong about who would capture the margin.
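To put rough numbers on that depreciation gap, here is a minimal sketch. It assumes straight-line depreciation and a made-up purchase price of 100 units; neither comes from Zitron's post, and real accounting schedules may differ. Only the five-year book life and the ~two-year practical life are taken from the paragraph above.

```python
def straight_line_book_value(cost: float, book_life_years: int, years_elapsed: int) -> float:
    """Remaining book value under straight-line depreciation (an assumed schedule)."""
    # Clamp elapsed time to the range [0, book_life_years].
    years = min(max(years_elapsed, 0), book_life_years)
    return cost * (book_life_years - years) / book_life_years

cost = 100.0          # hypothetical purchase price, in arbitrary units
book_life = 5         # years on the books (from the text)
practical_life = 2    # years of competitive use (from the text)

# Value still carried on the balance sheet when the hardware stops being competitive.
stranded = straight_line_book_value(cost, book_life, practical_life)
print(f"Book value remaining at year {practical_life}: {stranded:.0f} of {cost:.0f}")
```

Under these assumptions, 60% of the purchase price is still on the books at the point where the hardware is no longer the part anyone wants to rent.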

What this means for your stack

If you're building on top of frontier APIs, the practical takeaway is that you should stop optimizing your prompts for the model you're using today. The half-life of a tuned prompt is now shorter than the procurement cycle to swap providers. Build an abstraction layer. Test the same workload against Sonnet, GPT-5, Gemini 3, and at least one open model monthly. The cost of doing this is a weekend; the cost of being locked in when your vendor raises prices 3x to chase profitability is your runway.
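As a rough illustration of what that abstraction layer can look like, here is a minimal sketch. It assumes each provider can be wrapped as a plain "prompt in, text out" function; `Case`, `run_suite`, and the `vendor_a`/`vendor_b` stubs are hypothetical names, and the stubs stand in for real SDK adapters so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A provider is just "prompt in, text out". Real adapters would wrap each
# vendor's SDK; nothing here calls an actual API.
Provider = Callable[[str], str]

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable

def run_suite(providers: Dict[str, Provider], cases: List[Case]) -> Dict[str, float]:
    """Run every case against every provider; return the pass rate per provider."""
    results: Dict[str, float] = {}
    for name, call in providers.items():
        passed = 0
        for case in cases:
            try:
                if case.check(call(case.prompt)):
                    passed += 1
            except Exception:
                # An adapter error counts as a failed case instead of aborting the run.
                pass
        results[name] = passed / len(cases) if cases else 0.0
    return results

# Stub providers (hypothetical) standing in for real API adapters.
def stub_upper(prompt: str) -> str:
    return prompt.upper()

def stub_echo(prompt: str) -> str:
    return prompt

cases = [
    Case("hello", lambda out: out == "HELLO"),
    Case("abc", lambda out: out.isupper()),
]
scores = run_suite({"vendor_a": stub_upper, "vendor_b": stub_echo}, cases)
print(scores)
```

The point of the design is that your workload and its checks live in `cases`, not in any vendor's prompt format, so adding or dropping a provider is a one-line change to the dictionary.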

Second: stop assuming the next model will fix your current quality problems. It won't, or at least not enough to matter. The gains from RAG hygiene, eval suites, and constrained output formats now exceed the gains from waiting six months for the next checkpoint. The teams shipping useful AI products in mid-2026 are the ones who treated the model as a fixed component twelve months ago and put their effort into the surrounding system. The teams still waiting for AGI to fix their hallucination problem are the ones whose Series B is getting harder to raise.
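One way to treat the model as a fixed component is to validate its output against a constrained format and retry on failure. The sketch below assumes the model is asked to reply with a JSON object; the `answer`/`confidence` schema, the function names, and the flaky stub model are all hypothetical, chosen only to make the example self-contained.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: field name -> required Python type after JSON parsing.
# Note that an integer confidence such as 1 is rejected; only floats pass.
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def parse_constrained(raw: str) -> Optional[dict]:
    """Return the parsed reply if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected):
            return None
    if not 0.0 <= data["confidence"] <= 1.0:
        return None
    return data

def call_with_validation(model: Callable[[str], str], prompt: str,
                         max_attempts: int = 3) -> Optional[dict]:
    """Call the model until a reply validates or the attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_constrained(model(prompt))
        if parsed is not None:
            return parsed
    return None

# Stub model (hypothetical): first reply is malformed, second is valid.
def make_flaky_model() -> Callable[[str], str]:
    replies = iter(["not json", '{"answer": "42", "confidence": 0.9}'])
    return lambda prompt: next(replies)

result = call_with_validation(make_flaky_model(), "What is 6 * 7?")
print(result)
```

Counting how often `parse_constrained` returns `None` per provider also gives you a cheap eval metric that does not depend on any benchmark.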

Third, and this is where Zitron is genuinely useful even if you think he's too bearish: the people selling you the next platform shift have a strong incentive to downplay the plateau. When NVIDIA, Microsoft, OpenAI, and your local AI consultancy all agree that the next model will change everything, ask what their book looks like if it doesn't. Most of them have answers. Some of them don't.

Looking ahead

The honest version of the next 18 months probably isn't "AI winter" and it isn't "AGI by Christmas." It's a slow-grinding consolidation where the frontier labs converge on similar capability profiles, pricing pressure from open models compresses margins, and the value capture moves up the stack — toward applications, tooling, evals, and the unglamorous middleware that makes models actually usable in production. That's a worse story for the trillion-dollar valuations and a better story for the developers who are tired of rewriting their integrations every six months. Whether the capex cycle can absorb that transition without something breaking is the question Zitron is really asking. The answer probably arrives in someone's Q3 earnings call.

Hacker News 619 pts 693 comments

AI Is Slowing Down

Eighth · Hacker News

Number of active users on ChatGPT is at an all-time high. Number of tokens consumed on OpenRouter is at an all-time high. I'm not seeing the plateau.

jollyllama · Hacker News

Lots of dismissive comments ITT, very few tackling the substance of the article.

> AI Cannot Afford To Slow Down — It Needs $3 Trillion Or More In Revenue By End Of 2030 To Sustain Its Existence

Is this true? With the total 2024 wages being 11.7 trillion USD [0], and nonfarm payrolls totaling 158,0

adamtaylor_13 · Hacker News

Ed is an interesting character. His financial analysis of the AI industry makes logical sense to me (though I am not knowledgeable enough to actually know if it is correct.) However, he seems to be so angry at AI in general, that he misses the obvious areas where LLMs are actually changing the State

dofm · Hacker News

Today Apple launched its revamped AI offering. Judging by several reports, Apple pays Google a mere billion dollars a year to operate it. Essentially just licensing the IP. Google are (allegedly) happy to turn over the right to operate and distill their models for only a billion a year. Consumer reve

putzdown · Hacker News

One of the "smells" that gives away a quacky ranter is they speak in impassioned, "Why doesn't everyone understand this?" tones, but in fact their argument just doesn't flow. If Zitron's argument were as solid as he keeps saying it is, you would read it and underst
