The GitHub issue documents Stella Laurenzo's analysis of 6,852 sessions, showing a 67% drop in median thinking depth by late February and 100% redaction of thinking blocks by mid-March. The data includes a 0.971 Pearson correlation on paired samples, a read-to-edit ratio collapse from 6.6 to 2.0, and a timeline consistent with staged deployment of token cuts.
The editorial presents the regression as rigorously documented across multiple independent dimensions — thinking depth, read-to-edit ratios, reasoning loops, and blind edits to unread files. It characterizes the timeline as 'damning,' noting the staged rollout from visible thinking blocks to full redaction between January and March 2026.
The report highlights that the read-to-edit ratio dropped from 6.6 to 2.0, meaning the model went from thoroughly reading codebases before making changes to editing files it hadn't recently examined. By March, one in three edits targeted files with no recent context, producing edits that break surrounding code, violate conventions, and splice into comment blocks.
Separately, a research group published a snapshot of Claude Code's source code explicitly for research purposes, enabling independent analysis of how the tool works. This positions open inspection of proprietary AI tooling as essential for the community to verify claims about quality regressions and understand internal behavior changes.
On April 2, 2026, Stella Laurenzo — a senior engineer working on LLVM/IREE compiler infrastructure — filed what may be the most rigorously documented AI quality regression report ever written. GitHub issue #42796 on the anthropics/claude-code repository contains a full quantitative analysis of 6,852 Claude Code session files, 17,871 thinking blocks, 234,760 tool calls, and 18,000+ user prompts spanning January 30 through April 1, 2026.
The thesis is straightforward: Anthropic quietly reduced thinking token allocation starting in mid-February, then began fully redacting thinking blocks in early March, and both changes correlate precisely with measurable quality collapse in complex engineering workflows. The data backing this claim is unusually specific. Using a signature-field proxy (0.971 Pearson correlation on 7,146 paired samples), Laurenzo's analysis estimates median thinking depth dropped 67% by late February — before redaction even began — and stabilized at roughly 75% below baseline by March.
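The proxy-validation step described above — confirming that a signature-derived field tracks actual thinking token counts before using it to estimate depth — can be sketched roughly like this. The paired values here are invented for illustration; the issue's real dataset and field names are not reproduced.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Paired samples: (proxy field value, actual thinking tokens) from
# sessions where both are observable -- hypothetical numbers.
pairs = [(120, 950), (80, 600), (200, 1700), (45, 320), (150, 1250)]
r = pearson([p for p, _ in pairs], [t for _, t in pairs])
```

A correlation near 1.0 on the paired subset is what justifies using the proxy to estimate thinking depth in sessions where the token count itself is redacted.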
The timeline is damning. On January 30, 100% of thinking blocks were visible. By March 8, 58.4% were redacted. By March 12, redaction hit 100%. The rollout pattern is consistent with staged deployment, not a single switch flip.
This isn't a vibes-based complaint. The behavioral metrics tell a coherent story across multiple independent dimensions.
The read-to-edit ratio — how many files the model reads before making a change — dropped from 6.6 in the good period to 2.0 in the degraded period. In practical terms, the model went from thoroughly researching a codebase before touching it to editing files it hadn't recently read. By March, one in three edits was made to a file the model had no recent context on. The predictable result: edits that break surrounding code, violate file conventions, and splice new code into comment blocks.
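Both metrics in that paragraph — the read-to-edit ratio and the "blind edit" fraction — are straightforward to compute from a session's tool-call log. A minimal sketch, assuming a hypothetical `(tool_name, file_path)` record format and a context window of the last 20 calls:

```python
def read_to_edit_stats(tool_calls, window=20):
    """Compute the read:edit ratio and the fraction of 'blind' edits:
    edits to a file with no Read of that file in the preceding
    `window` tool calls. Record format is hypothetical."""
    reads = sum(1 for name, _ in tool_calls if name == "Read")
    edits = [(i, path) for i, (name, path) in enumerate(tool_calls)
             if name == "Edit"]
    blind = 0
    for i, path in edits:
        recent = tool_calls[max(0, i - window):i]
        if not any(n == "Read" and p == path for n, p in recent):
            blind += 1
    ratio = reads / len(edits) if edits else float("inf")
    blind_frac = blind / len(edits) if edits else 0.0
    return ratio, blind_frac

log = [("Read", "a.py"), ("Read", "b.py"), ("Edit", "a.py"),
       ("Edit", "c.py")]  # c.py is edited without a prior read
ratio, blind_frac = read_to_edit_stats(log)
```

On the numbers reported in the issue, the good period would score around 6.6 with a near-zero blind fraction, and the degraded period around 2.0 with a blind fraction near one in three.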
Reasoning loops — instances where the model generates a plan, contradicts it, revises, then contradicts the revision — increased from 8.2 per 1,000 tool calls to 26.6. Stop-hook violations (the model trying to quit mid-task with excuses like "good stopping point" or "should I continue?") went from literally zero before March 8 to 173 in the following two weeks, peaking at 43 in a single day. These behaviors were so absent in the good period that the monitoring hook for them was unnecessary.
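The two counters above reduce to a simple per-1,000-tool-calls normalization plus a phrase matcher for stop attempts. A sketch, with the phrase list drawn from the examples quoted in the issue (the actual hook's detection logic is not public):

```python
import re

# Phrases mirroring the stop-attempt examples quoted in the issue.
STOP_PHRASES = re.compile(
    r"good stopping point|should i continue", re.IGNORECASE)

def per_1k(count, tool_calls):
    """Normalize an event count to a rate per 1,000 tool calls."""
    return 1000 * count / tool_calls

def is_stop_attempt(message):
    """Flag assistant messages that try to quit mid-task."""
    return bool(STOP_PHRASES.search(message))

good = per_1k(82, 10_000)   # 8.2 reasoning loops per 1k calls
bad = per_1k(266, 10_000)   # 26.6 per 1k in the degraded period
```

Normalizing per 1,000 tool calls matters because session volume also changed between periods; raw counts alone would conflate more usage with worse behavior.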
The vocabulary analysis adds a grimly human dimension. Usage of "great" dropped 47%. "Stop" increased 87%. "Terrible" increased 140%. "Simplest" — which Laurenzo identifies as a signal the model is optimizing for least effort rather than correctness — increased 642%. The positive-to-negative word ratio in user prompts collapsed from 4.4:1 to 3.0:1. The model itself, in self-referential moments, produced statements like "That was lazy and wrong" and "I rushed this and it shows."
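The positive-to-negative ratio is a plain lexicon count over user prompts. A minimal sketch with tiny illustrative word lists (the issue's actual lexicon is larger and not reproduced here):

```python
from collections import Counter
import re

# Small illustrative word lists -- not the issue's real lexicon.
POSITIVE = {"great", "good", "perfect", "excellent", "works"}
NEGATIVE = {"stop", "terrible", "wrong", "broken", "lazy"}

def sentiment_ratio(prompts):
    """Positive-to-negative word ratio across user prompts."""
    words = Counter(
        w for p in prompts for w in re.findall(r"[a-z']+", p.lower()))
    pos = sum(words[w] for w in POSITIVE)
    neg = sum(words[w] for w in NEGATIVE)
    return pos / neg if neg else float("inf")

ratio = sentiment_ratio([
    "Great, that works perfectly.",
    "Stop. This is wrong and the tests are broken.",
])
```

The same counting over thousands of prompts is what produces the 4.4:1 versus 3.0:1 comparison: the users' own language degrades along with the model's output.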
Perhaps the most striking finding: estimated daily API costs exploded from $12 in February to $1,504 in March — a 122x increase — while producing demonstrably worse output. The human put in roughly the same effort (similar prompt counts). But the model consumed 80x more API requests and 64x more output tokens. The failure mode wasn't one broken session; it was 10+ concurrent sessions all degrading simultaneously, each requiring human intervention that the multi-agent workflow was designed to eliminate.
This inverts the economic argument for AI coding assistants. If reduced thinking tokens save Anthropic compute costs but multiply the total tokens consumed through retry loops, corrections, and rework, the net compute may actually increase. The user pays more (in time and subscription fees), Anthropic serves more tokens, and the output is worse. Nobody wins.
A parallel signal emerged alongside this issue: a Korean research group published a full source snapshot of Claude Code's internals on GitHub (instructkr/claude-code), which accumulated massive engagement. The existence of a reverse-engineering effort this popular speaks to a broader trust deficit — when users can't see thinking tokens and can't verify reasoning depth, they start decompiling the tool itself.
Laurenzo's proposed solutions are worth noting for their restraint. She doesn't demand unlimited thinking. She asks for three things: transparency about thinking allocation changes, a "max thinking" tier for complex engineering workflows, and thinking token counts in API responses even when content is redacted. The third request is particularly reasonable — it would let power users monitor reasoning depth without exposing proprietary chain-of-thought content.
The time-of-day analysis adds an infrastructure wrinkle. Pre-redaction, thinking depth was flat across the day (±10%). Post-redaction, it varies by up to 17.7%, with the worst performance during peak US work hours (5-7pm PST). This pattern suggests the constraint may be GPU-level resource allocation rather than a deliberate per-user policy — which, if true, means Anthropic is capacity-constrained and rationing thinking tokens dynamically.
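The time-of-day breakdown amounts to bucketing thinking-depth samples by hour and taking a median per bucket. A sketch under the assumption that each sample is an `(ISO timestamp, depth)` pair — the issue's actual record format may differ:

```python
from collections import defaultdict
from datetime import datetime, timezone

def depth_by_hour(samples):
    """Median thinking-depth proxy per UTC hour.
    `samples` is a list of (iso_timestamp, depth) pairs."""
    buckets = defaultdict(list)
    for ts, depth in samples:
        hour = datetime.fromisoformat(ts).astimezone(timezone.utc).hour
        buckets[hour].append(depth)
    # Simple median: middle element of the sorted bucket.
    return {h: sorted(v)[len(v) // 2] for h, v in sorted(buckets.items())}

medians = depth_by_hour([
    ("2026-03-20T01:15:00+00:00", 900),
    ("2026-03-20T01:40:00+00:00", 880),
    ("2026-03-21T01:05:00+00:00", 910),
])
```

A flat curve across hours suggests a fixed policy; a dip tracking peak load, as reported post-redaction, points toward dynamic capacity rationing.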
If you're running Claude Code for complex, multi-file engineering tasks — particularly compiler work, systems programming, or anything requiring deep codebase understanding — the data suggests you should monitor your own read-to-edit ratios and correction frequency as canary metrics for model quality. A sudden increase in the model editing files without reading them first, or asking to stop mid-task, is a measurable signal that thinking depth has been reduced.
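Those canary metrics can be wired into a simple threshold check. The thresholds below are illustrative assumptions loosely anchored to the good-period baselines reported in the issue (a 6.6 read-to-edit ratio, near-zero stop attempts), not recommended values:

```python
def quality_canary(ratio, stop_rate,
                   ratio_floor=4.0, stop_ceiling=1.0):
    """Return alert strings when the read-to-edit ratio falls below a
    floor or the stop-attempt rate (per 1k tool calls) exceeds a
    ceiling. Thresholds are illustrative, not recommendations."""
    alerts = []
    if ratio < ratio_floor:
        alerts.append(
            f"read-to-edit ratio {ratio:.1f} below floor {ratio_floor}")
    if stop_rate > stop_ceiling:
        alerts.append(
            f"stop-attempt rate {stop_rate:.1f}/1k above {stop_ceiling}")
    return alerts

alerts = quality_canary(ratio=2.0, stop_rate=3.5)  # both thresholds tripped
```

Tracking these against your own baseline, rather than absolute numbers, is what makes them useful: the signal is the sudden shift, not the level.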
For teams evaluating AI coding tools for senior engineering work, this report raises a structural question: can you build reliable workflows on top of a model whose reasoning depth is silently variable? Laurenzo's multi-agent setup was designed to reduce human intervention. When thinking tokens were cut, it became a human-intervention multiplier instead. The architecture was sound; the foundation shifted underneath it.
The broader lesson applies beyond Anthropic. Any AI provider that controls reasoning depth behind an opaque API can silently degrade complex workflows. Users building production systems on top of AI coding assistants need contractual or technical guarantees about reasoning depth, not just model version pinning. Version pinning means nothing if the same model version gets different thinking budgets on different days.
This issue will likely become a reference document for how to quantitatively evaluate AI coding tool regressions. The methodology — correlating thinking token proxies with behavioral metrics across thousands of sessions — is reproducible by any team with sufficient logging. Whether Anthropic responds with transparency, a power-user tier, or continued silence will signal how seriously the company takes its most demanding users. The engineers filing these issues aren't casual users complaining about chatbot quality. They're the exact audience that validates whether AI coding tools can handle real work. Losing their trust is expensive in ways that don't show up in token economics.