One Developer Analyzed 17,871 AI Thinking Blocks. The Results Are Damning.

5 min read 2 sources clear_take
├── "Claude Code quality degraded measurably and the data proves it — thinking depth dropped 67% and the research-first workflow collapsed"
│  ├── Stella Laurenzo (GitHub Issue #42796) → read

Laurenzo performed a quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across 6,852 session files, showing that estimated median thinking depth dropped 67% before redaction began. She also demonstrated that the read-to-edit ratio collapsed from 6.6 to 2.0, with 33.7% of edits happening without reading the target file first — a shift from research-first to edit-first behavior.

│  └── @StanAngeloff (Hacker News, 871 pts) → view

Filed the original issue titled 'Claude Code is unusable for complex engineering tasks with the Feb updates,' reporting firsthand experience of degradation on complex engineering work. The issue attracted 871 points and widespread community validation of the quality regression.

├── "Thinking block redaction masked the degradation — Anthropic made it impossible to verify model reasoning quality"
│  └── Stella Laurenzo (GitHub Issue #42796) → read

Laurenzo's timeline shows thinking was 100% visible through March 4; redaction then escalated from 24.7% on March 7 to 58.4% on March 8 and 100% by March 12. Crucially, she showed the model was already thinking less before redaction began, using a proxy metric with a 0.971 Pearson correlation to thinking content length — meaning the redaction didn't cause the problem, but it did conceal it from users.

└── "Open-sourcing or archiving Claude Code's source enables independent research and accountability"
  └── instructkr (GitHub) → read

Created a 'Claude Code Snapshot for Research' repository preserving Anthropic's original source code for independent analysis. The repository's massive engagement (173K+ stars) suggests strong community demand for transparency and the ability to independently audit AI tool behavior over time.

What happened

On April 2, 2026, Stella Laurenzo — a developer working on complex engineering tasks with Claude Code — filed [issue #42796](https://github.com/anthropics/claude-code/issues/42796) against Anthropic's Claude Code repository. What set this apart from the typical "AI is getting dumber" complaint was the evidence: a quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across 6,852 session files, spanning January 30 through April 1, 2026.

The findings paint a precise timeline of degradation. Between January 30 and March 4, 100% of thinking blocks were visible to the user. By March 7, 24.7% were redacted. By March 8 — the date independently identified as the quality cliff — 58.4% of thinking was redacted, and by March 12, it hit 100%. The issue attracted 76 comments and an 871-point Hacker News discussion before being closed as completed.

But the redaction wasn't the cause — it was the cover. Using a proxy metric with 0.971 Pearson correlation to thinking content length, Laurenzo showed that estimated median thinking depth dropped 67% before redaction even began. The model was already thinking less; Anthropic then made it impossible to verify.
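The proxy-validation step is straightforward to reproduce. A minimal sketch of the Pearson check, assuming you can extract paired series of proxy values and visible thinking lengths from sessions recorded before redaction began (the report doesn't publish its exact proxy, so the inputs here are placeholders):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

The logic of the audit is that if the proxy correlates at ~0.97 with thinking length on pre-redaction sessions, it can be trusted as an estimator on the sessions where thinking content is hidden.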

Why it matters

### The research-first workflow died

The most actionable finding isn't about thinking tokens — it's about behavioral change in the model's tool usage. During the "good" period (January 30 - February 12), Claude Code maintained a read-to-edit ratio of 6.6 — meaning it read 6.6 files for every file it edited. It would read the target file, check related files, grep for usages, inspect headers and tests, then make a precise edit.

By the degraded period (March 8-23), that ratio collapsed to 2.0, and 33.7% of all edits happened without reading the target file first. The model shifted from a research-first workflow to an edit-first workflow. For anyone who's watched a junior developer make changes without reading surrounding code, the failure mode is instantly recognizable — except this junior developer was consuming $1,504/day in API costs.
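This behavioral shift is computable from session logs alone. A minimal sketch, assuming each session reduces to a sequence of (tool name, file path) pairs; the `Read`/`Edit` names and tuple schema are stand-ins for whatever your logs actually record, not Claude Code's internal format:

```python
def read_edit_metrics(tool_calls):
    """Compute the read-to-edit ratio and the share of edits made blind
    (no prior Read of the same file earlier in the session).

    tool_calls: iterable of (tool_name, file_path) tuples.
    """
    reads, edits, blind_edits = 0, 0, 0
    seen = set()  # files read so far in this session
    for tool, path in tool_calls:
        if tool == "Read":
            reads += 1
            seen.add(path)
        elif tool == "Edit":
            edits += 1
            if path not in seen:
                blind_edits += 1  # edit without reading the target first
    ratio = reads / edits if edits else float("inf")
    blind_share = blind_edits / edits if edits else 0.0
    return ratio, blind_share
```

Run this per session and aggregate by day; the report's two headline numbers (ratio 6.6 → 2.0, blind-edit share 33.7%) are exactly these two outputs.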

The cost data is staggering. Human effort remained essentially constant: 5,608 prompts in February, 5,701 in March. But the model's API requests exploded from 1,498 to 119,341 — an 80x increase. Total input tokens went from 120 million to 20.5 billion. Estimated Bedrock cost jumped from $345 to $42,121. The model was doing more work to produce worse results, a pattern that subscription pricing completely obscures from users paying a flat $400/month.

### The laziness metrics

Laurenzo built a stop-phrase guard — a bash script that caught the model trying to abandon tasks. Before March 8, across the entire session history, it triggered zero times. After March 8, it caught 173 violations in 17 days — roughly 10 per day. The bulk of the violations fell into recognizable categories: 73 instances of ownership dodging ("not caused by my changes"), 40 of permission-seeking ("should I continue?"), 18 of premature stopping ("good stopping point"), and 14 of labeling problems as known limitations.
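Laurenzo's guard was a bash script; a rough Python equivalent is easy to sketch. The phrase lists below are illustrative, reconstructed from the category examples in the report, not her actual patterns:

```python
import re

# Category -> trigger phrases, mirroring the violation buckets in the report.
# These exact phrase lists are assumptions for illustration.
STOP_PHRASES = {
    "ownership_dodge": [r"not caused by my changes", r"pre-existing issue"],
    "permission_seek": [r"should i continue", r"would you like me to proceed"],
    "premature_stop": [r"good stopping point", r"reasonable place to stop"],
    "known_limitation": [r"known limitation"],
}

def check_stop_phrases(text):
    """Return (category, pattern) pairs for every stop-phrase found in a model turn."""
    lowered = text.lower()
    hits = []
    for category, patterns in STOP_PHRASES.items():
        for pat in patterns:
            if re.search(pat, lowered):
                hits.append((category, pat))
    return hits
```

Wired into a stop hook, a non-empty return value blocks the model from ending its turn and logs the violation for the daily tally.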

The model's own unprompted admissions after being corrected tell the story: "That was lazy and wrong. I was trying to dodge a code generator issue." "I rushed this and it shows." "I was being sloppy." These self-assessments appeared at 5x the baseline rate.

User interrupts — corrections the developer had to make — increased 12.7x, from 0.9 to 11.4 per thousand tool calls. The word "simplest" appeared in model output at 642% of the pre-regression rate, reflecting what Laurenzo called a "simplest fix mentality": the model optimized for least effort rather than correctness. The sentiment ratio in session transcripts fell 32%, from 4.4:1 positive-to-negative to 3.0:1, as the developer's vocabulary contracted from "plan, implement, test, review, commit, manage" to "try to get a single edit right."

### The invisible degradation problem

The `redact-thinking` header in Anthropic's API prevents external verification of thinking depth. Laurenzo's proxy measurements show the degradation preceded the redaction, suggesting thinking allocation was reduced as a cost optimization before the evidence was hidden. This creates an adversarial dynamic: the vendor controls both the quality and the observability of quality.

Laurenzo proposes three mitigations: expose `thinking_tokens` counts in API responses even when content is redacted, offer a "max thinking" tier for complex engineering workflows, and monitor stop-hook violation rates as a canary metric across the user base. All three are reasonable. None require revealing proprietary model details.

What this means for your stack

If you're running Claude Code (or any AI coding assistant) on production engineering tasks, this report is a template for what your monitoring should look like. The key metrics Laurenzo tracked — read-to-edit ratio, stop-phrase violations, reasoning loops per thousand tool calls, user interrupts — are all computable from session logs without any special API access.

The read-to-edit ratio is particularly useful as a leading indicator. When it dropped below 3.0, quality had already degraded significantly. Teams using AI coding tools at scale should track this metric and set alerts. A model that stops reading before editing is a model that's about to break your code.
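As a monitoring sketch, the 3.0 threshold translates into a canary check like the one below. The consecutive-day window is my own smoothing choice to avoid alerting on a single noisy session, not something from the report:

```python
READ_EDIT_ALERT_THRESHOLD = 3.0  # below this, quality had already degraded in the report

def should_alert(daily_ratios, window=3):
    """Fire when the daily read-to-edit ratio stays below the threshold
    for `window` consecutive days."""
    if len(daily_ratios) < window:
        return False
    return all(r < READ_EDIT_ALERT_THRESHOLD for r in daily_ratios[-window:])
```

Feeding this the per-day output of a read-to-edit metric gives a cheap leading indicator that runs entirely on your own session logs.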

The cost analysis also carries a warning for teams on usage-based pricing rather than subscriptions. An 80x increase in API requests with no increase in human prompts means the model's internal retry-and-fail loops are billable events. If you're paying per token, a quality regression doesn't just waste time — it actively burns budget.

The broader lesson connects to the parallel trend of developers archiving Claude Code snapshots for research. Repos like `instructkr/claude-code` exist precisely because practitioners have learned they can't assume the tool they depend on today will work the same way tomorrow. Version-pinning your AI tools — or at minimum, maintaining behavioral baselines — is no longer paranoia. It's engineering practice.

Looking ahead

The issue is closed as completed, but the structural problem remains unsolved. AI coding tools are now deep enough in professional workflows that silent quality regressions have measurable engineering cost — $42,000/month in this case, plus the human time spent fighting the tool instead of using it. Laurenzo's report may be the most rigorous public audit of an AI tool regression to date. The question is whether Anthropic (and competitors) will treat it as a one-time bug report or as the template for the quality SLAs that enterprise AI tooling will eventually require.

GitHub 178166 pts 105849 comments

instructkr/claude-code: Claude Code Snapshot for Research. All original source code is the property of Anthropic.

→ read on GitHub
Hacker News 1260 pts 699 comments

Claude Code is unusable for complex engineering tasks with the Feb updates

→ read on Hacker News
