- Laurenzo conducted a forensic analysis of 6,852 sessions with 234,760 tool calls, finding that median thinking depth dropped 73% (from ~2,200 to ~600 characters) after March 8. Her stop-phrase-guard hook detected zero behavioral violations before the change and 173 violations in the 17 days after, with the degradation mapping precisely to the thinking content redaction rollout.
- She filed the Hacker News submission characterizing Claude Code as "unusable for complex engineering tasks" after the February updates. The post resonated strongly with the developer community, accumulating 1,185 points, suggesting widespread agreement with the regression claim.
- Because Anthropic redacted thinking content starting in February, Laurenzo had to develop a proxy metric: a "signature field" in API responses that correlates with pre-redaction thinking token counts (0.971 Pearson coefficient). That an engineer had to reverse-engineer a proxy to measure thinking depth highlights how redaction obscures quality regressions from users.
- The editorial highlights that thinking depth started collapsing before redaction was visible to users: after the January 30–February 8 baseline period, but while thinking content was still fully exposed. This timeline suggests the token reduction was a deliberate infrastructure change rather than an accidental regression, raising questions about whether Anthropic traded capability for cost efficiency without disclosure.
- The instructkr/claude-code repository published a research snapshot of Claude Code's original source code, enabling independent analysis of how the tool works and how changes to thinking token allocation might manifest in behavior. The repository's massive engagement (175K+ stars) signals strong community demand for transparency into proprietary AI tooling.
On April 2, 2026, Stella Laurenzo — an engineer working on IREE compiler infrastructure — filed [issue #42796](https://github.com/anthropics/claude-code/issues/42796) against Anthropic's Claude Code repository. It is not a typical bug report. It is a 6,852-session forensic autopsy of how Claude Code's behavior changed between January and April 2026, backed by 234,760 analyzed tool calls, 17,871 thinking blocks, and 18,000+ user prompts across four active projects.
The core claim: Claude Code became structurally incapable of complex engineering work after Anthropic reduced extended thinking token allocation, and the timeline of degradation maps precisely to the rollout of thinking content redaction starting March 8. The issue hit 1,185 points on Hacker News within days. As of publication, it has 97 comments on GitHub.
The analysis isn't vibes. Laurenzo built a programmatic `stop-phrase-guard.sh` hook that catches undesirable model behaviors in real-time — things like ownership dodging ("not caused by my changes"), permission-seeking ("should I continue?"), and premature stopping ("good stopping point"). Before March 8: zero violations across the entire usage history. After March 8: 173 violations in 17 days, peaking at 10 per day.
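Laurenzo's hook is a bash script; the report doesn't publish its pattern list beyond the three quoted phrases. A minimal Python sketch of the same idea, seeded with only those three (anything you add beyond them is your own guess about what dodging looks like in your sessions):

```python
import re

# The three violation classes described in the report, each seeded with
# the phrase quoted in the issue. Extend the lists for your own sessions.
VIOLATION_PATTERNS = {
    "ownership_dodge": [r"not caused by my changes"],
    "permission_seek": [r"should i continue\?"],
    "premature_stop":  [r"good stopping point"],
}

def scan_output(text: str) -> list[tuple[str, str]]:
    """Return (category, pattern) pairs for every violation found in text."""
    lowered = text.lower()
    return [
        (category, pattern)
        for category, patterns in VIOLATION_PATTERNS.items()
        for pattern in patterns
        if re.search(pattern, lowered)
    ]
```

Run it over each model turn and log hits per day; a violation rate climbing from zero is exactly the early-warning signal the report's before/after split demonstrates.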
### The thinking depth collapse
The report's most technically interesting finding uses a clever proxy. Since Anthropic redacted thinking content starting in February, Laurenzo couldn't directly measure thinking depth after the cutover. Instead, she correlated a "signature field" in API responses with thinking token counts from the pre-redaction period, establishing a 0.971 Pearson correlation coefficient. Using this proxy, she estimates median thinking depth dropped from ~2,200 characters to ~600 characters — a 73% reduction.
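The calibration code isn't published, but the statistical machinery is standard. A sketch of how such a proxy could be built, assuming you have paired observations of a numeric signature value and the visible thinking length from the pre-redaction window:

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient between two series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def fit_proxy(sig: list[float], thinking_chars: list[float]):
    """Least-squares line mapping signature values to estimated thinking
    characters, usable after redaction when only the signature survives."""
    mx, my = mean(sig), mean(thinking_chars)
    slope = (sum((x - mx) * (y - my) for x, y in zip(sig, thinking_chars))
             / sum((x - mx) ** 2 for x in sig))
    intercept = my - slope * mx
    return lambda s: slope * s + intercept
```

The 0.971 Pearson coefficient on the calibration window is what justifies trusting the fitted line; recomputing it on held-out pre-redaction sessions is the obvious check before trusting post-redaction estimates.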
Critically, the thinking depth started collapsing before redaction was even visible to users. During the January 30–February 8 baseline, estimated median thinking was ~2,200 characters. By late February — while thinking content was still fully visible — it had already dropped to ~720 characters. The redaction rollout merely hid a regression that was already underway.
### The behavioral fingerprint
What makes this report devastating is not any single metric but the convergent evidence across dozens of independent measurements. The model's Read:Edit ratio — files read per file edited — dropped from 6.6 to 2.0. In plain terms: during the "good" period, Claude read nearly 7 files for every file it modified. During the degraded period, it read 2. One-third of all edits in the degraded period were made to files the model hadn't opened — up from 6.2% in the baseline.
The reasoning quality indicators tell the same story from a different angle. Visible self-corrections ("oh wait," "actually," "let me reconsider") rose from 8.2 to 26.6 per thousand tool calls. The model's use of the word "simplest" — a proxy for reaching for the cheapest solution rather than the correct one — increased from 2.7 to 6.3 per thousand tool calls. In one captured instance, Claude later acknowledged its own behavior: "You're right. That was lazy and wrong. I was trying to dodge a code generator issue instead of fixing it."
User sentiment metrics collapsed in parallel. The word "great" dropped 47%. "Thanks" dropped 55%. Meanwhile, "stop" rose 87%, "lazy" rose 93%, and "simplest" — used sarcastically by the frustrated user — rose 642%. The positive-to-negative sentiment ratio fell from 4.4:1 to 3.0:1.
### The cost explosion
Perhaps the most alarming number: estimated Bedrock costs went from $345 in February to $42,121 in March — a 122x increase — while the human put in essentially identical effort (5,608 vs 5,701 prompts). The model compensated for its reduced reasoning capacity by making dramatically more API requests (1,498 → 119,341, an 80x increase) and consuming 170x more input tokens through repetitive context re-reads. It traded thinking for thrashing.
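The multipliers follow directly from the raw figures in the report; a quick sanity check (numbers as reported, variable names mine):

```python
# Reported per-month figures: estimated Bedrock cost, API requests, prompts.
feb = {"cost_usd": 345, "requests": 1_498, "prompts": 5_608}
mar = {"cost_usd": 42_121, "requests": 119_341, "prompts": 5_701}

cost_multiple = mar["cost_usd"] / feb["cost_usd"]      # ~122x
request_multiple = mar["requests"] / feb["requests"]   # ~80x
prompt_ratio = mar["prompts"] / feb["prompts"]         # ~1.02: human effort flat
```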
This cost pattern is the inverse of what users expect from a "more efficient" model. Anthropic may have reduced per-request thinking costs, but the downstream effect was massively increased request volume as the model failed and retried.
Here's what makes the timing surreal: while power users are documenting severe quality regressions, the Claude Code ecosystem is expanding rapidly. Repositories like [instructkr/claude-code](https://github.com/instructkr/claude-code), a research snapshot of the Claude Code source with 175K+ stars, continue trending on GitHub. New Claude Code skills are proliferating, like [nothing-design-skill](https://github.com/dominikmartn/nothing-design-skill), which generates UI in Nothing's monochrome industrial design language.
This isn't contradictory — it's a pattern we've seen before. The developer ecosystem around a tool often grows fastest right when trust in the core product is most fragile, because ecosystem growth is driven by adoption breadth while quality complaints come from depth. Casual users trying Claude Code for the first time and building custom skills don't hit the regression walls that a compiler engineer running multi-hour sessions across four interconnected projects does. The tool's surface is expanding while its floor is dropping.
If you're using Claude Code for complex, multi-file engineering work — the kind where context accumulation and careful reading-before-writing matter — the data here is clear enough to act on.
First, instrument your own sessions. Laurenzo's stop-phrase guard approach is replicable: a bash hook that greps model output for ownership-dodging and premature-stopping patterns. If your violation rate starts climbing, you have an early warning system that doesn't depend on Anthropic's transparency about model changes.
Second, monitor your Read:Edit ratio. This is the single most actionable metric in the report. A healthy Claude Code session should read significantly more than it writes. If you see the ratio approaching 2:1 or lower, the model is editing blind. You can approximate this from session logs without any special tooling.
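If your logs record tool calls as ordered (tool name, file path) events, both the ratio and the blind-edit fraction fall out of a single pass. A sketch; the tuple schema is an assumption about your own logging, not Claude Code's log format:

```python
def read_edit_metrics(tool_calls):
    """tool_calls: iterable of (tool, path) pairs in session order.
    Returns (read_to_edit_ratio, blind_edit_fraction)."""
    seen_reads: set[str] = set()
    reads = edits = blind = 0
    for tool, path in tool_calls:
        if tool == "Read":
            reads += 1
            seen_reads.add(path)
        elif tool == "Edit":
            edits += 1
            if path not in seen_reads:
                blind += 1  # edit to a file the model never opened
    if edits == 0:
        return float("inf"), 0.0
    return reads / edits, blind / edits
```

Per the report's baseline, a healthy session sits well above 2:1 on the ratio, and a blind-edit fraction climbing past the ~6% baseline toward one-third is the louder alarm.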
Third, watch your token costs relative to output quality. A 122x cost increase for equivalent human effort is not a billing anomaly — it's a behavioral signature of a model that's lost the ability to plan before acting. If your Claude Code bills are climbing while your satisfaction is dropping, you're likely hitting the same regression.
The report's recommendation for Anthropic is worth amplifying: expose `thinking_tokens` in API usage responses. Without this, users cannot distinguish between "the model thought carefully and reached a good answer" and "the model barely thought and got lucky." Opacity about reasoning depth is, as this report demonstrates, opacity about product quality.
The issue is now closed as "completed," though the nature of the resolution isn't clear from the public record. What is clear is that Laurenzo has established a benchmark methodology that any power user can replicate. The 0.971 correlation between signature fields and thinking depth means the community now has a proxy metric that works even when Anthropic redacts thinking content. If the regression persists — or recurs — it won't take 6,852 sessions and a viral GitHub issue to detect it next time. The guardrails are open-source, the methodology is published, and 97 commenters are watching.