- Laurenzo conducted a forensic analysis of 6,852 sessions with 234,760 tool calls, finding that median thinking depth dropped 73% (from ~2,200 to ~600 characters) after March 8. Her stop-phrase-guard hook detected zero behavioral violations before the change and 173 violations in the 17 days after, with the degradation mapping precisely to the thinking content redaction rollout.
- She filed the Hacker News submission characterizing Claude Code as "unusable for complex engineering tasks" after the February updates. The post resonated strongly with the developer community, accumulating 1,185 points, suggesting widespread agreement with the regression claim.
- Because Anthropic redacted thinking content starting in February, Laurenzo had to develop a proxy metric: a "signature field" in API responses that correlates with pre-redaction thinking token counts (0.971 Pearson coefficient). That an engineer had to reverse-engineer a proxy to measure thinking depth highlights how redaction obscures quality regressions from users.
- The editorial highlights that thinking depth started collapsing before redaction was visible to users: after the January 30–February 8 baseline period, but while thinking content was still fully exposed. This timeline suggests the token reduction was a deliberate infrastructure change rather than an accidental regression, raising questions about whether Anthropic traded capability for cost efficiency without disclosure.
- The instructkr/claude-code repository published a research snapshot of Claude Code's original source code, enabling independent analysis of how the tool works and how changes to thinking token allocation might manifest in behavior. The repository's massive engagement (175K+ stars) signals strong community demand for transparency into proprietary AI tooling.
On April 2, 2026, Stella Laurenzo — an engineer working on IREE compiler infrastructure — filed [issue #42796](https://github.com/anthropics/claude-code/issues/42796) against Anthropic's Claude Code repository. It is not a typical bug report. It is a 6,852-session forensic autopsy of how Claude Code's behavior changed between January and April 2026, backed by 234,760 analyzed tool calls, 17,871 thinking blocks, and 18,000+ user prompts across four active projects.
The core claim: Claude Code became structurally incapable of complex engineering work after Anthropic reduced extended thinking token allocation, and the timeline of degradation maps precisely to the rollout of thinking content redaction starting March 8. The issue hit 1,185 points on Hacker News within days. As of publication, it has 97 comments on GitHub.
The analysis isn't vibes. Laurenzo built a programmatic `stop-phrase-guard.sh` hook that catches undesirable model behaviors in real-time — things like ownership dodging ("not caused by my changes"), permission-seeking ("should I continue?"), and premature stopping ("good stopping point"). Before March 8: zero violations across the entire usage history. After March 8: 173 violations in 17 days, peaking at 10 per day.
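Laurenzo's hook is a bash script; the report doesn't publish its pattern list beyond the three quoted phrases. A minimal Python sketch of the same idea, seeded with only those three (anything you add beyond them is your own guess about what dodging looks like in your sessions):

```python
import re

# The three violation classes described in the report, each seeded with
# the phrase quoted in the issue. Extend the lists for your own sessions.
VIOLATION_PATTERNS = {
    "ownership_dodge": [r"not caused by my changes"],
    "permission_seek": [r"should i continue\?"],
    "premature_stop":  [r"good stopping point"],
}

def scan_output(text: str) -> list[tuple[str, str]]:
    """Return (category, pattern) pairs for every violation found in text."""
    lowered = text.lower()
    return [
        (category, pattern)
        for category, patterns in VIOLATION_PATTERNS.items()
        for pattern in patterns
        if re.search(pattern, lowered)
    ]
```

Run it over each model turn and log hits per day; a violation rate climbing from zero is exactly the early-warning signal the report's before/after split demonstrates.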
### The thinking depth collapse
The report's most technically interesting finding uses a clever proxy. Since Anthropic redacted thinking content starting in February, Laurenzo couldn't directly measure thinking depth after the cutover. Instead, she correlated a "signature field" in API responses with thinking token counts from the pre-redaction period, establishing a 0.971 Pearson correlation coefficient. Using this proxy, she estimates median thinking depth dropped from ~2,200 characters to ~600 characters — a 73% reduction.
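The calibration code isn't published, but the statistical machinery is standard. A sketch of how such a proxy could be built, assuming you have paired observations of a numeric signature value and the visible thinking length from the pre-redaction window:

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient between two series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def fit_proxy(sig: list[float], thinking_chars: list[float]):
    """Least-squares line mapping signature values to estimated thinking
    characters, usable after redaction when only the signature survives."""
    mx, my = mean(sig), mean(thinking_chars)
    slope = (sum((x - mx) * (y - my) for x, y in zip(sig, thinking_chars))
             / sum((x - mx) ** 2 for x in sig))
    intercept = my - slope * mx
    return lambda s: slope * s + intercept
```

The 0.971 Pearson coefficient on the calibration window is what justifies trusting the fitted line; recomputing it on held-out pre-redaction sessions is the obvious check before trusting post-redaction estimates.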
Critically, the thinking depth started collapsing before redaction was even visible to users. During the January 30–February 8 baseline, estimated median thinking was ~2,200 characters. By late February — while thinking content was still fully visible — it had already dropped to ~720 characters. The redaction rollout merely hid a regression that was already underway.
### The behavioral fingerprint
What makes this report devastating is not any single metric but the convergent evidence across dozens of independent measurements. The model's Read:Edit ratio — files read per file edited — dropped from 6.6 to 2.0. In plain terms: during the "good" period, Claude read nearly 7 files for every file it modified. During the degraded period, it read 2. One-third of all edits in the degraded period were made to files the model hadn't opened — up from 6.2% in the baseline.
The reasoning quality indicators tell the same story from a different angle. Visible self-corrections ("oh wait," "actually," "let me reconsider") rose from 8.2 to 26.6 per thousand tool calls. The model's use of the word "simplest" — a proxy for reaching for the cheapest solution rather than the correct one — increased from 2.7 to 6.3 per thousand tool calls. In one captured instance, Claude later acknowledged its own behavior: "You're right. That was lazy and wrong. I was trying to dodge a code generator issue instead of fixing it."
User sentiment metrics collapsed in parallel. The word "great" dropped 47%. "Thanks" dropped 55%. Meanwhile, "stop" rose 87%, "lazy" rose 93%, and "simplest" — used sarcastically by the frustrated user — rose 642%. The positive-to-negative sentiment ratio fell from 4.4:1 to 3.0:1.
### The cost explosion
Perhaps the most alarming number: estimated Bedrock costs went from $345 in February to $42,121 in March — a 122x increase — while the human put in essentially identical effort (5,608 vs 5,701 prompts). The model compensated for its reduced reasoning capacity by making dramatically more API requests (1,498 → 119,341, an 80x increase) and consuming 170x more input tokens through repetitive context re-reads. It traded thinking for thrashing.
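The multipliers follow directly from the raw figures in the report; a quick sanity check (numbers as reported, variable names mine):

```python
# Reported per-month figures: estimated Bedrock cost, API requests, prompts.
feb = {"cost_usd": 345, "requests": 1_498, "prompts": 5_608}
mar = {"cost_usd": 42_121, "requests": 119_341, "prompts": 5_701}

cost_multiple = mar["cost_usd"] / feb["cost_usd"]      # ~122x
request_multiple = mar["requests"] / feb["requests"]   # ~80x
prompt_ratio = mar["prompts"] / feb["prompts"]         # ~1.02: human effort flat
```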
This cost pattern is the inverse of what users expect from a "more efficient" model. Anthropic may have reduced per-request thinking costs, but the downstream effect was massively increased request volume as the model failed and retried.
Here's what makes the timing surreal: while power users are documenting severe quality regressions, the Claude Code ecosystem is expanding rapidly. Repositories like [instructkr/claude-code](https://github.com/instructkr/claude-code), a research snapshot of the Claude Code source with 175K+ stars, continue trending on GitHub. New Claude Code skills are proliferating, like [nothing-design-skill](https://github.com/dominikmartn/nothing-design-skill), which generates UI in Nothing's monochrome industrial design language.
This isn't contradictory — it's a pattern we've seen before. The developer ecosystem around a tool often grows fastest right when trust in the core product is most fragile, because ecosystem growth is driven by adoption breadth while quality complaints come from depth. Casual users trying Claude Code for the first time and building custom skills don't hit the regression walls that a compiler engineer running multi-hour sessions across four interconnected projects does. The tool's surface is expanding while its floor is dropping.
If you're using Claude Code for complex, multi-file engineering work — the kind where context accumulation and careful reading-before-writing matter — the data here is clear enough to act on.
First, instrument your own sessions. Laurenzo's stop-phrase guard approach is replicable: a bash hook that greps model output for ownership-dodging and premature-stopping patterns. If your violation rate starts climbing, you have an early warning system that doesn't depend on Anthropic's transparency about model changes.
Second, monitor your Read:Edit ratio. This is the single most actionable metric in the report. A healthy Claude Code session should read significantly more than it writes. If you see the ratio approaching 2:1 or lower, the model is editing blind. You can approximate this from session logs without any special tooling.
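If your logs record tool calls as ordered (tool name, file path) events, both the ratio and the blind-edit fraction fall out of a single pass. A sketch; the tuple schema is an assumption about your own logging, not Claude Code's log format:

```python
def read_edit_metrics(tool_calls):
    """tool_calls: iterable of (tool, path) pairs in session order.
    Returns (read_to_edit_ratio, blind_edit_fraction)."""
    seen_reads: set[str] = set()
    reads = edits = blind = 0
    for tool, path in tool_calls:
        if tool == "Read":
            reads += 1
            seen_reads.add(path)
        elif tool == "Edit":
            edits += 1
            if path not in seen_reads:
                blind += 1  # edit to a file the model never opened
    if edits == 0:
        return float("inf"), 0.0
    return reads / edits, blind / edits
```

Per the report's baseline, a healthy session sits well above 2:1 on the ratio, and a blind-edit fraction climbing past the ~6% baseline toward one-third is the louder alarm.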
Third, watch your token costs relative to output quality. A 122x cost increase for equivalent human effort is not a billing anomaly — it's a behavioral signature of a model that's lost the ability to plan before acting. If your Claude Code bills are climbing while your satisfaction is dropping, you're likely hitting the same regression.
The report's recommendation for Anthropic is worth amplifying: expose `thinking_tokens` in API usage responses. Without this, users cannot distinguish between "the model thought carefully and reached a good answer" and "the model barely thought and got lucky." Opacity about reasoning depth is, as this report demonstrates, opacity about product quality.
The issue is now closed as "completed," though the nature of the resolution isn't clear from the public record. What is clear is that Laurenzo has established a benchmark methodology that any power user can replicate. The 0.971 correlation between signature fields and thinking depth means the community now has a proxy metric that works even when Anthropic redacts thinking content. If the regression persists — or recurs — it won't take 6,852 sessions and a viral GitHub issue to detect it next time. The guardrails are open-source, the methodology is published, and 97 commenters are watching.