Anthropic Quietly Cut Cache TTL by 12×. Developers Got the Bill.

5 min read · 1 source · clear_take
├── "Anthropic made an undisclosed change that materially increased costs, and the data proves it"
│  ├── cnighswonger (GitHub Issues) → read

Analyzed 119,866 API calls across two machines over four months, showing a clean transition from 100% 1-hour TTL to 93% 5-minute TTL starting March 6th. Computed $949–$1,582 in overpayment per model (17.1% waste) using Anthropic's own published rates, with no announcement, changelog, or documentation update accompanying the change.

│  └── @lsdmtme (Hacker News, 498 pts)

Surfaced the GitHub issue to the broader developer community, framing the change as a silent downgrade from 1-hour to 5-minute cache TTL. The post resonated widely, accumulating 498 points and 386 comments, indicating broad concern about the undisclosed nature of the pricing shift.

├── "The 5-minute TTL disproportionately punishes normal developer workflows"
│  └── cnighswonger (GitHub Issues) → read

Argues that real development involves natural pauses — reviewing PRs, reading docs, grabbing coffee — that easily exceed 5 minutes but rarely exceed 1 hour. The 12.5× cost multiplier between cache reads and writes means every routine break triggers a full context re-upload, turning normal work patterns into a billing penalty.

├── "The impact extends beyond cost to quota-limited subscription users"
│  └── cnighswonger (GitHub Issues) → read

Points out that Pro and subscription users are quota-limited, not just cost-billed, meaning the shorter TTL doesn't just inflate bills — it burns through usage quotas faster. This makes the change doubly harmful for users who chose subscriptions specifically to cap their spending.

└── "The lack of transparency is as damaging as the cost increase itself"
  └── top10.dev editorial (top10.dev) → read below

The editorial emphasizes that the transition happened with 'no announcement, no changelog entry, and no documentation update' over 16 days. This framing positions the silence as a trust violation separate from the financial impact — developers building production systems on Anthropic's APIs need to be able to rely on documented behavior.

What Happened

On March 6, 2026, Anthropic changed how Claude Code handles prompt caching — and told nobody. A developer who goes by cnighswonger noticed something odd in their billing: after 33 consecutive days of clean 1-hour cache TTL behavior across two independent machines, 5-minute TTL tokens suddenly reappeared. Within two days, 5-minute writes accounted for 83% of traffic. By March 21st, that number hit 93%.

The evidence came from parsing raw session files (`~/.claude/projects/**/*.jsonl`) — 119,866 API calls spanning January through April 2026. The data tells an unambiguous story: February saw zero 5-minute cache creation tokens. March 5th was the last clean 1-hour-only day. March 6th, the first 5-minute tokens appeared. March 8th, they were dominant.

The transition from 100% 1-hour TTL to 93% 5-minute TTL happened in exactly 16 days, with no announcement, no changelog entry, and no documentation update.

Why It Matters

The economics here are straightforward and painful. Cache writes cost $3.75–$6.25 per million tokens depending on the model. Cache reads cost $0.30–$0.50 per million tokens. That's a 12.5× cost multiplier every time your cache expires and context has to be re-uploaded.
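That multiplier is just arithmetic. A minimal sketch using the Sonnet-tier rates quoted above; the 150k-token context size is a hypothetical example:

```python
# Dollar cost of resuming a session with a warm vs. expired cache, using the
# Sonnet-tier rates quoted in this article ($3.75/MTok cache write,
# $0.30/MTok cache read). The 150k-token context is a hypothetical example.
WRITE_RATE = 3.75  # $ per million tokens, cache write
READ_RATE = 0.30   # $ per million tokens, cache read

def resume_cost(context_tokens: int, cache_warm: bool) -> float:
    """Cost of resuming: a cheap read if the cache is warm, a full
    re-upload at write rates if it has expired."""
    rate = READ_RATE if cache_warm else WRITE_RATE
    return context_tokens / 1_000_000 * rate

warm = resume_cost(150_000, cache_warm=True)   # cache alive: cheap read
cold = resume_cost(150_000, cache_warm=False)  # cache expired: full re-upload

print(f"${warm:.4f} vs ${cold:.4f}: {cold / warm:.1f}x")  # $0.0450 vs $0.5625: 12.5x
```

The 12.5× figure is simply the ratio of the two rates, so it holds at any context size.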

With a 1-hour TTL, a developer who pauses for 15 minutes to review a PR, grab coffee, or read documentation comes back to a warm cache. With a 5-minute TTL, that same pause means a full context re-upload at write rates. For a Claude Sonnet 4-6 user making 68,264 API calls in March, the difference was $719.09 — a 25.9% cost premium over what February's pricing model would have produced.

The total damage across four months and two models:

- Claude Sonnet 4-6: $949.08 overpaid (17.1% waste)
- Claude Opus 4-6: $1,581.80 overpaid (17.1% waste)

These aren't theoretical numbers. They're computed from actual API responses using Anthropic's own published rates.
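The methodology behind those figures, as the issue author describes it, can be sketched in a few lines. The assumption baked in here is that every 5-minute write would have been a cheap read under the old behavior; the token volume below is hypothetical:

```python
# Sketch of the overpayment methodology described in the issue: assume every
# 5-minute cache write would have been a cheap cache read under the old
# 1-hour behavior, and price the difference at the published Sonnet-tier
# rates. (The article notes below that this assumption overstates the damage.)
WRITE_RATE = 3.75  # $/MTok, 5-minute cache write
READ_RATE = 0.30   # $/MTok, cache read

def overpayment(five_minute_write_tokens: int) -> float:
    """Dollars overpaid under the all-writes-would-have-been-reads assumption."""
    return five_minute_write_tokens / 1_000_000 * (WRITE_RATE - READ_RATE)

# Hypothetical volume: 275M tokens of 5m writes works out to roughly $949.
print(round(overpayment(275_000_000), 2))
```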

But the cost story isn't even the worst part. Pro and subscription users are quota-limited, not just cost-limited — and cache creation tokens count at full rate. Users started hitting 5-hour quota exhaustion for the first time in March. No explanation was offered. The quota walls just appeared.

Anthropic's Defense

Jarred Sumner from Anthropic responded to the issue with a counterargument worth taking seriously: the change was intentional optimization, not a regression. The core claim is that Claude Code now selects TTL tier per-request based on expected cache-reuse patterns. For one-shot requests where context is used once and never revisited, a 1-hour TTL is actually more expensive — 1-hour writes cost roughly 2× base input, versus 1.25× for 5-minute writes.

In other words, Anthropic's position is that the old behavior (1-hour TTL for everything) was a blunt instrument, and the new behavior (per-request TTL selection) is smarter routing that reduces aggregate cost.
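Both sides of that argument reduce to simple arithmetic. A sketch in multiples of the base input-token rate, using the approximate multipliers above (5m write ≈ 1.25×, 1h write ≈ 2×, cache read ≈ 0.1×); the pause scenarios are hypothetical:

```python
# Both sides of the TTL argument, in multiples of the base input-token rate:
# 5m write ~= 1.25x, 1h write ~= 2x, cache read ~= 0.1x (consistent with the
# dollar rates quoted earlier). A sketch with hypothetical pause scenarios.
WRITE_5M, WRITE_1H, READ = 1.25, 2.0, 0.1

def relative_cost(ttl: str, pause_minutes: float) -> float:
    """Write context once, pause, then resume: total cost per cached token,
    as a multiple of the base input rate."""
    write = WRITE_5M if ttl == "5m" else WRITE_1H
    window = 5 if ttl == "5m" else 60  # minutes before the cache expires
    if pause_minutes <= window:
        return write + READ   # cache still warm: cheap read on resume
    return write + write      # cache expired: pay the write rate again

# One-shot request (no resume): the 5m tier is strictly cheaper (1.25x vs
# 2.0x), which is Anthropic's point. A 15-minute coffee break: the 1h tier
# wins, which is the issue author's point.
print(relative_cost("5m", 15))  # 2.5
print(relative_cost("1h", 15))  # 2.1
```

The crossover is exactly the workflow dispute: pauses under 5 minutes or truly one-shot contexts favor the new behavior, anything between 5 minutes and an hour favors the old one.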

There's a reasonable case that Anthropic is right about the economics in aggregate — and completely wrong about how they handled the rollout.

The issue author's cost tables do assume that all 5-minute writes would have been cheap cache reads under 1-hour TTL, which overstates the damage. Some of those writes genuinely were one-shot contexts. But even if you discount the numbers by 30-40%, you're still looking at hundreds of dollars in unexpected costs for a single moderate-usage developer.

Anthropic also acknowledged a real bug: v2.1.90 fixed a client-side issue where sessions that had exhausted quota could get stuck on 5-minute TTL indefinitely. That's a genuine fix. But it also means that for weeks, quota-exhausted sessions were being penalized by a bug that compounded the already-painful TTL change.

What This Means for Your Stack

If you're building on Claude's API, you need to audit your cache behavior now. The data extraction methodology is straightforward — parse your session JSONL files and look for `usage.cache_creation.ephemeral_5m_input_tokens` versus `ephemeral_1h_input_tokens` fields. The tool the issue author built (`cnighswonger/claude-code-cache-fix`) automates this analysis.
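A minimal sketch of that audit, assuming the field names reported in the issue; the exact JSONL nesting (top-level `usage` vs. `message.usage`) may differ across Claude Code versions, so verify against your own files:

```python
# Sketch of the cache-tier audit described above: walk Claude Code session
# logs and tally 5-minute vs. 1-hour cache-creation tokens. Field names and
# the JSONL layout follow the issue's description and are assumptions, not a
# documented schema.
import glob
import json
import os

FIELDS = ("ephemeral_5m_input_tokens", "ephemeral_1h_input_tokens")

def tally_cache_tiers(root: str = "~/.claude/projects") -> dict:
    totals = dict.fromkeys(FIELDS, 0)
    pattern = os.path.join(os.path.expanduser(root), "**", "*.jsonl")
    for path in glob.glob(pattern, recursive=True):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip malformed or partial lines
                # Tolerate either nesting for the usage object.
                usage = record.get("usage") or record.get("message", {}).get("usage", {})
                creation = usage.get("cache_creation", {})
                for field in FIELDS:
                    totals[field] += creation.get(field, 0) or 0
    return totals

print(tally_cache_tiers())
```

A high 5-minute share relative to your February baseline is the signal the issue author's data showed.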

Practical mitigations until Anthropic offers a configurable TTL:

- Keep sessions focused. One task per session reduces the chance of cache expiration mid-workflow. If you're pausing for more than 5 minutes, expect a full context re-upload.
- Front-load your context. Put your highest-value context (system prompts, CLAUDE.md, project conventions) at the top of every session. If you're paying cache-write rates, spend them on content that earns its weight across the session.
- Monitor your token breakdown. Add cache-tier tracking to your API cost dashboards. The 5m/1h split is visible in API responses — there's no reason to fly blind.
- Budget for the new reality. If your February costs were your baseline, add 15-25% for the same usage patterns going forward.
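For the monitoring point, a minimal helper along these lines is enough to put on a dashboard. The field names are the ones cited in the issue and should be treated as assumptions:

```python
# Minimal cache-tier monitoring helper: given a `usage` object from an API
# response, report the 5-minute vs. 1-hour cache-write split. Field names
# follow the issue's description and are assumptions, not a documented spec.
def cache_tier_split(usage: dict) -> dict:
    creation = usage.get("cache_creation", {})
    five_m = creation.get("ephemeral_5m_input_tokens", 0)
    one_h = creation.get("ephemeral_1h_input_tokens", 0)
    total = five_m + one_h
    return {
        "5m_tokens": five_m,
        "1h_tokens": one_h,
        "5m_share": five_m / total if total else 0.0,
    }

# Example: a response that wrote 80k tokens at 5m TTL and 20k at 1h TTL.
split = cache_tier_split({"cache_creation": {
    "ephemeral_5m_input_tokens": 80_000,
    "ephemeral_1h_input_tokens": 20_000,
}})
print(split["5m_share"])  # 0.8
```

Alerting when the 5-minute share jumps above your historical baseline would have caught this change on day one.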

The deeper lesson: any cost-sensitive application built on a third-party API needs observability into billing-tier changes, because the provider's changelog won't always tell you.

The Trust Problem

Anthropic asked the community to trust that per-request TTL optimization is better for users in aggregate. Maybe it is. But "trust us, this is cheaper" is a hard sell when the change was invisible, the cost increase was measurable, and the first explanation came from a GitHub issue comment — not a blog post, not a changelog, not an email to affected accounts.

This isn't unique to Anthropic. AWS has a long history of pricing changes that technically improve aggregate economics while hurting specific usage patterns. Google Cloud has deprecated APIs with 90 days notice that broke production systems. The pattern is familiar: provider optimizes for the median user, power users absorb the variance.

The difference is that those providers usually *tell you*. A silent change to billing-critical infrastructure, discovered only through forensic log analysis, is a trust deficit that no amount of economic justification can fully cover.

Looking Ahead

Anthropic deferred the question of whether cache-read quota weighting will be disclosed (issue #45756), and declined to expose a user-configurable TTL setting. Both of those decisions deserve scrutiny. If the per-request optimization is genuinely better, making it transparent costs Anthropic nothing and buys significant goodwill. If it's not — well, that's exactly why users are asking for the knob.

The AI infrastructure market is moving fast enough that developers will tolerate a lot of rough edges. But they won't tolerate surprise invoices. Anthropic has a window to turn this into a trust-building moment: publish the TTL selection logic, expose the configuration, and commit to changelogs for billing-impacting changes. The alternative is that every future cost anomaly gets treated as adversarial until proven otherwise.

Hacker News · 504 pts · 392 comments

Anthropic silently downgraded cache TTL from 1h → 5M on March 6th

→ read on Hacker News
sunaurus · Hacker News

Has anybody else noticed a pretty significant shift in sentiment when discussing Claude/Codex with other engineers since even just a few months ago? Specifically because of the secret/hidden nature of these changes. I keep getting the sense that people feel like they have no idea if they ar…

foofloobar · Hacker News

Claude Code and the subscription are now less useful than a few months ago. Claude Code and the service seem to pick up more and more issues as time goes by: more bugs, fast quota drain, reduced quota, poor model performance, cache invalidation problems, MCP related bugs, potential model quantizatio…

cassianoleal · Hacker News

The title should be changed. It makes it look like they upped the TTL from 1 h to 5 months. The SI symbol for minutes is "min", not "M". A compromise would be to use the OP notation "m".

albert_e · Hacker News

So a side effect of this is -- even at 1 hour caching -- ... If you run out of session quota too quickly and need to wait more than an hour to resume your work ... you are paying even more penalty just to resume your work -- a penalty you wouldn't have needed if session quota was not so restrictive in…

disillusioned · Hacker News

It's also routinely failing the car wash question across all models now, which wasn't the case a month ago. :-/ Seeing some things about how the effort selector isn't working as intended necessarily and the model is regressing in other ways: over-emphasizing how "difficult…
