340+ local newsrooms blocked the Internet Archive. AI training is the excuse, not the reason.

What happened

Nieman Lab reports that more than 340 local news outlets across the United States now block or limit the Internet Archive's access to their journalism. The list is dominated by chain-owned papers: Gannett (~210 dailies and weeklies in the network, at least 180 of them blocking `ia_archiver`), MediaNews Group and Tribune Publishing (both Alden Global Capital portfolios, ~75 blocking titles combined), Lee Enterprises (~40 titles), and Hearst's local-newspaper division (~15). The blocks combine `robots.txt` disallow rules targeting the `ia_archiver` user agent with server-side filtering that returns 403s or empty pages to Wayback Machine crawlers coming from known Archive IP ranges.
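The two mechanisms described above can be told apart with a small probe. The following is a minimal sketch, assuming you have already fetched a site's `robots.txt` and made one request for an article while identifying as `ia_archiver`. The function names, the `"server-side"` label, and the empty-body heuristic are illustrative assumptions, and Python's `urllib.robotparser` only approximates how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

ARCHIVE_AGENT = "ia_archiver"  # the user agent named in the article


def robots_blocks(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the robots.txt text disallows `agent` for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def classify_block(robots_txt: str, status: int, body_length: int, url: str) -> str:
    """Label how (if at all) a site appears to block the archive crawler.

    `status` and `body_length` come from a request you made yourself;
    this function performs no network access.
    """
    if robots_blocks(robots_txt, ARCHIVE_AGENT, url):
        return "robots.txt"
    # Assumption: a 403, or a 200 with an empty body, signals server-side filtering.
    if status == 403 or (status == 200 and body_length == 0):
        return "server-side"
    return "none"
```

A site can use both mechanisms at once; this sketch reports the `robots.txt` rule first because a compliant crawler would never reach the server-side filter.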

The Internet Archive is not an AI company, not a competitor, and not a commercial scraper. It is a 501(c)(3) library that has been archiving the web since 1996, and its Wayback Machine serves as the de facto citation layer for the open web. When a chain blocks `ia_archiver`, stories published today cannot be snapshotted, and depending on how the block is configured, older snapshots may become inaccessible too. The Reynolds Journalism Institute's longitudinal crawl audit shows Wayback coverage of a fixed sample of 50,000 US local-news URLs falling from ~92% successfully archived in Q1 2022 to ~64% in Q1 2026: a 28-percentage-point collapse in four years, with the steepest drop in the twelve months after ChatGPT's launch.

The pattern correlates with ownership, not editorial stance. Small independent papers are largely still crawlable. The big chains are not.

Is this an AI defense or a paywall move?

Publishers, when they comment at all, offer some mix of "protecting our content from AI training" and "protecting our subscription business." The AI rationale is the one quoted in press releases. The paywall rationale is the one the mechanics actually support.

Here is the problem with the AI framing: blocking `ia_archiver` does essentially nothing to stop LLM training data acquisition. OpenAI's GPTBot, Anthropic's ClaudeBot, Google-Extended, CCBot (Common Crawl), and Meta's FacebookBot are all separate `robots.txt` tokens, each governed by its own directives. A site that wants to keep its content out of AI training sets has to block those crawlers by name, and many of the same chains explicitly *allow* GPTBot and Google-Extended because they have signed (or are negotiating) licensing deals worth real money. The New York Times sued OpenAI; Gannett, by contrast, has been reported as exploring direct licensing. You don't block the Archive and sign with OpenAI if your problem is "AI is training on our content." Your problem is something else.
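You can check the mismatch for any one site by reading its `robots.txt` per agent. The sketch below is a rough audit, assuming the agent names listed above are the tokens a publisher would use. The sample file is invented for illustration, not copied from any real outlet, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ARCHIVE_AGENTS = ["ia_archiver"]
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "FacebookBot"]

# Invented example file: blocks the archive crawler, welcomes GPTBot.
SAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /
"""


def audit(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each agent name to whether robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url)
            for agent in ARCHIVE_AGENTS + AI_AGENTS}


def blocks_archive_but_not_ai(report: dict[str, bool]) -> bool:
    """True when every archive agent is denied yet some AI agent is allowed."""
    archive_denied = not any(report[a] for a in ARCHIVE_AGENTS)
    ai_allowed = any(report[a] for a in AI_AGENTS)
    return archive_denied and ai_allowed


if __name__ == "__main__":
    report = audit(SAMPLE_ROBOTS, "https://paper.example/news/story.html")
    for agent, allowed in report.items():
        print(f"{agent:16} {'allow' if allowed else 'deny'}")
    print("archive blocked, AI allowed:", blocks_archive_but_not_ai(report))
```

This only reads declared policy. It says nothing about server-side filtering, and it cannot tell you whether a given crawler actually honors the file.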

What the Wayback block *does* accomplish:

1. Paywall enforcement. The most common workaround for a metered or hard paywall on a local-news site is to paste the URL into web.archive.org. Block `ia_archiver` and that workaround dies for new articles. This is the immediate, mechanical effect.
2. Memory-holing. When a story is updated, retracted, or quietly deleted (common in chains with thin editorial oversight and active defamation exposure), the absence of a Wayback snapshot means the prior version is simply gone. No correction record, no public diff.
3. Litigation control. Hedge-fund-owned chains face a steady drumbeat of defamation suits and public-records disputes. An unarchived web is a more controllable web.

The AI-training story is the press-release version. The paywall-and-memory-hole story is what the configuration files actually do.
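To make the "public diff" point concrete: when two snapshots of a story exist, recovering what changed is a few lines of code. This is a minimal sketch using Python's standard `difflib`, assuming you already hold the extracted article text of both versions. The function name and labels are illustrative.

```python
import difflib


def public_diff(earlier: str, later: str,
                earlier_label: str = "snapshot-earlier",
                later_label: str = "snapshot-later") -> str:
    """Return a unified diff between two saved versions of an article's text."""
    lines = difflib.unified_diff(
        earlier.splitlines(),
        later.splitlines(),
        fromfile=earlier_label,
        tofile=later_label,
        lineterm="",  # inputs carry no trailing newlines, so emit none
    )
    return "\n".join(lines)


if __name__ == "__main__":
    v1 = "The council approved the contract.\nThe mayor declined to comment."
    v2 = "The council approved the contract.\nThe mayor praised the deal."
    print(public_diff(v1, v2))
```

Without an earlier snapshot there is nothing to pass as `earlier`, which is exactly the gap a blocked archive leaves.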

Who is doing this

From the Reynolds Institute audit cross-referenced with Archive crawl logs, the rough breakdown of the 340+ blocking outlets:

- Gannett (USA Today Network): ~180 properties. Largest single blocker. Began rolling out blocks late 2023.
- Alden Global Capital (MediaNews Group + Tribune Publishing): ~75 properties including the Chicago Tribune, Denver Post, Mercury News, Orange County Register. Alden's papers were among the first movers in 2022.
- Lee Enterprises: ~40 properties, mostly midwestern dailies. Rolled out blocks in Q2 2024.
- Hearst local newspapers: ~15 properties.
- Other chains + independents: the remaining ~30.

Notably absent from the block list: most ProPublica-affiliated nonprofits, the AP, Reuters, and most digital-native locals (Block Club Chicago, The City, Texas Tribune). The split is almost perfectly along ownership lines — hedge-fund and PE-owned chains block; mission-driven and reader-funded outlets do not.

What this breaks

For practitioners, the concrete consequences:

- Citation rot accelerates. Every link to a Gannett article in a 2020 blog post, research paper, court filing, or Wikipedia footnote is now a dead end the moment the article moves or 404s. The Wayback fallback that the entire web has quietly relied on for 25 years no longer exists for ~30% of US local journalism.
- OSINT and journalism workflows degrade. Investigations that rely on comparing article versions over time (standard for accountability reporting) now hit walls on exactly the chains most likely to publish-then-quietly-edit.
- Tooling assumptions break. If you've built anything that resolves URLs through archive.org as a fallback (link-checkers, research agents, fact-checking pipelines, RAG systems pulling historical context), the success rate against local-news domains has been silently collapsing.
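If your tooling falls back to archive.org, the first step is to measure how often that fallback actually succeeds. The sketch below assumes the Wayback Machine's public availability endpoint and its JSON shape (`archived_snapshots.closest`). The helper names are hypothetical, and the parsing is kept separate from the network call so it can be tested offline.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability lookup URL for one article URL."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': url})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the snapshot URL from a decoded response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timeout: float = 10.0) -> Optional[str]:
    """Network call: ask the Wayback Machine whether it holds a snapshot."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


def hit_rate(payloads: list[dict]) -> float:
    """Fraction of decoded responses that contain a usable snapshot."""
    if not payloads:
        return 0.0
    hits = sum(1 for p in payloads if closest_snapshot(p) is not None)
    return hits / len(payloads)
```

Logging `hit_rate` per domain over time is one way to turn a silent collapse into a visible one. Note that a snapshot being listed does not guarantee it holds the article rather than a paywall or error page.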

What to do about it

If any part of your stack depends on archive.org for local-news URLs, treat that dependency as failing and start mitigating now:

1. Mirror critical sources yourself. For any URL you cite or depend on, push a copy to your own storage (S3, B2, a private archive box) at ingestion time. Don't trust that the Wayback snapshot will be there in a year.
2. Use archive.today (archive.ph) as a secondary. It's a separate operation, hosted offshore, and most chains have not bothered to block it because its crawl pattern doesn't trigger their detection. Coverage is spottier but the gap is different.
3. Push for the `archive` permission in robots.txt. The IETF working group has proposed a granular robots.txt vocabulary that would let sites grant archival access while denying AI training. Right now publishers conflate the two on purpose. Granular controls remove their excuse.
4. Donate to the Internet Archive. They are simultaneously being sued by publishers (Hachette v. Internet Archive), blocked by publishers, and starved of the public funding that would let them fight back. They run on ~$25M a year and are the load-bearing memory of the open web.
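The first item, mirroring at ingestion time, can be sketched as follows. This is an illustrative outline, assuming you already hold the fetched page bytes. A local directory stands in for S3 or B2, and the content-addressed file naming and `index.jsonl` log are assumptions of this sketch, not an established format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def mirror_copy(url: str, body: bytes, root: Path) -> Path:
    """Store `body` under its SHA-256 digest and log where it came from.

    Identical content is written once; every call appends an index record,
    so repeated captures of an unchanged page remain visible in the log.
    """
    digest = hashlib.sha256(body).hexdigest()
    root.mkdir(parents=True, exist_ok=True)

    blob = root / f"{digest}.html"
    if not blob.exists():
        blob.write_bytes(body)

    record = {
        "url": url,
        "sha256": digest,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
    with (root / "index.jsonl").open("a", encoding="utf-8") as index:
        index.write(json.dumps(record) + "\n")
    return blob
```

Because files are keyed by content hash, a later capture of an edited article lands in a new file, which gives you two versions to compare even when no public snapshot exists.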

The story being told is about AI. The infrastructure being dismantled is the public record. Those are not the same thing, and the people building today should not pretend otherwise.

// read from source

More than 340 local news outlets are limiting the Internet Archive's access
