340+ local newsrooms blocked the Internet Archive. AI training is the excuse, not the reason.

├── "The AI-training justification is a smokescreen — this is really about paywall enforcement"
│  └── Nieman Lab (Nieman Lab) → read

The article argues that blocking ia_archiver does essentially nothing to stop LLM training data collection, since AI scrapers don't identify as ia_archiver and pull from live sites directly. The mechanics of the block — targeting the Wayback Machine's specific user agent and IP ranges — align with protecting subscription revenue by preventing free archival access, not with any meaningful AI defense.

├── "Chain ownership, not editorial principle, is driving the collapse of local-news archiving"
│  └── Nieman Lab (Nieman Lab) → read

The reporting highlights that the blocks correlate with ownership rather than journalistic stance: Gannett, Alden Global Capital (MediaNews Group + Tribune), Lee, and Hearst together account for the overwhelming majority of blocked outlets, while small independent papers remain largely crawlable. The pattern suggests private-equity and chain-level policy decisions — not local newsroom values — are erasing the archival record.

├── "The Internet Archive deserves protected status as civic infrastructure, not treatment as a scraper"
│  ├── Nieman Lab (Nieman Lab) → read

The piece emphasizes that the Internet Archive is a 501(c)(3) library — not an AI company, competitor, or commercial scraper — and that the Wayback Machine has served as the de facto citation layer for the open web since 1996. Lumping it in with hostile crawlers treats a public-interest institution as an adversary and undermines the historical record of local journalism.

│  └── @HN community (Hacker News, 307 pts) → view

The story drew 307 points on Hacker News, which suggests the developer audience broadly sees the Internet Archive as playing a unique civic role and regards publisher blocks against it as a meaningful loss for the open web.

└── "Wayback coverage of local news is in measurable, accelerating collapse"
  └── Reynolds Journalism Institute (cited) (Nieman Lab) → read

RJI's longitudinal audit of 50,000 fixed local-news URLs shows successful Wayback archival dropping from ~92% in Q1 2022 to ~64% in Q1 2026 — a 28-point collapse in four years, with the steepest decline in the twelve months after ChatGPT's launch. The data reframes the story from anecdote to documented, quantifiable erosion of the archival record.

What happened

Nieman Lab reports that more than 340 local news outlets across the United States now block or limit the Internet Archive's access to their journalism. The list is dominated by chain-owned papers: properties under Gannett (~210 dailies and weeklies in the network, with at least 180 blocking ia_archiver), MediaNews Group and Tribune Publishing (both Alden Global Capital portfolios, ~75 titles combined blocking), Lee Enterprises (~40 titles), and Hearst's local-newspaper division (~15). The blocks are implemented through a mix of `robots.txt` disallow rules targeting the `ia_archiver` user agent and server-side filtering that returns 403s or empty pages to Wayback Machine crawlers from known Archive IP ranges.

The Internet Archive is not an AI company, not a competitor, and not a commercial scraper — it is a 501(c)(3) library whose Wayback Machine has, since 1996, served as the de facto citation layer for the open web. When a chain blocks `ia_archiver`, stories published today cannot be snapshotted, and depending on how the block is configured, older snapshots may become inaccessible too. The Reynolds Journalism Institute's longitudinal crawl audit shows Wayback coverage of a fixed sample of 50,000 US local-news URLs falling from ~92% successfully archived in Q1 2022 to ~64% in Q1 2026 — a 28-point collapse in four years, with the steepest drop in the twelve months after ChatGPT's launch.
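To make the audit figures concrete, here is a minimal arithmetic sketch. It takes the rounded numbers quoted above (50,000 URLs, ~92% and ~64%) at face value, so the outputs are approximations, not RJI's exact counts. The function name is illustrative.

```python
# Rounded figures quoted from the audit; treated here as exact for illustration.
SAMPLE_SIZE = 50_000
ARCHIVED_PCT_Q1_2022 = 92
ARCHIVED_PCT_Q1_2026 = 64


def coverage_change(sample_size: int, before_pct: int, after_pct: int):
    """Return (percentage-point drop, relative drop in %, URLs no longer archived)."""
    point_drop = before_pct - after_pct
    relative_drop = 100 * point_drop / before_pct
    archived_before = sample_size * before_pct // 100
    archived_after = sample_size * after_pct // 100
    return point_drop, relative_drop, archived_before - archived_after


if __name__ == "__main__":
    points, relative, lost = coverage_change(
        SAMPLE_SIZE, ARCHIVED_PCT_Q1_2022, ARCHIVED_PCT_Q1_2026
    )
    print(f"{points}-point drop, {relative:.1f}% relative decline, {lost:,} fewer URLs archived")
```

Under those rounded inputs, the 28-point drop corresponds to roughly 14,000 of the 50,000 sampled URLs that were archived in 2022 but not in 2026, and to a relative decline of about 30% in archived coverage.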

The pattern correlates with ownership, not editorial stance. Small independent papers are largely still crawlable. The big chains are not.

Is this an AI defense or a paywall move?

Publishers, when they comment at all, offer some mix of "protecting our content from AI training" and "protecting our subscription business." The AI rationale is the one quoted in press releases. The paywall rationale is the one the mechanics actually support.

Here is the problem with the AI framing: blocking `ia_archiver` does essentially nothing to stop LLM training data acquisition. OpenAI's GPTBot, Anthropic's ClaudeBot, Common Crawl's CCBot, and Meta's FacebookBot are entirely separate crawlers, and Google-Extended is a separate robots.txt token; each is governed by its own robots.txt directives. A site that wants to keep its content out of AI training has to block those by name, and many of the same chains explicitly *allow* GPTBot and Google-Extended because they have signed (or are negotiating) licensing deals worth real money. The New York Times sued OpenAI; Gannett, by contrast, has been reported as exploring direct licensing. You don't block the Archive and sign with OpenAI if your problem is "AI is training on our content." Your problem is something else.
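The gap between "blocks the Archive" and "blocks AI crawlers" is easy to check mechanically. Below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical, written to mimic the pattern described above; they are not copied from any publisher's actual file. The token list simply mirrors the crawlers named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the style described above: the Archive's
# token is shut out entirely, everyone else only loses /admin/.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /admin/
"""

# Tokens named in the article; extend as needed.
CRAWLER_TOKENS = [
    "ia_archiver",
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "CCBot",
    "FacebookBot",
]


def audit_robots(robots_txt: str, url: str, tokens=CRAWLER_TOKENS) -> dict:
    """Map each crawler token to True if robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    results = audit_robots(SAMPLE_ROBOTS_TXT, "https://example.com/news/story.html")
    for token, allowed in results.items():
        print(f"{token:16s} {'allowed' if allowed else 'blocked'}")
```

With this sample file, only `ia_archiver` comes back blocked; every AI-related token falls through to the permissive `*` rule. Note that this checks robots.txt only. It says nothing about the server-side 403s and IP filtering described earlier, which need a live request to detect.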

What the Wayback block *does* accomplish:

1. Paywall enforcement. The most common workaround for a metered or hard paywall on a local-news site is to paste the URL into web.archive.org. Block ia_archiver and that workaround dies for new articles. This is the immediate, mechanical effect.
2. Memory-holing. When a story is updated, retracted, or quietly deleted (common in chains with thin editorial oversight and active defamation exposure), the absence of a Wayback snapshot means the prior version is simply gone. No correction record, no public diff.
3. Litigation control. Hedge-fund-owned chains face a steady drumbeat of defamation suits and public-records disputes. An unarchived web is a more controllable web.

The AI-training story is the press-release version. The paywall-and-memory-hole story is what the configuration files actually do.

Who is doing this

From the Reynolds Institute audit cross-referenced with Archive crawl logs, the rough breakdown of the 340+ blocking outlets:

- Gannett (USA Today Network): ~180 properties. Largest single blocker. Began rolling out blocks late 2023.
- Alden Global Capital (MediaNews Group + Tribune Publishing): ~75 properties including the Chicago Tribune, Denver Post, Mercury News, Orange County Register. Alden's papers were among the first movers in 2022.
- Lee Enterprises: ~40 properties, mostly midwestern dailies. Rolled out blocks in Q2 2024.
- Hearst local newspapers: ~15 properties.
- Other chains + independents: the remaining ~30.
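As a quick sanity check on the breakdown, the sketch below tallies the approximate counts listed above and computes each owner's share. The counts are the rounded figures from this article, so the percentages are rough.

```python
# Approximate counts from the breakdown above.
BLOCKING_OUTLETS = {
    "Gannett": 180,
    "Alden Global Capital": 75,
    "Lee Enterprises": 40,
    "Hearst": 15,
    "Other chains + independents": 30,
}


def ownership_shares(counts: dict) -> dict:
    """Return each owner's share of blocking outlets, in percent (1 decimal)."""
    total = sum(counts.values())
    return {owner: round(100 * n / total, 1) for owner, n in counts.items()}


if __name__ == "__main__":
    total = sum(BLOCKING_OUTLETS.values())
    print(f"total: ~{total}")
    for owner, share in ownership_shares(BLOCKING_OUTLETS).items():
        print(f"{owner:30s} {share:5.1f}%")
```

The counts sum to about 340, and the four named chains together make up roughly nine in ten of the blocking outlets, with Gannett alone accounting for a little over half.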

Notably absent from the block list: most ProPublica-affiliated nonprofits, the AP, Reuters, and most digital-native locals (Block Club Chicago, The City, Texas Tribune). The split is almost perfectly along ownership lines — hedge-fund and PE-owned chains block; mission-driven and reader-funded outlets do not.

What this breaks

For practitioners, the concrete consequences:

- Citation rot accelerates. Every link to a Gannett article in a 2020 blog post, research paper, court filing, or Wikipedia footnote is now a dead end the moment the article moves or 404s. The Wayback fallback that the entire web has quietly relied on for 25 years no longer exists for ~30% of US local journalism.
- OSINT and journalism workflows degrade. Investigations that rely on comparing article versions over time (standard for accountability reporting) now hit walls on exactly the chains most likely to publish-then-quietly-edit.
- Tooling assumptions break. If you've built anything that resolves URLs through archive.org as a fallback (link-checkers, research agents, fact-checking pipelines, RAG systems pulling historical context), the success rate against local-news domains has been silently collapsing.
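If you want to know how badly your own archive.org fallback has degraded, measure it. The sketch below assumes the Wayback Machine availability endpoint (`https://archive.org/wayback/available?url=...`) and its `archived_snapshots.closest` JSON shape as I understand them; verify both against the Archive's current documentation before relying on this. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint; confirm against the Internet Archive's API docs.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the closest snapshot URL from an availability response, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup_snapshot(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the availability endpoint whether a snapshot of url exists."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


def fallback_success_rate(payloads: list) -> float:
    """Fraction of availability responses that contain a usable snapshot."""
    if not payloads:
        return 0.0
    hits = sum(1 for payload in payloads if closest_snapshot(payload) is not None)
    return hits / len(payloads)
```

Running `lookup_snapshot` over the local-news URLs your system already cites, then grouping the results by domain, gives a per-publisher success rate you can track over time instead of discovering the gap when a link fails.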

What to do about it

If any part of your stack depends on archive.org for local-news URLs, treat that dependency as failing and start mitigating now:

1. Mirror critical sources yourself. For any URL you cite or depend on, push a copy to your own storage (S3, B2, a private archive box) at ingestion time. Don't trust that the Wayback snapshot will be there in a year.
2. Use archive.today (archive.ph) as a secondary. It's a separate operation, hosted offshore, and most chains have not bothered to block it because its crawl pattern doesn't trigger their detection. Coverage is spottier but the gap is different.
3. Push for the `archive` permission in robots.txt. The IETF working group has proposed a granular robots.txt vocabulary that would let sites grant archival access while denying AI training. Right now publishers conflate the two on purpose. Granular controls remove their excuse.
4. Donate to the Internet Archive. They are simultaneously being sued by publishers (Hachette v. Internet Archive), blocked by publishers, and starved of the public funding that would let them fight back. They run on ~$25M a year and are the load-bearing memory of the open web.
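For the first item, mirroring at ingestion time can be very small. This is a minimal sketch, assuming you already have the page body in hand and a local directory standing in for S3, B2, or whatever storage you use. The content-addressed file layout and the metadata fields are my own illustrative choices, not an established archive format.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def mirror_snapshot(url: str, body: bytes, root: Path) -> Path:
    """Store body under its SHA-256 digest, with a JSON sidecar recording provenance."""
    digest = hashlib.sha256(body).hexdigest()
    root.mkdir(parents=True, exist_ok=True)

    # Content-addressed name: identical bodies map to the same file.
    content_path = root / f"{digest}.html"
    content_path.write_bytes(body)

    metadata = {
        "url": url,
        "sha256": digest,
        "bytes": len(body),
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (root / f"{digest}.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return content_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = mirror_snapshot(
            "https://example.com/news/story.html",  # hypothetical URL
            b"<html><body>Example story</body></html>",
            Path(tmp) / "mirror",
        )
        print(f"saved {saved.name}")
```

Storing by content hash means a later capture of an edited article lands in a new file, so you keep the version history that a missing Wayback snapshot would otherwise erase. Whether you may retain or republish those copies depends on the publisher's terms and applicable law, which this sketch does not address.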

The story being told is about AI. The infrastructure being dismantled is the public record. Those are not the same thing, and the people building today should not pretend otherwise.

Hacker News 307 pts 109 comments

More than 340 local news outlets are limiting the Internet Archive's access

remus · Hacker News

That's a real shame. I am involved with some history-related projects and the number of websites which go offline is huge, and the wayback machine is incredibly helpful for unearthing these dead sites. It is not hard to imagine a future in 50 years time where a huge percentage of this content is

hungryhobbit · Hacker News

There's an incredibly simple fix: block the archive for a week. No one is paying after a week, so you let the Archive access after that. I don't see why every news outlet doesn't just do this.

storus · Hacker News

Not trying to be paranoid, but losing recorded history raw as it was originally reported could lead to quick AI-assisted rewrites in the archives of news outlets to fit whatever narrative of the "jour" is in fashion/that powerful of those times want. We are already seeing it in new ed

dspillett · Hacker News

My cynical view is that a lot of these outlets would have liked to block the archive anyway⁰ but didn't as it could look bad to do so, and AI scraping is a convenient excuse. Much like some (but far from all) of the recent job cuts that have been announced “due to AI”. An even more cynical view

svachalek · Hacker News

There really should be a micropayments setup on the internet that's not advertising based. Let these models pay a nickel to read the article, covered by the multi trillion dollar AI blank check.
