Your Server Doesn't Know Who's Real Anymore

4 min read · 1 source · explainer
├── "The real bot traffic share is far worse than reported because detection-based measurements systematically undercount"
│  └── Glade Art (gladeart.com) → read

Argues that the commonly cited 51% figure from Imperva/Thales reports only captures bots their detection systems identify. The true number is 51% plus an unknown and growing fraction of bots sophisticated enough to evade detection, making the problem structurally worse than headlines suggest.

├── "The economics of bot operation have collapsed, making the problem irreversible at current defense paradigms"
│  └── Glade Art (gladeart.com) → read

Points out that running headless Chrome with residential proxy rotation costs under $1/hour and CAPTCHA solving runs $0.50–$2 per thousand. The barrier to generating convincing automated traffic is now effectively zero, meaning defense costs will always outpace attacker costs.

└── "The downstream data contamination is the most dangerous consequence — product decisions and ML models are being built on polluted metrics"
  └── top10.dev editorial (top10.dev) → read below

Argues that the first-order issue is epistemic: conversion funnels, A/B test results, capacity planning, and engagement metrics all include an unknown proportion of non-human activity. A/B test winners may simply be the variant bots preferred, and recommendation engines are being trained on what bots click on — making every data-driven decision suspect.

What happened

A detailed breakdown from Glade Art has been making the rounds on Hacker News (174 points and climbing), laying out why the bot problem on the internet is structurally worse than the headline "51% of traffic is bots" suggests. The piece argues that the commonly cited Imperva/Thales reports — which have tracked bot traffic share for over a decade — actually *undercount* the problem because they only measure what their detection systems catch.

The real number isn't 51%. It's 51% plus whatever fraction of "human" traffic is actually bots good enough to evade detection. And that fraction is growing, because the economics of bot operation have fundamentally shifted. Running a headless Chrome instance with residential proxy rotation costs under $1/hour. Solving CAPTCHAs at scale costs $0.50–$2 per thousand. The barrier to generating convincing automated traffic is now effectively zero.

The post resonated because it names what many backend engineers have quietly observed: the traffic hitting their servers increasingly doesn't behave like the traffic their systems were designed for.

Why it matters

### The measurement problem

The first-order issue is epistemic. If you can't reliably distinguish human requests from automated ones, every metric downstream of that distinction is contaminated. Your conversion funnel, your A/B test results, your capacity planning models, and your engagement metrics are all built on data that includes an unknown and growing proportion of non-human activity.

This isn't theoretical. Teams running A/B tests with standard analytics tooling are making product decisions based on cohorts that include bots. If bot traffic isn't uniformly distributed across variants — and there's no reason to assume it would be — your winning variant might just be the one that bots preferred. The same applies to ML models trained on user behavior data: you're teaching your recommendation engine what bots click on.
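To make that failure mode concrete, here is a toy simulation. Every rate and traffic share below is invented for illustration: variant A genuinely converts better for humans, but bots concentrate on B and sometimes fire the conversion event, so the naive read of the analytics crowns the wrong variant.

```python
import random

random.seed(7)

# Toy numbers, all invented for illustration: variant A truly converts
# better for humans, but bots concentrate on B and sometimes fire the
# conversion event (e.g. scrapers following the CTA link).
HUMAN_RATE = {"A": 0.06, "B": 0.04}
BOT_CLICK_RATE = 0.15
BOT_B_SHARE = 0.90  # bot traffic is not uniform across variants

sessions = []  # (variant, converted, is_bot)
for _ in range(10_000):
    v = random.choice("AB")
    sessions.append((v, random.random() < HUMAN_RATE[v], False))
for _ in range(4_000):
    v = "B" if random.random() < BOT_B_SHARE else "A"
    sessions.append((v, random.random() < BOT_CLICK_RATE, True))

def rate(variant: str, include_bots: bool) -> float:
    """Conversion rate per variant, with or without bot sessions."""
    pool = [(v, c) for v, c, bot in sessions
            if v == variant and (include_bots or not bot)]
    return sum(c for _, c in pool) / len(pool)

for label, bots in (("naive", True), ("bot-filtered", False)):
    print(f"{label}: A={rate('A', bots):.2%}  B={rate('B', bots):.2%}")
```

With these numbers the naive view declares B the winner while the bot-filtered view shows A is better — exactly the inversion the paragraph above warns about.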

### The infrastructure cost problem

Modern bots don't just scrape — they render JavaScript, execute API calls, and maintain session state. They consume compute, bandwidth, and database connections exactly like real users. If 40–60% of your traffic is automated, you're paying for 40–60% more infrastructure than your actual user base requires.

For teams running autoscaling on cloud infrastructure, this is particularly insidious. Your scaling policies respond to load, and bots generate load. You're autoscaling to serve machines pretending to be humans, and your cloud bill reflects it. The bot operators pay $1/hour for their headless browsers; you pay $50/hour for the EC2 instances serving them.
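A back-of-envelope sketch of that asymmetry. Only the ~$1/hour bot cost and $50/hour fleet figures echo the article; the 50% traffic share is an assumed midpoint of the 40–60% range, and the linear load-to-cost model is a simplification.

```python
# Back-of-envelope: what serving bots costs you vs. what it costs them.
# Only the $1/hr and $50/hr figures come from the article; the 50% share
# is an assumed midpoint, and linear load scaling is a simplification.
bot_traffic_share = 0.50          # assumed midpoint of the 40-60% range
fleet_cost_per_hour = 50.0        # your EC2 spend at bot-inflated load
bot_operator_cost_per_hour = 1.0  # headless Chrome + residential proxies

# If load scales roughly linearly with traffic, this share of the fleet
# exists only to serve machines.
wasted_per_hour = fleet_cost_per_hour * bot_traffic_share
wasted_per_month = wasted_per_hour * 24 * 30

print(f"spend serving bots: ${wasted_per_hour:.2f}/hr, ~${wasted_per_month:,.0f}/mo")
print(f"defender:attacker cost ratio: "
      f"{wasted_per_hour / bot_operator_cost_per_hour:.0f}:1")
```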

### The arms race problem

The traditional defense stack — user-agent filtering, IP rate limiting, CAPTCHAs — was designed for a world where bots were distinguishable from humans. That world is gone.

Modern bot frameworks like Puppeteer with stealth plugins, Playwright with fingerprint spoofing, and commercial scraping APIs (Bright Data, Oxylabs, ScrapingBee) produce traffic that is effectively indistinguishable from a real Chrome user's. They randomize mouse movements, simulate scroll patterns, maintain cookie jars, and rotate through residential IP pools that share address space with real ISP customers. Rate limiting by IP catches amateurs; professionals distribute across thousands of addresses.

CAPTCHA services like reCAPTCHA and hCaptcha have responded by making challenges harder — which primarily degrades the experience for legitimate users while bot operators route challenges to solving farms or ML solvers. Google's own reCAPTCHA v3 attempts to score "humanness" without a visible challenge, but it's a probabilistic system making binary decisions, and the false positive rate on privacy-conscious users (VPN, Tor, ad blockers) is high enough to be a business problem.

What this means for your stack

### Treat bot defense as architecture, not configuration

If you're still handling bot traffic with nginx rules and a WAF, you're fighting a 2026 problem with 2016 tools. The effective approaches now are layered and behavioral:

1. Server-side fingerprinting beyond the request. TLS fingerprinting (JA3/JA4 hashes) can identify the actual TLS library making the connection, which is much harder to spoof than a user-agent string. A request claiming to be Chrome but using a Go TLS stack is not Chrome.

2. Behavioral analysis over session lifetime. Real humans exhibit timing patterns that are expensive to simulate at scale — variable inter-request delays, non-uniform page traversal, referrer chains that match organic navigation. Building a scoring model on session behavior rather than individual requests raises the cost of evasion significantly.

3. Proof-of-work challenges. Instead of CAPTCHAs (which outsource cost to humans), some teams are experimenting with computational proof-of-work — requiring the client to solve a hash puzzle before serving the response. This imposes negligible cost on individual humans but makes operating thousands of concurrent bot sessions economically painful.

4. Accept the baseline and design around it. For analytics, this means filtering before aggregation — applying bot-probability scores to pageview data before it enters your dashboards, not after. For A/B testing, it means excluding sessions that fail behavioral checks from your experiment cohorts. For ML pipelines, it means treating provenance as a feature.
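The fingerprint-mismatch idea in point 1 reduces, at its simplest, to a consistency check between the TLS fingerprint observed at the edge and the browser the request claims to be. A minimal sketch — the JA3 hashes below are placeholders, not real fingerprint values; in practice the hash comes from your edge proxy and the lookup table from a curated dataset.

```python
# Sketch of point 1: flag requests whose claimed browser contradicts the
# TLS stack that actually made the connection. The JA3 hashes here are
# PLACEHOLDERS, not real fingerprints; in production the hash is computed
# at the edge and the table comes from a curated fingerprint dataset.
KNOWN_JA3 = {
    "chrome-placeholder-hash": "Chrome",
    "firefox-placeholder-hash": "Firefox",
    "golang-tls-placeholder": "Go net/http",  # common scraper signature
}

def tls_ua_mismatch(ja3_hash: str, user_agent: str) -> bool:
    """True when the TLS fingerprint contradicts the User-Agent claim."""
    family = KNOWN_JA3.get(ja3_hash)
    if family is None:
        return True  # unknown TLS stack claiming to be anything is suspect
    return family.split()[0].lower() not in user_agent.lower()

# A request claiming Chrome but negotiating TLS like a Go program is not Chrome.
print(tls_ua_mismatch("golang-tls-placeholder", "Mozilla/5.0 Chrome/120"))
```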
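Point 2 can start as simply as scoring the regularity of inter-request gaps: a fixed polling loop is metronomic, humans are bursty. A single-feature sketch — real systems combine many such signals, and the sample sessions here are illustrative.

```python
from statistics import mean, stdev

def timing_score(request_times: list[float]) -> float:
    """Coefficient of variation of inter-request gaps: near zero means
    metronomic, machine-like pacing; humans are far more irregular.
    A single-feature model like this is illustrative only."""
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    if len(gaps) < 2:
        return 1.0  # too little signal; don't penalise short sessions
    return stdev(gaps) / mean(gaps)

bot_session = [0.0, 2.0, 4.0, 6.0, 8.0]      # fixed 2-second poll loop
human_session = [0.0, 1.2, 9.5, 11.0, 40.0]  # bursty, irregular reading
print(timing_score(bot_session), timing_score(human_session))
```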
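Point 3 is essentially Hashcash: the server issues a random challenge, the client must find a nonce whose hash clears a difficulty bar, and verification costs the server a single hash. A minimal sketch, with an illustrative difficulty value.

```python
import hashlib
import secrets

DIFFICULTY = 12  # leading zero bits; tune so a phone solves in milliseconds

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes) -> int:
    """Client side: brute-force a nonce. Cheap once, painful x10,000 sessions."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """Server side: one hash to check the submitted nonce."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY

challenge = secrets.token_bytes(16)  # issued per session, e.g. in a cookie
nonce = solve(challenge)
print(verify(challenge, nonce))  # True
```

The asymmetry is the point: the client does ~2^DIFFICULTY hashes on average, the server does one. This is the same idea behind tools like Anubis, which a commenter below reports deploying.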

### Audit your assumptions

Review every system that implicitly assumes "one request = one human intent." Billing systems that charge per API call. Recommendation engines trained on click data. Fraud detection systems that use traffic patterns as a signal. Rate limiters that trust authenticated sessions. Each of these has a bot-shaped vulnerability that grows as automated traffic increases.

Looking ahead

The uncomfortable trajectory here is clear: as LLMs make it trivial to generate human-like text, behavior, and interaction patterns, the gap between "real" and "automated" traffic will continue to narrow. The internet is becoming a place where the default assumption for any given request should be "probably not human" — and our tools, metrics, and architectures haven't caught up to that reality. The teams that adapt earliest — by building bot-awareness into their core infrastructure rather than bolting it on as a security concern — will waste less money, make better product decisions, and ship to actual users instead of to the void.

Hacker News · 174 pts · 126 comments

The bot situation on the internet is worse than you could imagine

https://web.archive.org/web/20260329052632/https://gladeart.com/blog/the-bot-situation-on-the-internet-is-actually-worse-than-you-could-imagine-he

→ read on Hacker News
lm411 · Hacker News

AI companies and notably AI scrapers are a cancer that is destroying what's left of the WWW. I was hit with a pretty substantial botnet "distributed scraping" attack yesterday.
- About 400,000 different IP addresses over about 3 hours
- Mostly residential IP addresses
- Valid and unique u

oasisbob · Hacker News

Knew it was getting bad, but Meta's facebookexternalhit bot changed their behavior recently. In addition to pulling responses with huge amplification (40x, at least, for posting a single Facebook post to an empty audience), it's sending us traffic with fbclids in the mix. No idea why. They

pinkmuffinere · Hacker News

I’ve been sitting on this page for two minutes and it’s still not sure whether I’m a bot lol. What did I do in a past life to deserve this :(

salomonk_mur · Hacker News

I'm surprised at the effectiveness of simple PoW to stop practically all activity. I'll implement Anubis at low difficulty for all my projects and leave a decent llms.txt referenced in my sitemap and robots.txt so LLMs can still get relevant data for my site while keeping bad bots out. I

simonw · Hacker News

> These bots are almost certainly scraping data for AI training; normal bad actors don't have funding for millions of unique IPs thrown at a page. They probably belong to several different companies. Perhaps they sell their scraped data to AI companies, or they are AI companies themselves. W
