SubINaclS 7799699da6 Update Dissertation.md

2026-06-02 14:13:23 +00:00

When Being Polite Fails, Try Poison

The Church of Malware (CoM) does not condone the use of toxic substances on nature, humans, or animals; AI, however, is neither natural, nor human, nor actual intelligence.

1 -- The problem is well known (Historical Reference)

This section captures relevant historical and legal precedent from studies and cases that are unrelated to CoM but are needed to understand the underlying issues with modern AI labs and the models they produce.

1.1 -- Wired v Perplexity

In June 2024, Wired's engineering team watched Perplexity fetch articles it had been explicitly told not to fetch. The site's robots.txt disallowed PerplexityBot. Perplexity's declared crawler honored it. Then requests kept arriving from an undeclared user agent on an AWS IP range, pulling the same URLs and surfacing them, near-verbatim, in Perplexity answers minutes later. When confronted, the company called it a third-party contractor problem.

1.2 -- iFixit v Anthropic

iFixit's CEO posted server logs showing Anthropic's ClaudeBot hitting his site close to a million times in twenty-four hours. Read the Docs disclosed that a single AI crawler pulled 73 TB in one month, costing more than $5,000 in bandwidth charges. Wikimedia reported that roughly sixty-five percent of its most expensive (uncached) traffic was coming from crawler bots, while human readership had not grown enough to account for such an impact. By early 2025, SourceHut's Drew DeVault was writing the same post every other month: "Please stop, we are a small team, we will go down."

1.3 -- Synopsis

None of those operators were asking for novel protections. They were asking that the existing ones, which were being willfully ignored, be honored. Every polite mechanism the web has shipped in the last thirty years (robots.txt, ai.txt, IETF content-usage preferences, even the "email us to opt out" forms) has been treated as advisory at best and as a target list at worst. The only language scrapers have demonstrably responded to is cost and corruption. This document aims to help readers make their own contribution: saving the planet by way of hacking the planet.

2 -- The graveyard of good-faith mechanisms

Every one of these mechanisms assumes the scraper wants to comply. Six years of evidence says it doesn't.

2.1 -- Robots.txt History

robots.txt was created in 1994 by Martijn Koster after a rogue crawler took down his server. It’s a voluntary “gentleman’s agreement” stored as a plain text file at the root of your domain. It only works if the crawler chooses to obey it.
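To make "chooses to obey" concrete, here is a minimal sketch of the check a compliant crawler performs before fetching a URL, using Python's standard `urllib.robotparser`. The robots.txt rules, the `ExampleBot` agent name, and the example.com URLs are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, keep everyone out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("GPTBot", "https://example.com/blog/post"),
        ("ExampleBot", "https://example.com/blog/post"),
        ("ExampleBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(agent, url))
```

Nothing on the server runs this check. It lives entirely in the crawler's code, which is why a crawler that skips the call sees no difference at all.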

2.1.a -- Effectiveness Summary

Crawler Type	Compliance Level	Details
Major search engines (Googlebot, Bingbot, etc.)	High	Almost always respect robots.txt.
Major declared AI bots (GPTBot, ClaudeBot, Google-Extended, etc.)	Moderate to Good	Usually honor blocks for their named agents, but inconsistencies exist.
Aggressive bots (e.g. Bytespider / ByteDance)	Poor	Frequently ignore robots.txt, rotate user-agents and IPs to evade blocks.
Third-party scrapers & undisclosed bots	Very Poor / None	Often don't declare any bot name and completely ignore robots.txt.
User-triggered AI crawlers (e.g. Perplexity)	Variable	Many bypass robots.txt because they appear as normal user requests.

2.2 -- ai.txt, TDM Reservation Protocol & C2PA "Do Not Train" Flags

All of these are voluntary and non-binding, and only C2PA metadata is cryptographically signed. The IETF AI Preferences working group has produced thoughtful drafts, but no frontier AI lab has committed to honoring metadata flags at the ingestion step, which is the only step that actually matters. A "do not train" flag in image metadata is essentially a polite request taped to a copy of your work that is already sitting on someone else’s GPU.

2.2.a -- Effectiveness Summary

Mechanism	Effectiveness Level	Details
ai.txt	Very Low	Similar to robots.txt; compliance remains minimal.
TDM Reservation Protocol (TDMRep)	Low	Intended to reserve rights for text and data mining; little to no enforcement.
C2PA "Do Not Train" Flags	Low to Moderate	Cryptographically signed metadata, but easy to strip or ignore.
IETF AI Preferences drafts	Emerging / Low	No widespread enforcement or adoption by AI labs.

2.3 -- Terms of Service (ToS)

The Ninth Circuit's decision in hiQ Labs v. LinkedIn found that scraping publicly accessible data likely does not violate the CFAA. The main remaining claim is breach of ToS, which is theoretically possible but practically ineffective against foundation-model labs because of standing, damages, jurisdiction, and cost barriers.

2.3.a -- Effectiveness Summary

Mechanism	Effectiveness Level	Details
Terms of Service (ToS)	Very Low	Scraping public data is not a CFAA violation under existing legal precedent; ToS claims are hard to enforce.

2.4 -- Opt out by Email

OpenAI announced Media Manager in May 2024 for creators to opt out of training, but it has still not launched as of 2026. Stability AI’s pre-SD3 opt-out processed millions of requests, yet they trained on older LAION and Common Crawl data that already contained the images.

2.4.a -- Effectiveness Summary

Mechanism	Effectiveness Level	Details
Opt-out	Very Low	Undermined by ingesting older datasets.

2.5 -- Common Crawl laundering

Labs that respect current opt-outs still train on old Common Crawl snapshots collected years earlier. Content from past crawls (e.g. your 2019 blog in CC-MAIN-2019-39) remains permanently embedded in models. New Disallow rules have no retroactive effect. The supply chain diffuses responsibility across crawlers, datasets, and trainers.

2.5.a -- Effectiveness Summary

Mechanism	Effectiveness Level	Details
Common Crawl Laundering	Very Low	Old snapshots persist in models, no expiration for aged datasets.

3 -- Why being polite failed

The polite mechanisms aren't failing because we haven't iterated on them enough. They're failing because the incentive structure rewards ignoring them and there is no enforcement layer underneath. The failure isn't a series of bad-faith actors; it's structural, and it comes down to three main reasons.

3.1 -- No enforcement

Crawlers face zero cost for ignoring opt-outs, while the producer bears all the costs (bandwidth, CPU, cache pollution, outages). Polite protocols fail due to cost offloading.

3.2 -- Dataset laundering IS a feature

Labs point upstream to third-party datasets, offloading responsibility. "We trained on a public dataset" provides plausible deniability. The supply chain ensures scrapers get paid while labs get the data to build their products.

3.3 -- The regulatory vacuum

The EU AI Act's Article 53 is the only mechanism with real teeth: it requires general-purpose AI providers to put in place a policy to comply with Union copyright law, including TDM opt-outs. However, it applies only in the EU, only to models placed on the EU market, and only if ingestion of reserved works can be proven. The US has no equivalent. Litigation such as NYT v. OpenAI offers, at best, limited relief for one well-funded plaintiff, not for the average site operator.

4 -- What scrapers actually respond to

Scrapers are economic actors, and they respond to economic signals. There is a usable escalation ladder, with measures that scale in proportion to how much hostility the scraper has shown toward the operator's content.

4.1 -- Proof-of-work walls (Cost Impact)

Tools like Xe Iaso’s Anubis force suspicious clients (headless browsers or bots) to solve a lightweight JavaScript proof-of-work challenge. Real users pay a negligible one-time cost. Scrapers running many concurrent sessions pay that cost over and over, or simply fail. Deployments across multiple sites in 2024–2025 report a 90–95% reduction in bot traffic within the first week. Similar tools (e.g. go-away) work on the same principle: impose a per-request cost scrapers didn’t budget for.
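The asymmetry behind a proof-of-work wall can be sketched in a few lines: the client must try many hashes, while the server checks the answer with one. The following is a generic hashcash-style illustration in Python, not Anubis's actual challenge format. The challenge string, the `challenge:nonce` encoding, and the difficulty value are all assumptions made for the example.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until the hash meets the difficulty."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    # Hypothetical per-session challenge; expect roughly 2**12 hashes of client work.
    nonce = solve("hypothetical-session-token", difficulty=12)
    print(nonce, verify("hypothetical-session-token", nonce, 12))
```

Each extra bit of difficulty doubles the expected client work while the server's verification cost stays at a single hash. A human visitor pays that once per session; a crawler opening thousands of sessions pays it thousands of times.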

4.1.a -- Tooling

Tool	Description	Link
Anubis	Popular JS PoW challenge reverse proxy	GitHub - TecharoHQ/anubis
go-away	Self-hosted abuse detection with PoW and challenges	GitHub - WeebDataHoarder/go-away

4.2 -- Tarpits (Waste Impact)

Tools like Nepenthes, Iocaine, and Quixotic create hidden mazes of procedurally generated nonsense pages (Markov-chain text with infinite links). They are placed behind Disallow rules in robots.txt and kept unlinked. Compliant crawlers never see them. Violators get trapped, wasting their own bandwidth and time and degrading their corpus, at almost zero cost to the site owner.
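As an illustration of why the maze costs the operator almost nothing, here is a minimal Python sketch of a Markov-chain page generator. It is not how Nepenthes, Iocaine, or Quixotic are implemented. The seed corpus, the `/trap/` path, and the function names are hypothetical. Each page is derived from a hash of its own URL, so the maze is effectively infinite yet needs no storage.

```python
import hashlib
import random
from collections import defaultdict

# Hypothetical seed corpus; a real deployment would use a much larger text.
CORPUS = (
    "the crawler follows every link and the link leads to another page "
    "and the page leads the crawler to another link in the maze"
)


def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain: dict[str, list[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def render_page(path: str, chain: dict[str, list[str]], words: int = 40, links: int = 5) -> str:
    """Generate a deterministic nonsense page for a path, with links deeper into the maze."""
    # Seeding from the path keeps every URL stable across visits without storing anything.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    word = rng.choice(sorted(chain))
    output = [word]
    for _ in range(words - 1):
        followers = chain.get(word)
        # Restart from a random word when the current one has no recorded follower.
        word = rng.choice(followers) if followers else rng.choice(sorted(chain))
        output.append(word)
    hrefs = [f"{path.rstrip('/')}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "".join(f'<a href="{href}">{href}</a>\n' for href in hrefs)
    return f"<html><body><p>{' '.join(output)}</p>\n{anchors}</body></html>"


if __name__ == "__main__":
    print(render_page("/trap/", build_chain(CORPUS)))
```

Every generated link points further under the same disallowed prefix, so a crawler that ignored robots.txt once keeps finding new pages for as long as it keeps following links.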

4.2.a -- Tooling

Tool	Description	Link
Nepenthes	Tarpit generating endless garbage pages for rule-breaking crawlers	zadzmo.org/code/nepenthes
Iocaine	Reverse-proxy tarpit focused on poisoning AI datasets	iocaine.madhouse-project.org
Quixotic	Lightweight static tarpit for trapping scrapers	marcusb.org/hacks/quixotic.html

4.3 -- Poisoning (Corruption)

Nightshade (from the University of Chicago) perturbs images in CLIP-space to corrupt model concepts (e.g. “dog” drifts toward “cat”). Glaze protects artistic style from mimicry. Both are designed to survive common preprocessing and to be imperceptible to humans. Text poisoning is less mature; current approaches include bot-specific fact-flipping, prompt injections, and entity corruption.

4.3.a -- Tooling

Tool	Description	Link
Nightshade	Poisons image training data to destabilize AI concepts	nightshade.cs.uchicago.edu
Glaze	Protects artistic style from AI mimicry	glaze.cs.uchicago.edu

4.4 -- Active denial

More aggressive techniques include decompression bombs (tiny gzip files that expand to gigabytes), slow-loris-style responses that hold connections open for minutes, and deliberately malformed HTML designed to crash parsers. Misleading content embedded in page source, such as prompt injections in comments or misleading links, falls in the same category. These are served conditionally, based on user agent. They carry more legal risk than the previous methods; use them with caution and seek professional legal advice.

4.4.a -- Tooling

Most operators should start with a low-cost measure such as Anubis (or an equivalent) in front of the origin and escalate only when the bot population adapts, because the techniques below are high-risk and may violate laws in some jurisdictions.

Technique	Description	Link / Resource
Decompression Bombs	Small compressed files that expand massively when decompressed	Common in tools like ZipBomb or custom gzip implementations
Slow Loris	Holds HTTP connections open while sending minimal data, tying up the other side's connection resources	Slowloris and nginx/lua variants
Malformed HTML	Intentionally broken markup to crash weak parsers in scrapers	Custom server-side logic (no standard open-source tool)

5 -- The objections to the cause

This section plays devil's advocate against the arguments outlined previously; reader discretion is advised. These views may be skewed or biased depending on perspective or interpretation.

5.1 -- Poisoning is vandalism

Poisoning only affects the unauthorized copies made by scraper bots that feed the AI models used around the world. Your original content on your server remains untouched.

5.1.a -- Argument in reality

You are ONLY modifying your property in a way that is visible to those who took it without permission. The vandalism analogy borrows moral weight from a scenario which doesn’t apply here.

5.2 -- It hurts legitimate research

This objection has a clean solution: gate poisoning and tarpits behind verified-bot detection. Whitelist Internet Archive’s published UAs/IP ranges, Googlebot and Bingbot via reverse-DNS, and CCBot only if you want to be in Common Crawl.
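The reverse-DNS check mentioned above is usually done as a forward-confirmed lookup: resolve the client IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. Below is a minimal Python sketch. The suffix table reflects the crawler hostnames Google and Microsoft have documented, but treat it as an assumption and check their current documentation. The function names and the injectable resolver parameters are hypothetical conveniences for testing.

```python
import socket

# Hostname suffixes the operators document for their crawlers (verify before relying on them).
VERIFIED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}


def default_reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def default_forward(hostname: str) -> list[str]:
    """Forward lookup: hostname to its IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(ip: str, bot: str, reverse=default_reverse, forward=default_forward) -> bool:
    """Forward-confirmed reverse DNS: the claimed bot must own both directions of the lookup."""
    suffixes = VERIFIED_SUFFIXES.get(bot)
    if not suffixes:
        return False
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False  # PTR record points somewhere else: spoofed user agent
        return ip in forward(hostname)  # hostname must resolve back to the same IP
    except OSError:
        return False  # lookup failures are treated as unverified
```

The forward step matters because anyone who controls an IP block can publish an arbitrary PTR record. Only the real operator can make the forward record for that hostname point back to the same address. This sketch only handles IPv4 on the forward side; IPv6 clients would need `socket.getaddrinfo` instead.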

5.2.a -- Argument in reality

Serious tools already include this logic; it is just a whitelist that you choose to populate or not. You choose who and what can archive or access your content, not the other way around.

5.3 -- Labs will just filter it out

Some will try, to limit wasted time and effort and stay cost-effective. Nightshade already survives common preprocessing, and future versions will adapt. The goal of poisoning isn’t permanent immunity; it’s to impose a cost on the bots that carries through to the models.

5.3.a -- Argument in reality

Every filter and dataset-cleaning pass is an expensive tax you place on the lab. Paid deals (Reddit, AP, FT) happened because scraping became too costly. Make it more expensive for the lab to keep its status quo of profiting from your content without permission.

5.4 -- It's a cat-and-mouse game you can't win.

Correct, but irrelevant. Polite mechanisms aren’t winning either; they’re not even playing. The real choice is between imposing cost on scrapers and imposing zero cost.

5.4.a -- Argument in reality

It is a losing game in which the other side pays for every move. That is far better than one in which the content creator pays for everything and the scrapers and AI labs pay nothing in return. The bots and AI labs profit from the creative works of their victims. Creators have every right to protect their content and to control who has access to it.

5.5 -- What about the legality

Passively serving garbage content (tarpits, poisoned data) to a visitor that is not permitted or has ignored the rules (unauthorized access) is not illegal in itself. You’re only returning what they asked for. The legal risk is the same as serving a slow page.

5.5.a -- Argument in reality

Active attacks like decompression bombs or parser exploits are riskier; if you are considering them as an individual, seek legal counsel and proceed with caution. Bright line: serve garbage, don’t attack. It is on the bot and the lab to filter and clean their source content, not on you.

6 -- What you can do this weekend

The following is a set of quick, easy-to-deploy protections covering both closed commercial services and community-driven open-source projects. Used together, these solutions add layers of defense in the fight against the scraper bots used to train AI models. Some of the solutions listed are paid services. The Church of Malware (CoM) is not associated with these solutions and does not directly endorse them; it documents them because they are useful against AI bots and the AI labs' continual abuse.

6.1 -- Commercial (Free Tier) Solutions

Method	Effectiveness	Difficulty	Notes
Cloudflare Free Plan	High	Easy	Best starting point. Enable Bot Fight Mode and the one-click "Block AI bots" setting (Super Bot Fight Mode requires a paid plan). Blocks many known AI scrapers automatically.
Static Site + CDN	High	Easy	Use Cloudflare Pages, Netlify, or Vercel free tiers — all include built-in bot protection.

6.2 -- Open Source / Self-Hosted Solutions

Method	Effectiveness	Difficulty	Notes
Anubis (PoW Wall)	Very High	Medium	JavaScript proof-of-work challenge. Deployments report a 90-95% drop in bot traffic. Self-hosted.
robots.txt + Tarpit	Medium-High	Easy	Use Nepenthes or Iocaine on disallowed paths. Traps non-compliant crawlers only.
Rate Limiting (nginx)	Medium	Easy	Built-in nginx rate limiting.
User-Agent + IP Blocking	Medium	Easy	Block known AI bots (GPTBot, ClaudeBot, CCBot, Bytespider, etc.) via nginx or Apache. Google-Extended is a robots.txt token only and never appears as a request user agent.
Fail2Ban	Medium	Medium	Bans repeat offenders that hit tarpits or disallowed paths.

6.3 -- Recommended Free Stack

Start with Cloudflare Free (easiest, commercial)
Add the self-hosted layer: Anubis + Nepenthes
Incorporate web server rate limiting (nginx)

6.3.a -- NGINX rate limiting example

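    # In the http {} context: one shared 10 MB zone keyed by client IP, 1 request/second.
    limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=1r/s;
    server {
        # Serve bursts of up to 5 extra requests without delay; reject anything beyond that.
        limit_req zone=ai_limit burst=5 nodelay;
        limit_req_status 429;  # default is 503
    }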

6.4 -- If you run a small site / business

Deploy Anubis (or an equivalent PoW wall) in front of your origin. It's a simple nginx reverse-proxy setup. Additionally, add a Nepenthes tarpit behind a Disallow path in robots.txt; it has versatile deployment options (bare metal or Docker). Monitor the logs afterwards: most sites see a sharp drop in bot traffic and bandwidth costs.
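One simple way to monitor the result is to count which clients are requesting the disallowed tarpit path, since by construction only rule-breaking crawlers should ever reach it. The Python sketch below assumes an access log in the common/combined log format and a hypothetical `/trap/` prefix. Adjust both to match your own server and robots.txt.

```python
import re
from collections import Counter

# Matches the start of a common/combined log format line: client IP, then the request line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)')


def tarpit_hits(lines, trap_prefix="/trap/"):
    """Count requests per client IP whose path falls under the disallowed trap prefix."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(trap_prefix):
            hits[match.group("ip")] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical log lines using documentation-reserved IP ranges.
    sample = [
        '203.0.113.7 - - [02/Jun/2026:14:13:23 +0000] "GET /trap/a1b2 HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
        '203.0.113.7 - - [02/Jun/2026:14:13:24 +0000] "GET /trap/a1b2/c3d4 HTTP/1.1" 200 498 "-" "ExampleBot/1.0"',
        '198.51.100.2 - - [02/Jun/2026:14:13:25 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    ]
    for ip, count in tarpit_hits(sample).most_common():
        print(ip, count)
```

The resulting IP list is also a natural input for the Fail2Ban approach listed in 6.2.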

6.5 -- If you're a content creator

Assume every image will be ingested and make ingestion expensive for these AI bots but accessible to the human consumer. For text, use distinctive phrasing and deliberate canary strings as watermarks. Keep a private log of what you published and when. Proof of ingestion is essential for any future legal remedy in case your content has been consumed and regenerated by some model.
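A canary string only helps later if it is unique to one page and you can prove when you published it. One possible approach, sketched here in Python, is to derive each canary from a private secret plus the page URL and publication date, and to append a matching record to a private log. The `canary-` prefix, the field layout, and the function names are assumptions for illustration, not an established scheme.

```python
import hashlib
import hmac
import json
from datetime import date


def make_canary(secret: bytes, url: str, published: date, length: int = 16) -> str:
    """Derive a unique, reproducible token for one published page."""
    message = f"{url}|{published.isoformat()}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"canary-{digest[:length]}"


def log_entry(secret: bytes, url: str, published: date) -> str:
    """Build one JSON line for the private publication log."""
    record = {
        "url": url,
        "published": published.isoformat(),
        "canary": make_canary(secret, url, published),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical secret and URL; keep the real secret out of version control.
    print(log_entry(b"keep-this-private", "https://example.com/essay", date(2026, 6, 2)))
```

Because the token is keyed with a secret, nobody else can generate a matching canary for your URL after the fact. If the string later surfaces in a model's output, the log ties it to a specific page and date.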

6.5.a -- Everything will be consumed

Run Glaze on images you post publicly.
Run Nightshade on images you want strong protection against fine-tuning.

6.6 -- If you're a publisher, university, or cultural institution

Stop negotiating licensing from a position of weakness. The major deals signed in 2023–2025 (Reddit, AP, News Corp, Axel Springer) happened only after scraping became expensive or legally risky. Make your content expensive to scrape first, then use that as leverage when negotiating licensing. Labs won’t pay for what they can take for free; don't willingly participate as the victim.

7 -- The reframe is the whole point

Opt-out is begging: asking powerful actors with no incentive to comply to please listen. Poisoning is bargaining: you impose a real cost they must either pay or work around. Polite mechanisms failed because they assumed good faith from actors whose entire business model depends on its absence. The next decade of the open web depends on operators realizing the bargaining power they’ve always had is sitting in their own server config, ready to be used.

8 -- References

This section captures the references consulted and paraphrased to produce this publication, and points the reader to additional information and resources useful for academic research and study of the topics discussed in this document.

8.1 -- Section 1: Documented incidents

Section	Claim	Source
1.1	Wired investigation: Perplexity ignored robots.txt; undeclared crawler on AWS IP range scraped articles after PerplexityBot was disallowed; Perplexity blamed a third-party contractor	Mehrotra & Marchman, "Perplexity Is a Bullshit Machine," WIRED, 19 Jun 2024 — https://www.wired.com/story/perplexity-is-a-bullshit-machine/
1.1	Follow-up: Perplexity hidden crawler details and AWS IP confirmation	"Perplexity Is a Bullshit Machine" (above) and Robb Knight, "Perplexity AI Is Lying about Their User Agent," 14 Jun 2024 — https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/
1.2	iFixit CEO: Anthropic ClaudeBot ~1M hits in 24h	Kyle Wiens (@kwiens) on X, 24 Jul 2024 — https://x.com/kwiens/status/1816128302542905620 ; coverage: 404 Media, "Anthropic AI Scraper Hammers iFixit's Website a Million Times in a Day," 24 Jul 2024 — https://www.404media.co/anthropic-ai-scraper-hammers-ifixits-website-a-million-times-in-a-day/
1.2	Read the Docs: AI crawler bandwidth abuse (73 TB / month from one crawler, $5,000+ in bandwidth charges)	Eric Holscher, "AI crawlers need to be more respectful," Read the Docs blog, 25 Jul 2024 — https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
1.2	Wikimedia: 65% of most expensive (uncached) traffic from bots; multimedia bandwidth +50% since Jan 2024	Mueller, Danis & Lavagetto, "How crawlers impact the operations of the Wikimedia projects," Diff (Wikimedia), 1 Apr 2025 — https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/
1.2	SourceHut / Drew DeVault: AI crawlers degrading small-team infrastructure	Drew DeVault, "Please stop externalizing your costs directly into my face," 17 Mar 2025 — https://drewdevault.com/blog/Stop-externalizing-your-costs-on-me/ ; The Register coverage, 18 Mar 2025 — https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/

8.2 -- Section 2: Polite mechanisms

Section	Claim	Source
2.1	robots.txt history (Martijn Koster, 1994)	"A Standard for Robot Exclusion," 1994 — https://www.robotstxt.org/orig.html ; RFC 9309 "Robots Exclusion Protocol" — https://www.rfc-editor.org/rfc/rfc9309.html
2.1	Bytespider / undeclared crawlers ignoring robots.txt and rotating UAs	Cloudflare Radar verified bots — https://radar.cloudflare.com/traffic/verified-bots ; Originality.AI, "AI Bot Robots.txt Compliance Study," 2024 — https://originality.ai/blog/ai-bot-robots-txt
2.2	IETF AI Preferences working group	IETF AIPREF WG — https://datatracker.ietf.org/wg/aipref/about/
2.2	TDM Reservation Protocol	W3C Community Group — https://www.w3.org/community/tdmrep/ ; spec: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240202/
2.2	C2PA "Do Not Train" / training-and-data-mining assertion	C2PA Technical Specification 2.x — https://c2pa.org/specifications/specifications/2.0/specs/C2PA_Specification.html
2.3	hiQ Labs v. LinkedIn (CFAA / public scraping)	hiQ Labs, Inc. v. LinkedIn Corp., 9th Cir. 2022 — https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf
2.4	OpenAI Media Manager announcement (May 2024)	OpenAI, "Our approach to data and AI models," 7 May 2024 — https://openai.com/index/approach-to-data-and-ai/ ; status reporting: TechCrunch, "OpenAI's Media Manager has missed its deadline," Oct 2024 — https://techcrunch.com/2024/10/30/openais-media-manager-where-is-it/
2.4	Stability AI opt-out (pre-SD3) via Have I Been Trained / Spawning	Spawning AI / Have I Been Trained — https://haveibeentrained.com ; Stability AI announcement, Dec 2022 — https://stability.ai/news/stable-diffusion-v2-release
2.5	Common Crawl scope / persistence in training corpora	Common Crawl — https://commoncrawl.org/ ; Mozilla / 2024 study "Training Data for the Price of a Sandwich" — https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/

8.3 -- Section 3: Regulation and litigation

Section	Claim	Source
3.3	EU AI Act Article 53 (GPAI obligations re: TDM opt-out)	Regulation (EU) 2024/1689, Art. 53 — https://eur-lex.europa.eu/eli/reg/2024/1689/oj ; Commission GPAI Code of Practice — https://digital-strategy.ec.europa.eu/en/policies/ai-code-practice
3.3	NYT v. OpenAI / Microsoft	Complaint, S.D.N.Y. 1:23-cv-11195, 27 Dec 2023 — https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

8.4 -- Section 4: Active countermeasures

Section	Claim	Source
4.1	Anubis (PoW reverse proxy) — 90-95% bot drop reports	Project: https://github.com/TecharoHQ/anubis (19.7k stars, MIT) ; documentation: https://anubis.techaro.lol/ ; deployment write-ups: Xe Iaso, "Anubis works," 19 Jan 2025 — https://xeiaso.net/blog/2025/anubis/ ; UNESCO / GNOME GitLab adoption coverage: https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
4.1	go-away (alternative PoW / abuse detection)	https://git.gammaspectra.live/git/go-away (mirror: https://github.com/WeebDataHoarder/go-away)
4.2	Nepenthes tarpit (Aaron / zadzmo)	https://zadzmo.org/code/nepenthes/ ; coverage: Ars Technica, "AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt," 28 Jan 2025 — https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
4.2	Iocaine	https://iocaine.madhouse-project.org/ ; source: https://git.madhouse-project.org/algernon/iocaine
4.2	Quixotic	Marcus Butler, https://marcusb.org/hacks/quixotic.html ; source: https://github.com/marcusbuffett/quixotic
4.3	Nightshade (poisoning) — Shan, Ding, Passananti, Zheng, Zhao	Project: https://nightshade.cs.uchicago.edu/ ; paper: "Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models," IEEE S&P 2024, arXiv:2310.13828 — https://arxiv.org/abs/2310.13828
4.3	Glaze (style protection)	https://glaze.cs.uchicago.edu/ ; paper: "Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models," USENIX Security 2023 — https://www.usenix.org/conference/usenixsecurity23/presentation/shan
4.4	Slowloris attack	Robert "RSnake" Hansen, original 2009 — archived: https://web.archive.org/web/20090822001255/http://ha.ckers.org/slowloris/ ; modern impl: https://github.com/gkbrk/slowloris
4.4	Decompression / zip bombs (background)	https://www.bamsoftware.com/hacks/zipbomb/

8.5 -- Section 6: Mitigations

Section	Claim	Source
6.1	Cloudflare Bot Fight Mode / AI scraper blocking (free tier, default July 2024)	Cloudflare blog, "Declaring your AIndependence: block AI bots, scrapers and crawlers with a single click," 3 Jul 2024 — https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/ ; Cloudflare "AI Audit," Sep 2024 — https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-content-crawlers/
6.3.a	nginx limit_req_zone	nginx docs — https://nginx.org/en/docs/http/ngx_http_limit_req_module.html
6.6	2024-25 licensing deals (Reddit/Google, AP/OpenAI, News Corp/OpenAI, Axel Springer/OpenAI)	Reddit-Google: https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/ ; AP-OpenAI: https://apnews.com/article/openai-chatgpt-associated-press-ap-f86f84c5bcc2f3b98074b38521f5f75a ; News Corp-OpenAI: https://www.wsj.com/business/media/openai-news-corp-strike-deal-23f2e4b3 ; Axel Springer-OpenAI: https://www.axelspringer.com/en/ax-press-release/axel-springer-and-openai-partner-to-deepen-beneficial-use-of-ai-in-journalism

8.6 -- Background reading

Cory Doctorow, "AI 'art' and uncanniness," 2024 — https://pluralistic.net/2024/09/27/economic-incentives/
404 Media, "The Open Secret of Google Search: Most of the Internet Is Now AI-Polluted Garbage" — https://www.404media.co/
MIT Technology Review, "The AI crawler wars threaten to make the web more closed for everyone," 11 Feb 2025 — https://www.technologyreview.com/2025/02/11/1111518/ai-crawler-wars-closed-web/
The Register, "Open source devs are fighting AI crawlers with cleverness and vengeance," 18 Mar 2025 — https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
