When Being Polite Fails, Try Poison
The Church of Malware (CoM) does not condone the use or introduction of toxic substances on nature, humans, or animals; however, AI is neither natural, nor human, nor actual intelligence.
1 -- The problem is well known (Historical Reference)
This section captures relevant historical and legal precedent set forth by studies or cases that are unrelated to CoM but are needed to understand the underlying issues with the modern AI labs and the models they produce.
1.1 -- Wired v Perplexity
In June 2024, Wired's engineering team watched Perplexity fetch articles it had been explicitly told not to fetch. The site's robots.txt disallowed PerplexityBot. Perplexity's declared crawler honored it. Then requests kept arriving from an undeclared user agent on an AWS IP range, pulling the same URLs and surfacing them, near-verbatim, in Perplexity answers minutes later. When confronted, the company called it a third-party contractor problem.
1.2 -- iFixit v Anthropic
iFixit's CEO posted server logs showing Anthropic's ClaudeBot hitting his site close to a million times in twenty-four hours. Read the Docs disclosed more than $5,000 in bandwidth charges driven by AI crawlers, with a single crawler pulling 73 TB in one month. Wikimedia reported that roughly sixty-five percent of its most expensive traffic, the uncached, high-cost requests, now came from bots, against a human readership that had not grown enough to account for the load. By early 2025, SourceHut's Drew DeVault was writing the same post every other month: "Please stop, we are a small team, we will go down."
1.3 -- Synopsis
None of those operators were asking for novel protections. They were asking that the existing ones, which were being willfully ignored, be honored. Every polite mechanism the web has shipped in the last thirty years (robots.txt, ai.txt, IETF content-usage preferences, even the "email us to opt out" forms) has been treated as advisory at best and as a target list at worst. The only language scrapers have demonstrably responded to is cost and corruption. This document aims to help the masses make their own contribution to saving the planet by way of hacking the planet.
2 -- The graveyard of good-faith mechanisms
Every one of these mechanisms assumes the scraper wants to comply. Six years of evidence says it doesn't.
2.1 -- Robots.txt History
robots.txt was created in 1994 by Martijn Koster after a rogue crawler took down his server. It’s a voluntary “gentleman’s agreement” stored as a plain file at the root of your domain. It only works if the crawler chooses to obey it.
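To make the "voluntary" part concrete, here is a minimal sketch of the check a compliant crawler is expected to run before each fetch, using Python's standard `urllib.robotparser`. The rules, the `SomeBrowser` agent name, and the `example.com` URLs are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeBrowser"):
        for path in ("/blog/post", "/private/notes"):
            print(agent, path, is_allowed(agent, "https://example.com" + path))
```

Nothing on the server side forces a crawler to call anything like `is_allowed`. The check runs entirely in the crawler's own code, which is why skipping it costs the crawler nothing.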
2.1.a -- Effectiveness Summary
| Crawler Type | Compliance Level | Details |
|---|---|---|
| Major search engines (Googlebot, Bingbot, etc.) | High | Almost always respect robots.txt. |
| Major declared AI bots (GPTBot, ClaudeBot, Google-Extended, etc.) | Moderate to Good | Usually honor blocks for their named agents, but inconsistencies exist. |
| Aggressive bots (e.g. Bytespider / ByteDance) | Poor | Frequently ignore robots.txt, rotate user-agents and IPs to evade blocks. |
| Third-party scrapers & undisclosed bots | Very Poor / None | Often don't declare any bot name and completely ignore robots.txt. |
| User-triggered AI crawlers (e.g. Perplexity) | Variable | Many bypass robots.txt because they appear as normal user requests. |
2.2 -- ai.txt, TDM Reservation Protocol & C2PA "Do Not Train" Flags
All of these are voluntary and non-binding, and most are unsigned. The IETF AI Preferences working group has produced thoughtful drafts, but no frontier AI lab has committed to honoring metadata flags at the ingestion step, which is the only step that actually matters. A "do not train" flag in image metadata is essentially a polite request taped to a copy of your work that is already sitting on someone else’s GPU.
2.2.a -- Effectiveness Summary
| Mechanism | Effectiveness Level | Details |
|---|---|---|
| ai.txt | Very Low | Similar to robots.txt; compliance remains minimal. |
| TDM Reservation Protocol (TDMRep) | Low | Intended to reserve text-and-data-mining rights; little to no enforcement. |
| C2PA "Do Not Train" Flags | Low to Moderate | Cryptographically signed metadata, but easy to strip or ignore. |
| IETF AI Preferences drafts | Emerging / Low | No widespread enforcement or adoption by AI labs. |
2.3 -- Terms of Service (ToS)
hiQ v. LinkedIn established that scraping public data is not a CFAA violation. The only remaining claim is breach of ToS, which is theoretically possible but practically ineffective against foundation labs due to standing, damages, jurisdiction, and cost barriers.
2.3.a -- Effectiveness Summary
| Mechanism | Effectiveness Level | Details |
|---|---|---|
| Terms of Service (ToS) | Very Low | Scraping public data is not a CFAA violation under current legal precedent; breach-of-ToS claims are hard to enforce. |
2.4 -- Opt out by Email
OpenAI announced Media Manager in May 2024 for creators to opt out of training, but it has still not launched as of 2026. Stability AI’s pre-SD3 opt-out processed millions of requests, yet they trained on older LAION and Common Crawl data that already contained the images.
2.4.a -- Effectiveness Summary
| Mechanism | Effectiveness Level | Details |
|---|---|---|
| Opt-out | Very Low | Undermined by ingesting older datasets. |
2.5 -- Common Crawl laundering
Even labs that respect current opt-out signals still train on old Common Crawl snapshots collected years earlier. Content from past crawls (e.g. your 2019 blog in CC-MAIN-2019-39) remains permanently embedded in models. New Disallow rules have no retroactive effect. The supply chain diffuses responsibility across crawlers, datasets, and trainers.
2.5.a -- Effectiveness Summary
| Mechanism | Effectiveness Level | Details |
|---|---|---|
| Common Crawl Laundering | Very Low | Old snapshots persist in models, no expiration for aged datasets. |
3 -- Why being polite failed
The polite mechanisms aren't failing because we haven't iterated on them enough. They're failing because the incentive structure rewards ignoring them and there is no enforcement layer underneath. The failure isn't a series of bad-faith actors; it's structural, and it comes down to three main reasons.
3.1 -- No enforcement
Crawlers face zero cost for ignoring opt-outs, while the producer bears all the costs (bandwidth, CPU, cache pollution, outages). Polite protocols fail due to cost offloading.
3.2 -- Dataset laundering IS a feature
Labs point upstream to third-party datasets, offloading responsibility. "We trained on a public dataset" provides plausible deniability. The supply chain ensures scrapers get paid while labs get the data to build their products.
3.3 -- The regulatory vacuum
The EU AI Act's Article 53 is the only mechanism with real teeth: it requires general-purpose AI providers to respect Union copyright law and TDM opt-outs. However, it applies only in the EU, only to models placed on the EU market, and only if ingestion of reserved works can be proven. The US has no equivalent. NYT v. OpenAI offers limited relief for one well-funded plaintiff, not for the average site operator.
4 -- What scrapers actually respond to
Scrapers are economic actors, and they respond to economic signals. There is a usable escalation ladder, with measures that scale in proportion to how much hostility the scraper has shown toward the operator's efforts to protect their content.
4.1 -- Proof-of-work walls (Cost Impact)
Tools like Xe Iaso’s Anubis force suspicious clients (headless browsers or bots) to solve a lightweight JavaScript proof-of-work challenge. Real users pay a negligible one-time cost. Scrapers face high concurrent costs or simply fail. Deployments across multiple sites in 2024–2025 report a 90–95% reduction in bot traffic within the first week. Similar tools (e.g. go-away) work on the same principle: impose a per-request cost scrapers didn’t budget for.
4.1.a -- Tooling
| Tool | Description | Link |
|---|---|---|
| Anubis | Popular JS PoW challenge reverse proxy | https://github.com/TecharoHQ/anubis |
| go-away | Self-hosted abuse detection with PoW and challenges | https://github.com/WeebDataHoarder/go-away |
4.2 -- Tarpits (Waste Impact)
Tools like Nepenthes, Iocaine, and Quixotic create hidden mazes of procedurally generated nonsense pages (Markov-chain text with infinite links). They are placed behind Disallow rules in robots.txt and kept unlinked. Compliant crawlers never see them. Violators get trapped, wasting bandwidth and time while degrading their own corpus. The cost to the site owner is almost zero.
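The core of such a maze is small. The sketch below is not how Nepenthes, Iocaine, or Quixotic are implemented; it is a minimal illustration of Markov-chain babble plus endless links, under the assumption of a tiny hard-coded corpus and a hypothetical `/trap` path prefix. Seeding the random generator with the request path keeps every URL stable across visits, so the maze looks like real static content.

```python
import random
from collections import defaultdict

# Hypothetical seed corpus; a real deployment would use a much larger text.
CORPUS = (
    "the crawler follows every link and the link leads to another page "
    "and the page holds more text and the text leads the crawler on"
)


def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain: dict, rng: random.Random, length: int = 40) -> str:
    """Walk the chain to produce plausible-looking nonsense."""
    word = rng.choice(sorted(chain))
    out = [word]
    for _ in range(length - 1):
        # A word with no recorded successor is a dead end: restart from a random word.
        word = rng.choice(chain[word]) if chain.get(word) else rng.choice(sorted(chain))
        out.append(word)
    return " ".join(out)


def render_page(path: str, chain: dict, links: int = 5) -> str:
    """Build one maze page; the same path always yields the same page."""
    rng = random.Random(path)
    body = babble(chain, rng)
    anchors = "".join(
        f'<a href="{path.rstrip("/")}/{rng.getrandbits(32):08x}">more</a>\n'
        for _ in range(links)
    )
    return f"<html><body><p>{body}</p>\n{anchors}</body></html>"


if __name__ == "__main__":
    print(render_page("/trap", build_chain(CORPUS)))
```

Every generated link points deeper into the same prefix, so a crawler that ignores the Disallow rule never runs out of pages to fetch.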
4.2.a -- Tooling
| Tool | Description | Link |
|---|---|---|
| Nepenthes | Tarpit generating endless garbage pages for rule-breaking crawlers | zadzmo.org/code/nepenthes |
| Iocaine | Reverse-proxy tarpit focused on poisoning AI datasets | iocaine.madhouse-project.org |
| Quixotic | Lightweight static tarpit for trapping scrapers | marcusb.org/hacks/quixotic.html |
4.3 -- Poisoning (Corruption)
Nightshade (from the University of Chicago) perturbs images so that, in a model's feature space, they resemble a different concept, corrupting what the model learns (e.g. “dog” drifts toward “cat”). Glaze protects artistic style from mimicry. Both are designed to survive common preprocessing and to be imperceptible to humans. Text poisoning is less mature; current approaches include bot-specific fact-flipping, prompt injections, and entity corruption.
4.3.a -- Tooling
| Tool | Description | Link |
|---|---|---|
| Nightshade | Poisons image training data to destabilize AI concepts | nightshade.cs.uchicago.edu |
| Glaze | Protects artistic style from AI mimicry | glaze.cs.uchicago.edu |
4.4 -- Active denial
More aggressive techniques include decompression bombs (tiny gzip files that expand to gigabytes), slow-loris connections that hold requests open for minutes, deliberately malformed HTML designed to crash parsers, and misleading or hostile comments in page source (prompt injection, misleading links). These are served conditionally, based on user-agent. They can raise legal issues and are riskier than the previous methods; use them with caution and seek professional legal advice.
4.4.a -- Tooling
Most operators should start with a low-cost measure such as Anubis or an equivalent in front of the origin, and escalate only when the bot population adapts, because the techniques below are high-risk and may violate laws in some jurisdictions.
| Technique | Description | Link / Resource |
|---|---|---|
| Decompression Bombs | Small compressed files that expand massively when decompressed | Common in tools like ZipBomb or custom gzip implementations |
| Slow Loris | Holds HTTP connections open with minimal data to tie up the requester's connections and time | Slowloris and nginx/lua variants |
| Malformed HTML | Intentionally broken markup to crash weak parsers in scrapers | Custom server-side logic (no standard open-source tool) |
5 -- The objections to the cause
This section plays devil's advocate against the concerns outlined previously; reader discretion is advised. These views may be skewed or biased by perspective or interpretation.
5.1 -- Poisoning is vandalism
Poisoning only affects the unauthorized copies made by the scraper bots that feed the AI models used around the world. Your original content on your server remains untouched.
5.1.a -- Argument in reality
You are ONLY modifying your property in a way that is visible to those who took it without permission. The vandalism analogy borrows moral weight from a scenario which doesn’t apply here.
5.2 -- It hurts legitimate research
This objection has a clean solution: gate poisoning and tarpits behind verified-bot detection. Whitelist Internet Archive’s published UAs/IP ranges, Googlebot and Bingbot via reverse-DNS, and CCBot only if you want to be in Common Crawl.
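The reverse-DNS check mentioned above is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The sketch below assumes the domain suffixes that Google and Microsoft document for their crawlers; verify them against the operators' current documentation before relying on them. The function names are hypothetical, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Assumed suffixes for Googlebot and Bingbot reverse-DNS names; confirm before use.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list:
    """A-record lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(ip, suffixes=TRUSTED_SUFFIXES,
                    reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Forward-confirmed reverse DNS: the PTR name must be trusted AND map back to ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(tuple(suffixes)):
            return False
        # Anyone can set a PTR record claiming to be a search engine;
        # only the real owner of the domain controls the forward record.
        return ip in forward(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

A request that claims a search-engine user-agent but fails `is_verified_bot(client_ip)` can then be routed to the tarpit rather than the whitelist. The lookups are injectable so the logic can be tested without network access.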
5.2.a -- Argument in reality
Serious tools already include this logic; it’s just a simple whitelist of who you permit. You choose who and what can archive or access your content, not the other way around.
5.3 -- Labs will just filter it out
Some will try, trading filtering effort against cost. Nightshade already survives common preprocessing, and future versions will adapt. The goal of poisoning isn’t permanent immunity; it’s to impose a cost on the bots, and through them on the models.
5.3.a -- Argument in reality
Every filter and dataset-cleaning pass is an expensive tax you place on the lab. Paid deals (Reddit, AP, FT) happened because scraping became too costly. Make it more expensive for the lab to keep its status quo of profiting from your content without permission.
5.4 -- It's a cat-and-mouse game you can't win
Correct, but irrelevant. Polite mechanisms aren’t winning either; they’re not even playing. The real choice is between imposing a cost on scrapers and imposing none. That choice belongs to the individual, and control remains in the creator's hands, not with the bots or the AI labs. It's better to be the cat in this game; choose your character.
5.4.a -- Argument in reality
It is a losing game in which the other side pays for every move. That is far better than one in which the content creator pays for everything and the scrapers and AI labs pay nothing in return. The bots and AI labs profit from the creative works of the victim. Creators have every right to protect their content and to control who has access to it.
5.5 -- What about the legality
Passively serving garbage content (tarpits, poisoned data) to a visitor who was not permitted, or who ignored the rules (unauthorized access), is not illegal in itself. You’re only returning what they asked for. The legal risk is the same as serving a slow page.
5.5.a -- Argument in reality
Active attacks like decompression bombs or parser exploits are riskier; seek legal counsel and proceed with caution before using them as an individual. Bright line: serve garbage, don’t attack. It's on the bot and the lab to filter and clean their source content, not on you.
6 -- What you can do this weekend
The following is a quick, easy-to-deploy set of protections covering both closed commercial services and community-driven open-source projects. Used together, these solutions add layers of defense against the scraper bots used to train AI models. Some of the solutions listed are paid services; the Church of Malware (CoM) is not associated with them and does not directly endorse them, but documents their usefulness in the fight against AI bots and the AI labs' continual abuse.
6.1 -- Commercial (Free Tier) Solutions
| Method | Effectiveness | Difficulty | Notes |
|---|---|---|---|
| Cloudflare Free Plan | High | Easy | Best starting point. Enable Bot Fight Mode (Super Bot Fight Mode requires a paid plan). Automatically blocks most AI scrapers. |
| Static Site + CDN | High | Easy | Use Cloudflare Pages, Netlify, or Vercel free tiers — all include built-in bot protection. |
6.2 -- Open Source / Self-Hosted Solutions
| Method | Effectiveness | Difficulty | Notes |
|---|---|---|---|
| Anubis (PoW Wall) | Very High | Medium | JavaScript proof-of-work challenge. Drops 90-95% of bot traffic. Self-hosted. |
| robots.txt + Tarpit | Medium-High | Easy | Use Nepenthes or Iocaine on disallowed paths. Traps non-compliant crawlers only. |
| Rate Limiting (nginx) | Medium | Easy | Built-in nginx rate limiting. |
| User-Agent + IP Blocking | Medium | Easy | Block known AI bots (GPTBot, ClaudeBot, CCBot, etc.) via nginx or Apache. Google-Extended is a robots.txt token only and cannot be matched by user-agent. |
| Fail2Ban | Medium | Medium | Bans repeat offenders that hit tarpits or disallowed paths. |
6.3 -- Recommended Free Stack
- Start with Cloudflare Free (easiest/commercial)
- Integrate with self-hosted tools: Anubis + Nepenthes
- Incorporate web server rate limiting (nginx)
6.3.a -- NGINX rate limiting example
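# Goes in the http {} context: track clients by IP, allow 1 request/second each.
limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=1r/s;
server {
    # Let up to 5 excess requests through without delay; reject anything beyond that.
    limit_req zone=ai_limit burst=5 nodelay;
}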
6.4 -- If you run a small site / business
Deploy Anubis (or an equivalent PoW wall) in front of your origin; it's a simple nginx reverse-proxy setup. Additionally, add a Nepenthes tarpit behind a Disallow path in robots.txt; it has versatile deployment options (bare metal or Docker). Monitor the logs to confirm the impact: most sites see a sharp drop in bot traffic and bandwidth costs.
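For the log-monitoring step, a rough sketch like the following can show which user-agents send the most requests and which ones wander into the tarpit. It assumes nginx's default "combined" access-log format and a hypothetical `/trap/` prefix for the disallowed path; adjust both to your setup.

```python
import re
from collections import Counter

# Pulls the request path and the last quoted field (the user-agent)
# out of a combined-format access log line.
LINE_RE = re.compile(
    r'"(?:[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines, trap_prefix="/trap/"):
    """Count total requests and tarpit hits per user-agent."""
    per_agent = Counter()
    trapped = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # Skip lines that are not in the expected format.
        agent = match.group("agent")
        per_agent[agent] += 1
        if match.group("path").startswith(trap_prefix):
            trapped[agent] += 1
    return per_agent, trapped
```

Passing an open log file to `summarize` and printing `per_agent.most_common(10)` before and after deploying the wall gives a simple before/after comparison. Any agent that appears in `trapped` has, by construction, ignored the Disallow rule.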
6.5 -- If you're a content creator
Assume every image will be ingested and make ingestion expensive for these AI bots but accessible to the human consumer. For text, use distinctive phrasing and deliberate canary strings as watermarks. Keep a private log of what you published and when. Proof of ingestion is essential for any future legal remedy in case your content has been consumed and regenerated by some model.
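One way to produce canary strings and the private log is sketched below. The scheme is an assumption for illustration: it derives each marker from a secret key and a document ID with HMAC, so the marker is unguessable to outsiders but reproducible by you. The `canary-` prefix, the field names, and the function names are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def make_canary(secret: bytes, doc_id: str, length: int = 16) -> str:
    """Derive a unique, reproducible marker for one document from a private secret."""
    digest = hmac.new(secret, doc_id.encode(), hashlib.sha256).hexdigest()
    return f"canary-{digest[:length]}"


def log_entry(doc_id: str, canary: str, text: str) -> str:
    """One JSON line recording what was published, when, and with which marker."""
    return json.dumps({
        "doc_id": doc_id,
        "canary": canary,
        "sha256": hashlib.sha256(text.encode()).hexdigest(),
        "published_utc": datetime.now(timezone.utc).isoformat(),
    })


if __name__ == "__main__":
    marker = make_canary(b"replace-with-a-real-secret", "blog/2025/my-post")
    article = f"My article text. {marker}"
    print(log_entry("blog/2025/my-post", marker, article))
```

If a model later reproduces the marker, the log entry ties it to a specific document and publication time. Keep the secret and the log offline, since their evidentiary value depends on nobody else being able to generate the same strings.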
6.5.a -- Everything will be consumed
- Run Glaze on images you post publicly.
- Run Nightshade on images you want strong protection against fine-tuning.
6.6 -- If you're a publisher, university, or cultural institution
Stop negotiating licensing from a position of weakness. The major deals signed in 2024–2025 (Reddit, AP, News Corp, Axel Springer) happened only after scraping became expensive or legally risky. Make your content expensive to scrape first, then use licensing as leverage in negotiation. Labs won’t pay for what they can take for free; don't willingly play the victim.
7 -- The reframe is the whole point
Opt-out is begging: asking powerful actors with no incentive to comply to please listen. Poisoning is bargaining: you impose a real cost they must either pay or work around. Polite mechanisms failed because they assumed good faith from actors whose entire business model depends on its absence. The next decade of the open web depends on operators realizing that the bargaining power they’ve always had is sitting in their own server config, ready to be used.
8 -- References
This section lists the references consulted and paraphrased to produce this publication, to give the reader additional information and resources for academic research and study of the topics discussed in this document.
8.1 -- Section 1: Documented incidents
| # | Claim | Source |
|---|---|---|
| 1.1 | Wired investigation: Perplexity ignored robots.txt; undeclared crawler on AWS IP range scraped articles after PerplexityBot was disallowed; Perplexity blamed a third-party contractor | Mehrotra & Marchman, "Perplexity Is a Bullshit Machine," WIRED, 19 Jun 2024 — https://www.wired.com/story/perplexity-is-a-bullshit-machine/ |
| 1.1 | Follow-up: Perplexity hidden crawler details and AWS IP confirmation | "Perplexity Is a Bullshit Machine" (above) and Robb Knight, "Perplexity AI Is Lying about Their User Agent," 14 Jun 2024 — https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/ |
| 1.2 | iFixit CEO: Anthropic ClaudeBot ~1M hits in 24h | Kyle Wiens (@kwiens) on X, 24 Jul 2024 — https://x.com/kwiens/status/1816128302542905620 ; coverage: 404 Media, "Anthropic AI Scraper Hammers iFixit's Website a Million Times in a Day," 24 Jul 2024 — https://www.404media.co/anthropic-ai-scraper-hammers-ifixits-website-a-million-times-in-a-day/ |
| 1.2 | Read the Docs: AI crawler bandwidth abuse (73 TB / month from one crawler, $5,000+ in bandwidth charges) | Eric Holscher, "AI crawlers need to be more respectful," Read the Docs blog, 25 Jul 2024 — https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ |
| 1.2 | Wikimedia: 65% of most expensive (uncached) traffic from bots; multimedia bandwidth +50% since Jan 2024 | Mueller, Danis & Lavagetto, "How crawlers impact the operations of the Wikimedia projects," Diff (Wikimedia), 1 Apr 2025 — https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/ |
| 1.2 | SourceHut / Drew DeVault: AI crawlers degrading small-team infrastructure | Drew DeVault, "Please stop externalizing your costs directly into my face," 17 Mar 2025 — https://drewdevault.com/blog/Stop-externalizing-your-costs-on-me/ ; The Register coverage, 18 Mar 2025 — https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/ |
8.2 -- Section 2: Polite mechanisms
| # | Claim | Source |
|---|---|---|
| 2.1 | robots.txt history (Martijn Koster, 1994) | "A Standard for Robot Exclusion," 1994 — https://www.robotstxt.org/orig.html ; RFC 9309 "Robots Exclusion Protocol" — https://www.rfc-editor.org/rfc/rfc9309.html |
| 2.1 | Bytespider / undeclared crawlers ignoring robots.txt and rotating UAs | Cloudflare Radar verified bots — https://radar.cloudflare.com/traffic/verified-bots ; Originality.AI, "AI Bot Robots.txt Compliance Study," 2024 — https://originality.ai/blog/ai-bot-robots-txt |
| 2.2 | IETF AI Preferences working group | IETF AIPREF WG — https://datatracker.ietf.org/wg/aipref/about/ |
| 2.2 | TDM Reservation Protocol | W3C Community Group — https://www.w3.org/community/tdmrep/ ; spec: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240202/ |
| 2.2 | C2PA "Do Not Train" / training-and-data-mining assertion | C2PA Technical Specification 2.x — https://c2pa.org/specifications/specifications/2.0/specs/C2PA_Specification.html |
| 2.3 | hiQ Labs v. LinkedIn (CFAA / public scraping) | hiQ Labs, Inc. v. LinkedIn Corp., 9th Cir. 2022 — https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf |
| 2.4 | OpenAI Media Manager announcement (May 2024) | OpenAI, "Our approach to data and AI models," 7 May 2024 — https://openai.com/index/approach-to-data-and-ai/ ; status reporting: TechCrunch, "OpenAI's Media Manager has missed its deadline," Oct 2024 — https://techcrunch.com/2024/10/30/openais-media-manager-where-is-it/ |
| 2.4 | Stability AI opt-out (pre-SD3) via Have I Been Trained / Spawning | Spawning AI / Have I Been Trained — https://haveibeentrained.com ; Stability AI announcement, Dec 2022 — https://stability.ai/news/stable-diffusion-v2-release |
| 2.5 | Common Crawl scope / persistence in training corpora | Common Crawl — https://commoncrawl.org/ ; Mozilla / 2024 study "Training Data for the Price of a Sandwich" — https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/ |
8.3 -- Section 3: Regulation and litigation
| # | Claim | Source |
|---|---|---|
| 3.3 | EU AI Act Article 53 (GPAI obligations re: TDM opt-out) | Regulation (EU) 2024/1689, Art. 53 — https://eur-lex.europa.eu/eli/reg/2024/1689/oj ; Commission GPAI Code of Practice — https://digital-strategy.ec.europa.eu/en/policies/ai-code-practice |
| 3.3 | NYT v. OpenAI / Microsoft | Complaint, S.D.N.Y. 1:23-cv-11195, 27 Dec 2023 — https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf |
8.4 -- Section 4: Active countermeasures
| # | Claim | Source |
|---|---|---|
| 4.1 | Anubis (PoW reverse proxy) — 90-95% bot drop reports | Project: https://github.com/TecharoHQ/anubis (19.7k stars, MIT) ; documentation: https://anubis.techaro.lol/ ; deployment write-ups: Xe Iaso, "Anubis works," 19 Jan 2025 — https://xeiaso.net/blog/2025/anubis/ ; UNESCO / GNOME GitLab adoption coverage: https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |
| 4.1 | go-away (alternative PoW / abuse detection) | https://git.gammaspectra.live/git/go-away (mirror: https://github.com/WeebDataHoarder/go-away) |
| 4.2 | Nepenthes tarpit (Aaron / zadzmo) | https://zadzmo.org/code/nepenthes/ ; coverage: Ars Technica, "AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt," 28 Jan 2025 — https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/ |
| 4.2 | Iocaine | https://iocaine.madhouse-project.org/ ; source: https://git.madhouse-project.org/algernon/iocaine |
| 4.2 | Quixotic | Marcus Bointon, https://marcusb.org/hacks/quixotic.html ; source: https://github.com/marcusbuffett/quixotic |
| 4.3 | Nightshade (poisoning) — Shan, Ding, Passananti, Zheng, Zhao | Project: https://nightshade.cs.uchicago.edu/ ; paper: "Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models," IEEE S&P 2024, arXiv:2310.13828 — https://arxiv.org/abs/2310.13828 |
| 4.3 | Glaze (style protection) | https://glaze.cs.uchicago.edu/ ; paper: "Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models," USENIX Security 2023 — https://www.usenix.org/conference/usenixsecurity23/presentation/shan |
| 4.4 | Slowloris attack | Robert "RSnake" Hansen, original 2009 — archived: https://web.archive.org/web/20090822001255/http://ha.ckers.org/slowloris/ ; modern impl: https://github.com/gkbrk/slowloris |
| 4.4 | Decompression / zip bombs (background) | https://www.bamsoftware.com/hacks/zipbomb/ |
8.5 -- Section 6: Mitigations
8.6 -- Background reading
- Cory Doctorow, "AI 'art' and uncanniness," 2024 — https://pluralistic.net/2024/09/27/economic-incentives/
- 404 Media, "The Open Secret of Google Search: Most of the Internet Is Now AI-Polluted Garbage" — https://www.404media.co/
- MIT Technology Review, "The AI crawler wars threaten to make the web more closed for everyone," 11 Feb 2025 — https://www.technologyreview.com/2025/02/11/1111518/ai-crawler-wars-closed-web/
- The Register, "Open source devs are fighting AI crawlers with cleverness and vengeance," 18 Mar 2025 — https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/