Lyre/Dissertation.md
2026-06-02 14:13:23 +00:00

When Being Polite Fails, Try Poison

The Church of Malware (CoM) does not condone the use or introduction of toxic substances on nature, humans, or animals; however, AI is neither natural, nor human, nor actual intelligence.

1 -- The problem is well known (Historical Reference)

This section captures relevant historical and legal precedent set forth in studies and cases that are not related to CoM but are needed to understand the underlying issues with modern AI labs and the models they produce.

1.1 -- Wired v Perplexity

In June 2024, Wired's engineering team watched Perplexity fetch articles it had been explicitly told not to fetch. The site's robots.txt disallowed PerplexityBot. Perplexity's declared crawler honored it. Then requests kept arriving from an undeclared user agent on an AWS IP range, pulling the same URLs and surfacing them, near-verbatim, in Perplexity answers minutes later. When confronted, the company called it a third-party contractor problem.

1.2 -- iFixit v Anthropic

iFixit's CEO posted server logs showing Anthropic's ClaudeBot hitting his site close to a million times in twenty-four hours. Read the Docs disclosed bandwidth charges of more than $5,000 driven almost entirely by AI scrapers. Wikimedia reported that roughly sixty-five percent of its most expensive, uncached traffic now came from crawlers, against a human readership that had not grown enough to account for such an impact. By early 2025, SourceHut's Drew DeVault was writing the same post every other month: "Please stop, we are a small team, we will go down."

1.3 -- Synopsis

None of those operators were asking for novel protections. They were asking that the existing ones, which were being willfully ignored, be honored. Every polite mechanism the web has shipped in the last thirty years, such as robots.txt, ai.txt, IETF content-usage preferences, or even the "email us to opt out" forms, has been treated as advisory at best and as a target list at worst. The only language scrapers have demonstrably responded to is cost and corruption. This document aims to help the masses make their own contribution to saving the planet by way of hacking the planet.

2 -- The graveyard of good-faith mechanisms

Every one of these mechanisms assumes the scraper wants to comply. Six years of evidence say they don't.

2.1 -- Robots.txt History

robots.txt was created in 1994 by Martijn Koster after a rogue crawler took down his server. It's a voluntary "gentleman's agreement" stored as a plain file at the root of your domain. It only works if the crawler chooses to obey it.
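
To make the "voluntary" part concrete, here is a minimal sketch of the check a well-behaved crawler runs before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `is_allowed` helper are assumptions made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only (not a real site's file).
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A declared bot that bothers to ask is told "no" ...
declared = is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/article")
# ... while the same request under a generic browser user agent is told "yes".
undeclared = is_allowed(ROBOTS_TXT, "Mozilla/5.0", "https://example.com/article")
```

The check runs entirely on the crawler's side. Nothing on the server stops a client that skips the call or changes the user-agent string it presents.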

2.1.a -- Effectiveness Summary

| Crawler Type | Compliance Level | Details |
| --- | --- | --- |
| Major search engines (Googlebot, Bingbot, etc.) | High | Almost always respect robots.txt. |
| Major declared AI bots (GPTBot, ClaudeBot, Google-Extended, etc.) | Moderate to Good | Usually honor blocks for their named agents, but inconsistencies exist. |
| Aggressive bots (e.g. Bytespider / ByteDance) | Poor | Frequently ignore robots.txt, rotate user-agents and IPs to evade blocks. |
| Third-party scrapers & undisclosed bots | Very Poor / None | Often don't declare any bot name and completely ignore robots.txt. |
| User-triggered AI crawlers (e.g. Perplexity) | Variable | Many bypass robots.txt because they appear as normal user requests. |

2.2 -- ai.txt, TDM Reservation Protocol & C2PA "Do Not Train" Flags

All of these are voluntary, unsigned, and non-binding. The IETF AI Preferences working group has produced thoughtful drafts, but no frontier AI lab has committed to honoring metadata flags at the ingestion step, which is the only step that actually matters. A "do not train" flag in image metadata is essentially a polite request taped to a copy of your work that is already sitting on someone else's GPU.

2.2.a -- Effectiveness Summary

| Mechanism | Effectiveness Level | Details |
| --- | --- | --- |
| ai.txt | Very Low | Similar to robots.txt; compliance remains minimal. |
| TDM Reservation Protocol (TDMRep) | Low | Intended to reserve rights for text and data mining; little to no enforcement. |
| C2PA "Do Not Train" Flags | Low to Moderate | Cryptographically signed metadata, but easy to strip or ignore. |
| IETF AI Preferences drafts | Emerging / Low | No widespread enforcement or adoption by AI labs. |

2.3 -- Terms of Service (ToS)

hiQ v. LinkedIn established that scraping public data is not a CFAA violation. The only remaining claim is breach of ToS, which is theoretically possible but practically ineffective against foundation labs due to standing, damages, jurisdiction, and cost barriers.

2.3.a -- Effectiveness Summary

| Mechanism | Effectiveness Level | Details |
| --- | --- | --- |
| Terms of Service (ToS) | Very Low | Scraping public data is not a CFAA violation under existing legal precedent; breach of ToS is hard to enforce. |

2.4 -- Opt out by Email

OpenAI announced Media Manager in May 2024 for creators to opt out of training, but it has still not launched as of 2026. Stability AI's pre-SD3 opt-out processed millions of requests, yet the company trained on older LAION and Common Crawl data that already contained the images.

2.4.a -- Effectiveness Summary

| Mechanism | Effectiveness Level | Details |
| --- | --- | --- |
| Opt-out | Very Low | Undermined by ingesting older datasets. |

2.5 -- Common Crawl laundering

Labs that respect current opt-out signals still train on old Common Crawl snapshots collected years earlier. Content from past crawls (e.g. your 2019 blog in CC-MAIN-2019-39) remains permanently embedded in models. New Disallow rules have no retroactive effect. The supply chain diffuses responsibility across crawlers, datasets, and trainers.

2.5.a -- Effectiveness Summary

| Mechanism | Effectiveness Level | Details |
| --- | --- | --- |
| Common Crawl Laundering | Very Low | Old snapshots persist in models, no expiration for aged datasets. |

3 -- Why being polite failed

The polite mechanisms aren't failing because we haven't iterated on them enough. They're failing because the incentive structure rewards ignoring them and there is no enforcement layer underneath. The failure isn't a series of bad-faith actors; it's structural, and it resolves to three main reasons.

3.1 -- No enforcement

Crawlers face zero cost for ignoring opt-outs, while the producer bears all the costs (bandwidth, CPU, cache pollution, outages). Polite protocols fail due to cost offloading.

3.2 -- Dataset laundering IS a feature

Labs point upstream to third-party datasets, offloading responsibility. "We trained on a public dataset" provides plausible deniability. The supply chain ensures scrapers get paid while labs get the data to build their products.

3.3 -- The regulatory vacuum

The EU AI Act's Article 53 is the only mechanism with real teeth: it requires general-purpose AI providers to respect Union copyright law and TDM opt-outs. However, it applies only in the EU, only to models placed on the EU market, and only if ingestion of reserved works can be proven. The US has no equivalent. NYT v. OpenAI offers limited relief, and only for one well-funded plaintiff, not the average site operator.

4 -- What scrapers actually respond to

Scrapers are economic actors, and they respond to economic signals. There is a usable escalation ladder, with measures that scale in proportion to how much hostility the scraper has shown toward the operator's efforts to protect their content.

4.1 -- Proof-of-work walls (Cost Impact)

Tools like Xe Iaso's Anubis force suspicious clients (headless browsers or bots) to solve a lightweight JavaScript proof-of-work challenge. Real users pay a negligible one-time cost. Scrapers face high concurrent costs or simply fail. Deployments across multiple sites in 2024-2025 show a 90-95% reduction in bot traffic within the first week. Similar tools (e.g. go-away) work on the same principle: impose a per-request cost scrapers didn't budget for.
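
The principle can be sketched in a few lines. This is a generic hash-based proof-of-work, not Anubis's actual protocol: the `challenge:nonce` encoding, the use of SHA-256, and the `difficulty_bits` parameter are assumptions chosen for illustration. The asymmetry is what matters: the client must search for a nonce, while the server verifies it with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash decides whether the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force nonces until one passes verification."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


# Hypothetical challenge string; a real deployment would issue a fresh one per client.
nonce = solve("example-challenge", 12)
```

Each extra bit of difficulty doubles the expected client work (about 2^12 hashes here), while verification stays at one hash. A human pays that once per visit; a scraper opening thousands of sessions pays it thousands of times.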

4.1.a -- Tooling

| Tool | Description | Link |
| --- | --- | --- |
| Anubis | Popular JS PoW challenge reverse proxy | github.com/TecharoHQ/anubis |
| go-away | Self-hosted abuse detection with PoW and challenges | github.com/WeebDataHoarder/go-away |

4.2 -- Tarpits (Waste Impact)

Tools like Nepenthes, Iocaine, and Quixotic create hidden mazes of procedurally generated nonsense pages (Markov-chain text with infinite links). They are placed behind Disallow rules in robots.txt and kept unlinked. Compliant crawlers never see them. Violators get trapped, wasting bandwidth and time and degrading their corpus. The cost to the site owner is almost zero.
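
A rough sketch of the idea follows. It is not how Nepenthes, Iocaine, or Quixotic are implemented; the seed text, the `/maze/` prefix, and every function name here are hypothetical. Each page is derived from a hash of its own path, so the maze looks stable to a crawler while the server stores nothing.

```python
import hashlib
import random
from collections import defaultdict

# Hypothetical seed corpus; any text works as training input for the chain.
SEED_TEXT = (
    "the quiet river carries the old lantern past the sleeping orchard while "
    "the lantern keeper counts the river stones and the orchard wind carries "
    "the old keeper home past the quiet stones"
)


def build_chain(text: str) -> dict:
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict, rng: random.Random, length: int) -> str:
    """Walk the chain for `length` words, restarting at dead ends."""
    starts = sorted(chain)
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


def render_page(path: str, chain: dict, prefix: str = "/maze/", links: int = 5) -> str:
    """Build a nonsense page whose content and outgoing links depend only on `path`."""
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    body = babble(chain, rng, 80)
    hrefs = [f"{prefix}{rng.getrandbits(48):012x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{babble(chain, rng, 3)}</a>' for h in hrefs)
    return f"<html><body><p>{body}</p>\n{anchors}\n</body></html>"
```

Every generated link leads to another page under the same prefix, so a crawler that ignores the Disallow rule never runs out of URLs to follow.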

4.2.a -- Tooling

| Tool | Description | Link |
| --- | --- | --- |
| Nepenthes | Tarpit generating endless garbage pages for rule-breaking crawlers | zadzmo.org/code/nepenthes |
| Iocaine | Reverse-proxy tarpit focused on poisoning AI datasets | iocaine.madhouse-project.org |
| Quixotic | Lightweight static tarpit for trapping scrapers | marcusb.org/hacks/quixotic.html |

4.3 -- Poisoning (Corruption)

Nightshade (from the University of Chicago) perturbs images in CLIP-space to corrupt model concepts (e.g. "dog" drifts toward "cat"). Glaze protects artistic style from mimicry. Both survive common preprocessing and are imperceptible to humans. Text poisoning is less mature: bot-specific fact-flipping, prompt injections, and entity corruption.

4.3.a -- Tooling

| Tool | Description | Link |
| --- | --- | --- |
| Nightshade | Poisons image training data to destabilize AI concepts | nightshade.cs.uchicago.edu |
| Glaze | Protects artistic style from AI mimicry | glaze.cs.uchicago.edu |

4.4 -- Active denial

More aggressive techniques include decompression bombs (tiny gzip files that expand to gigabytes), slow-loris connections that hold requests open for minutes, and deliberately malformed HTML designed to crash parsers. They also include misleading or harmful comments in page source (prompt injection, misleading links). These are served conditionally based on user-agent. They can pose legal issues and are riskier than the previous methods; use them with caution and seek professional legal advice.

4.4.a -- Tooling

Most operators should start with a low-cost measure such as Anubis or an equivalent in front of the origin, and escalate only when the bot population adapts, as these steps are high-risk methods and may violate laws in some jurisdictions.

| Technique | Description | Link / Resource |
| --- | --- | --- |
| Decompression Bombs | Small compressed files that expand massively when decompressed | Common in tools like ZipBomb or custom gzip implementations |
| Slow Loris | Holds HTTP connections open with minimal data to exhaust server resources | Slowloris and nginx/lua variants |
| Malformed HTML | Intentionally broken markup to crash weak parsers in scrapers | Custom server-side logic (no standard open-source tool) |

5 -- The objections to the cause

This section plays devil's advocate to the arguments outlined previously; reader discretion is advised. These views may be skewed or biased by perspective or interpretation.

5.1 -- Poisoning is vandalism

Poisoning only affects unauthorized copies made by the scraper bots that feed the training of AIs used around the world. Your original content on your server remains untouched.

5.1.a -- Argument in reality

You are ONLY modifying your property in a way that is visible to those who took it without permission. The vandalism analogy borrows moral weight from a scenario which doesn't apply here.

5.2 -- It hurts legitimate research

This objection has a clean solution: gate poisoning and tarpits behind verified-bot detection. Whitelist the Internet Archive's published UAs/IP ranges, Googlebot and Bingbot via reverse DNS, and CCBot only if you want to be in Common Crawl.
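
Reverse-DNS verification usually means forward-confirmed reverse DNS: look up the hostname for the client IP, check that it falls under a domain the bot operator publishes, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below assumes the hostname suffixes shown (check each operator's current documentation before relying on them) and uses hypothetical function names. The resolvers are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Sequence

# Assumed suffixes for illustration; confirm against each operator's documentation.
VERIFIED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> Sequence[str]:
    """A-record lookup: hostname to IPv4 addresses (IPv6 is not covered here)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(
    ip: str,
    suffixes: Sequence[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Sequence[str]] = forward_lookup,
) -> bool:
    """Return True only if ip passes forward-confirmed reverse DNS for suffixes."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record: treat as unverified
    if not hostname.endswith(tuple(suffixes)):
        return False  # hostname is outside the operator's domains
    try:
        return ip in forward(hostname)  # the forward lookup must map back to the IP
    except OSError:
        return False
```

A client that merely sets "Googlebot" in its user-agent string fails this check, so it can be routed to the tarpit while the real crawler passes through.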

5.2.a -- Argument in reality

Serious tools already include this logic, and it's just a simple whitelist you permit or not. You choose who and what can archive or access your content, not the other way around.

5.3 -- Labs will just filter it out

Some will try, weighing the time and effort of filtering against its cost effectiveness. Nightshade already survives common preprocessing, and future versions will adapt to the times. The goal of poisoning isn't permanent immunity; it's to impose a cost on the bots, which in turn impacts the models.

5.3.a -- Argument in reality

Every filter and dataset-cleaning pass is an expensive tax you place on the lab. Paid deals (Reddit, AP, FT) happened because scraping became too costly. Make it more expensive for the lab to keep its status quo and keep profiting from your content without permission.

5.4 -- It's a cat-and-mouse game you can't win.

Correct, but irrelevant. Polite mechanisms aren't winning either; they're not even playing. The real choice is between imposing cost on scrapers and imposing zero cost.

5.4.a -- Argument in reality

It is a losing game in which the other side pays for every move. That is far better than one in which the content creator pays for everything and the scrapers and AI creators pay nothing in return. The bots and AI labs profit from the creative works of the victim. Creators have every right to protect their content and control who has access to it.

5.5 -- What about the legality

Passively serving garbage content (tarpits, poisoned data) to a requesting visitor who is not permitted or who has ignored the rules (unauthorized access) is not illegal in itself. You're only returning what they asked for. The legal risk is the same as serving a slow page.

5.5.a -- Argument in reality

Active attacks like decompression bombs or parser exploits are riskier; as an individual, seek legal counsel and proceed with caution. Bright line: serve garbage, don't attack. It's on the bot and the lab to filter and clean their source content, not you.

6 -- What can you do over this weekend

The following is a quick, easy-to-accomplish set of protections covering both commercial closed-source and community-driven open-source projects. Used together, these solutions add layers of defense in the fight against the scraper bots used to train AI models. Some of the solutions listed are paid services. The Church of Malware (CoM) is not associated with these solutions and does not directly endorse them; it is, however, documenting their usefulness in the fight against AI bots and the AI labs' continual abuse.

6.1 -- Commercial (Free Tier) Solutions

| Method | Effectiveness | Difficulty | Notes |
| --- | --- | --- | --- |
| Cloudflare Free Plan | High | Easy | Best starting point. Enable Bot Fight Mode and the one-click AI bot blocking option (Super Bot Fight Mode requires a paid plan). Automatically blocks most AI scrapers. |
| Static Site + CDN | High | Easy | Use Cloudflare Pages, Netlify, or Vercel free tiers; all include built-in bot protection. |

6.2 -- Open Source / Self-Hosted Solutions

| Method | Effectiveness | Difficulty | Notes |
| --- | --- | --- | --- |
| Anubis (PoW Wall) | Very High | Medium | JavaScript proof-of-work challenge. Drops 90-95% of bot traffic. Self-hosted. |
| robots.txt + Tarpit | Medium-High | Easy | Use Nepenthes or Iocaine on disallowed paths. Traps non-compliant crawlers only. |
| Rate Limiting (nginx) | Medium | Easy | Built-in nginx rate limiting. |
| User-Agent + IP Blocking | Medium | Easy | Block known AI bots (GPTBot, Google-Extended, CCBot, etc.) via nginx or Apache. |
| Fail2Ban | Medium | Medium | Bans repeat offenders that hit tarpits or disallowed paths. |

  • Start with Cloudflare Free (easiest/commercial)
  • Integrate with self-hosted: Anubis + Nepenthes
  • Incorporate web server rate limiting (nginx)

6.3.a -- NGINX rate limiting example

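    # http {} context: track clients by IP in a 10 MB shared zone, 1 request/second each.
    limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=1r/s;
    server {
        # Let up to 5 excess requests through without delay; reject the rest (503 by default).
        limit_req zone=ai_limit burst=5 nodelay;
    }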
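
To see what `rate=1r/s` and `burst=5 nodelay` mean in practice, the following is a simplified Python model of leaky-bucket accounting. It is an approximation written for illustration, not nginx's code: the class name, the float arithmetic, and the choice to leave state untouched on rejection are assumptions about how the module behaves.

```python
class LimitReqModel:
    """Simplified per-key leaky bucket approximating limit_req with nodelay."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.state = {}  # key -> (excess requests, time of last accepted request)

    def allow(self, key: str, now: float) -> bool:
        if key not in self.state:
            self.state[key] = (0.0, now)
            return True
        excess, last = self.state[key]
        # Drain the bucket for the elapsed time, then add this request.
        excess = max(excess - self.rate * (now - last) + 1.0, 0.0)
        if excess > self.burst:
            return False  # over the burst allowance: reject, state unchanged
        self.state[key] = (excess, now)
        return True


model = LimitReqModel(rate_per_sec=1.0, burst=5)
# Ten requests from one hypothetical client arriving at the same instant.
accepted = sum(model.allow("203.0.113.7", now=0.0) for _ in range(10))
```

Under this model a client sending ten requests at once gets six through (one plus the burst of five) and is then held to one request per second, which is negligible friction for a human reader and a hard ceiling for a crawler.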

6.4 -- If you run a small site / business

Deploy Anubis (or an equivalent PoW wall) in front of your origin. It's a simple nginx reverse-proxy setup. Additionally, add a Nepenthes tarpit behind a Disallow path in robots.txt; it has versatile deployment options (bare metal or Docker). Monitoring the logs should show a positive impact, as most sites see a sharp drop in bot traffic and bandwidth costs.
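
One way to watch the logs is to count which clients keep requesting the disallowed tarpit path, since anything that lands there has already ignored robots.txt. The sketch below assumes access logs in the common "combined" format and a tarpit mounted at `/maze/`; the prefix, the threshold, and the function name are hypothetical.

```python
import re
from collections import Counter

# Matches the start of a combined-format access log line (assumed log format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def tarpit_offenders(lines, tarpit_prefix="/maze/", threshold=3):
    """Return {ip: hits} for clients with at least `threshold` requests under the tarpit prefix."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(tarpit_prefix):
            hits[match.group("ip")] += 1
    return {ip: n for ip, n in hits.items() if n >= threshold}
```

The resulting addresses are candidates for a ban list. The Fail2Ban row in section 6.2 describes the same pattern, applied automatically.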

6.5 -- If you're a content creator

Assume every image will be ingested and make ingestion expensive for these AI bots but accessible to the human consumer. For text, use distinctive phrasing and deliberate canary strings as watermarks. Keep a private log of what you published and when. Proof of ingestion is essential for any future legal remedy in case your content has been consumed and regenerated by some model.
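
As one possible way to produce canary strings and the private log, the sketch below derives a short, unguessable token per piece from a secret key and records it with a content hash and a UTC timestamp. The token format, the field names, and both function names are assumptions for illustration, not an established standard.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def canary_for(secret: bytes, title: str) -> str:
    """Derive a stable, unguessable token to embed somewhere in the published text."""
    digest = hmac.new(secret, title.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"canary-{digest[:16]}"


def log_entry(secret: bytes, title: str, body: str, published=None) -> dict:
    """Build one private log record: what was published, when, and with which canary."""
    published = published or datetime.now(timezone.utc)
    return {
        "title": title,
        "published_utc": published.isoformat(),
        "canary": canary_for(secret, title),
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

If a model later reproduces the token, the log ties that output to a specific piece and publication date, and the keyed derivation makes it hard to argue the string arose by coincidence.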

6.5.a -- Everything will be consumed

  • Run Glaze on images you post publicly.
  • Run Nightshade on images you want strong protection against fine-tuning.

6.6 -- If you're a publisher, university, or cultural institution

Stop negotiating licensing from weakness. The major deals signed in 2024-2025 (Reddit, AP, News Corp, Axel Springer) happened only after scraping became expensive or legally risky. Make your content expensive to scrape first, then use licensing second as leverage for negotiation. Labs won't pay for what they can take for free; don't willingly participate as the victim.

7 -- The reframe is the whole point

Opt-out is begging: asking powerful actors with no incentive to comply to please listen. Poisoning is bargaining; you impose a real cost they must either pay or work around. Polite mechanisms failed because they assumed good faith from actors whose entire business model depends on its absence. The next decade of the open web depends on operators realizing the bargaining power they've always had is sitting in their own server config, ready to be used.

8 -- References

This section captures the references consumed and paraphrased in order to produce this publication. It gives the reader additional information and resources useful for academic research and study of the underlying topics discussed within this document.

8.1 -- Section 1: Documented incidents

| Section | Claim | Source |
| --- | --- | --- |
| 1.1 | Wired investigation: Perplexity ignored robots.txt; undeclared crawler on AWS IP range scraped articles after PerplexityBot was disallowed; Perplexity blamed a third-party contractor | Mehrotra & Marchman, "Perplexity Is a Bullshit Machine," WIRED, 19 Jun 2024, https://www.wired.com/story/perplexity-is-a-bullshit-machine/ |
| 1.1 | Follow-up: Perplexity hidden crawler details and AWS IP confirmation | "Perplexity Is a Bullshit Machine" (above) and Robb Knight, "Perplexity AI Is Lying about Their User Agent," 14 Jun 2024, https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/ |
| 1.2 | iFixit CEO: Anthropic ClaudeBot ~1M hits in 24h | Kyle Wiens (@kwiens) on X, 24 Jul 2024, https://x.com/kwiens/status/1816128302542905620 ; coverage: 404 Media, "Anthropic AI Scraper Hammers iFixit's Website a Million Times in a Day," 24 Jul 2024, https://www.404media.co/anthropic-ai-scraper-hammers-ifixits-website-a-million-times-in-a-day/ |
| 1.2 | Read the Docs: AI crawler bandwidth abuse (73 TB / month from one crawler, $5,000+ in bandwidth charges) | Eric Holscher, "AI crawlers need to be more respectful," Read the Docs blog, 25 Jul 2024, https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ |
| 1.2 | Wikimedia: 65% of most expensive (uncached) traffic from bots; multimedia bandwidth +50% since Jan 2024 | Mueller, Danis & Lavagetto, "How crawlers impact the operations of the Wikimedia projects," Diff (Wikimedia), 1 Apr 2025, https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/ |
| 1.2 | SourceHut / Drew DeVault: AI crawlers degrading small-team infrastructure | Drew DeVault, "Please stop externalizing your costs directly into my face," 17 Mar 2025, https://drewdevault.com/blog/Stop-externalizing-your-costs-on-me/ ; The Register coverage, 18 Mar 2025, https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/ |

8.2 -- Section 2: Polite mechanisms

| Section | Claim | Source |
| --- | --- | --- |
| 2.1 | robots.txt history (Martijn Koster, 1994) | "A Standard for Robot Exclusion," 1994, https://www.robotstxt.org/orig.html ; RFC 9309 "Robots Exclusion Protocol," https://www.rfc-editor.org/rfc/rfc9309.html |
| 2.1 | Bytespider / undeclared crawlers ignoring robots.txt and rotating UAs | Cloudflare Radar verified bots, https://radar.cloudflare.com/traffic/verified-bots ; Originality.AI, "AI Bot Robots.txt Compliance Study," 2024, https://originality.ai/blog/ai-bot-robots-txt |
| 2.2 | IETF AI Preferences working group | IETF AIPREF WG, https://datatracker.ietf.org/wg/aipref/about/ |
| 2.2 | TDM Reservation Protocol | W3C Community Group, https://www.w3.org/community/tdmrep/ ; spec: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240202/ |
| 2.2 | C2PA "Do Not Train" / training-and-data-mining assertion | C2PA Technical Specification 2.x, https://c2pa.org/specifications/specifications/2.0/specs/C2PA_Specification.html |
| 2.3 | hiQ Labs v. LinkedIn (CFAA / public scraping) | hiQ Labs, Inc. v. LinkedIn Corp., 9th Cir. 2022, https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf |
| 2.4 | OpenAI Media Manager announcement (May 2024) | OpenAI, "Our approach to data and AI models," 7 May 2024, https://openai.com/index/approach-to-data-and-ai/ ; status reporting: TechCrunch, "OpenAI's Media Manager has missed its deadline," Oct 2024, https://techcrunch.com/2024/10/30/openais-media-manager-where-is-it/ |
| 2.4 | Stability AI opt-out (pre-SD3) via Have I Been Trained / Spawning | Spawning AI / Have I Been Trained, https://haveibeentrained.com ; Stability AI announcement, Dec 2022, https://stability.ai/news/stable-diffusion-v2-release |
| 2.5 | Common Crawl scope / persistence in training corpora | Common Crawl, https://commoncrawl.org/ ; Mozilla / 2024 study "Training Data for the Price of a Sandwich," https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/ |

8.3 -- Section 3: Regulation and litigation

| Section | Claim | Source |
| --- | --- | --- |
| 3.3 | EU AI Act Article 53 (GPAI obligations re: TDM opt-out) | Regulation (EU) 2024/1689, Art. 53, https://eur-lex.europa.eu/eli/reg/2024/1689/oj ; Commission GPAI Code of Practice, https://digital-strategy.ec.europa.eu/en/policies/ai-code-practice |
| 3.3 | NYT v. OpenAI / Microsoft | Complaint, S.D.N.Y. 1:23-cv-11195, 27 Dec 2023, https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf |

8.4 -- Section 4: Active countermeasures

| Section | Claim | Source |
| --- | --- | --- |
| 4.1 | Anubis (PoW reverse proxy), 90-95% bot drop reports | Project: https://github.com/TecharoHQ/anubis (19.7k stars, MIT) ; documentation: https://anubis.techaro.lol/ ; deployment write-ups: Xe Iaso, "Anubis works," 19 Jan 2025, https://xeiaso.net/blog/2025/anubis/ ; UNESCO / GNOME GitLab adoption coverage: https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |
| 4.1 | go-away (alternative PoW / abuse detection) | https://git.gammaspectra.live/git/go-away (mirror: https://github.com/WeebDataHoarder/go-away) |
| 4.2 | Nepenthes tarpit (Aaron / zadzmo) | https://zadzmo.org/code/nepenthes/ ; coverage: Ars Technica, "AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt," 28 Jan 2025, https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/ |
| 4.2 | Iocaine | https://iocaine.madhouse-project.org/ ; source: https://git.madhouse-project.org/algernon/iocaine |
| 4.2 | Quixotic | Marcus Bointon, https://marcusb.org/hacks/quixotic.html ; source: https://github.com/marcusbuffett/quixotic |
| 4.3 | Nightshade (poisoning): Shan, Ding, Passananti, Zheng, Zhao | Project: https://nightshade.cs.uchicago.edu/ ; paper: "Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models," IEEE S&P 2024, arXiv:2310.13828, https://arxiv.org/abs/2310.13828 |
| 4.3 | Glaze (style protection) | https://glaze.cs.uchicago.edu/ ; paper: "Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models," USENIX Security 2023, https://www.usenix.org/conference/usenixsecurity23/presentation/shan |
| 4.4 | Slowloris attack | Robert "RSnake" Hansen, original 2009, archived: https://web.archive.org/web/20090822001255/http://ha.ckers.org/slowloris/ ; modern impl: https://github.com/gkbrk/slowloris |
| 4.4 | Decompression / zip bombs (background) | https://www.bamsoftware.com/hacks/zipbomb/ |

8.5 -- Section 6: Mitigations

| Section | Claim | Source |
| --- | --- | --- |
| 6.1 | Cloudflare Bot Fight Mode / AI scraper blocking (free tier, default July 2024) | Cloudflare blog, "Declaring your AIndependence: block AI bots, scrapers and crawlers with a single click," 3 Jul 2024, https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/ ; Cloudflare "AI Audit," Sep 2024, https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-content-crawlers/ |
| 6.3.a | nginx limit_req_zone | nginx docs, https://nginx.org/en/docs/http/ngx_http_limit_req_module.html |
| 6.6 | 2024-25 licensing deals (Reddit/Google, AP/OpenAI, News Corp/OpenAI, Axel Springer/OpenAI) | Reddit-Google: https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/ ; AP-OpenAI: https://apnews.com/article/openai-chatgpt-associated-press-ap-f86f84c5bcc2f3b98074b38521f5f75a ; News Corp-OpenAI: https://www.wsj.com/business/media/openai-news-corp-strike-deal-23f2e4b3 ; Axel Springer-OpenAI: https://www.axelspringer.com/en/ax-press-release/axel-springer-and-openai-partner-to-deepen-beneficial-use-of-ai-in-journalism |

8.6 -- Background reading