Update Dissertation.md

SINS 2026-06-02 15:44:10 +00:00
parent 33342a3c6e
commit 1200d7b2b8

@ -1,8 +1,8 @@
# When Being Polite Fails, Try Poison
The Church of Malware (CoM) does not condone the use or introduction of toxic substances onto any individual, human, or animal; however, AI is neither natural, nor human, nor actual intelligence.
## 1 -- The problem is well known (Historical Reference)
This section captures relevant historical and legal precedent set forth in studies or cases that are not related to CoM but are needed to understand the underlying issues with modern AI labs and the models they produce.
### 1.1 -- Wired v Perplexity
In June 2024, Wired's engineering team watched Perplexity fetch articles it had been explicitly told not to fetch. The site's robots.txt disallowed PerplexityBot. Perplexity's declared crawler honored it. Then requests kept arriving from an undeclared user agent on an AWS IP range, pulling the same URLs and surfacing them, near-verbatim, in Perplexity answers minutes later. When confronted, the company called it a third-party contractor problem.
@ -10,8 +10,8 @@ In June 2024, Wired's engineering team watched Perplexity fetch articles it had
### 1.2 -- iFixit v Anthropic
iFixit's CEO posted server logs showing Anthropic's ClaudeBot hitting his site close to a million times in twenty-four hours. Read the Docs disclosed five-figure monthly bandwidth bills driven almost entirely by AI scrapers. Wikimedia reported that roughly sixty-five percent of its most expensive traffic, the uncached, high-cost requests, was now coming from AI crawlers, against a human readership that had not meaningfully grown to account for such an impact. By early 2025, SourceHut's Drew DeVault was writing the same post every other month: "Please stop, we are a small team, we will go down."
### 1.3 -- Synopsis
None of those operators were asking for novel protections. They were asking for the existing ones to be honored, and those protections were being willfully ignored. Every polite mechanism the web has shipped in the last thirty years, such as robots.txt, ai.txt, IETF content-usage preferences, or even the "email us to opt out" forms, has been treated as advisory at best and as a target list at worst. The only language scrapers have demonstrably responded to is cost and corruption. This document will help the masses make their own contribution to save the planet by way of hack the planet.
## 2 -- The graveyard of good-faith mechanisms
Every one of these mechanisms assumes the scraper *wants* to comply. Six years of evidence say they don't.
@ -34,18 +34,18 @@ All of these are **voluntary**, **unsigned**, and **non-binding**. The IETF AI P
#### 2.2.a -- Effectiveness Summary
| Mechanism | Effectiveness Level | Details |
|------------------------------------|--------------------------|-------|
| **ai.txt** | Very Low | Similar to robots.txt; compliance remains minimal. |
| **TDM Reservation Protocol (TDMRep)** | Low | Intended to reserve rights for text/data, little to no enforcement. |
| **C2PA "Do Not Train" Flags** | Low to Moderate | Cryptographically signed metadata, easy to strip or ignore. |
| **IETF AI Preferences drafts** | Emerging / Low | No widespread enforcement or adoption by AI labs. |
### 2.3 -- Terms of Service (ToS)
In *hiQ v. LinkedIn*, the courts established that scraping public data is not a CFAA violation. The only remaining claim is breach of ToS, which is theoretically possible but practically ineffective against foundation labs due to standing, damages, jurisdiction, and cost barriers.
#### 2.3.a -- Effectiveness Summary
| Mechanism | Effectiveness Level | Details |
|----------------------------|---------------------|-------|
| **Terms of Service (ToS)** | Very Low | Not a CFAA violation. Hard to enforce; weak legal precedent. |
### 2.4 -- Opt out by Email
OpenAI announced Media Manager in May 2024 for creators to opt out of training, but it has still not launched as of 2026. Stability AI's pre-SD3 opt-out processed millions of requests, yet the company trained on older LAION and Common Crawl data that already contained the images.
@ -64,7 +64,7 @@ Labs that respect current techniques still train on old Common Crawl snapshots c
| **Common Crawl Laundering** | Very Low | Old snapshots persist in models, no expiration for aged datasets. |
## 3 -- Why being polite failed
The polite mechanisms aren't failing because we haven't iterated on them enough. They're failing because the incentives reward ignoring them and there is no enforcement layer underneath. The failure isn't a series of bad-faith actors; it's structural, and it resolves to three main reasons.
### 3.1 -- No enforcement
Crawlers face zero cost for ignoring opt-outs, while the producer bears all the costs (bandwidth, CPU, cache pollution, outages). Polite protocols fail due to cost offloading.
@ -108,7 +108,7 @@ Nightshade (from University of Chicago) presents images in CLIP-space to corrupt
| **Glaze** | Protects artistic style from AI mimicry | [glaze.cs.uchicago.edu](https://glaze.cs.uchicago.edu/) |
### 4.4 -- Active denial
More aggressive techniques include decompression bombs (tiny gzip files that expand to gigabytes), slow-loris connections that hold open requests for minutes, and deliberately malformed HTML designed to crash parsers. They can also include misleading comments in source code, misleading links, or prompt injection attempts. These are served conditionally based on user-agent. They can pose legal issues and are riskier than the previous methods; use with caution and seek professional legal advice.
#### 4.4.a -- Tooling
Most operators should start with a low-cost option such as Anubis or an equivalent in front of the origin and escalate only when the bot population adapts, as these steps are high-risk methods and may violate laws in some jurisdictions.
@ -119,8 +119,8 @@ Most operators should start at low cost such as Anubis or equivalent in front of
| **Slow Loris** | Holds HTTP connections open with minimal data to exhaust server resources | [Slowloris](https://github.com/gkbrk/slowloris) and nginx/lua variants |
| **Malformed HTML** | Intentionally broken markup to crash weak parsers in scrapers | Custom server-side logic (no standard open-source tool) |
## 5 -- The objections to the cause
This section plays devil's advocate to the concerns outlined previously; reader discretion is advised. These views may be skewed or biased by perspective or interpretation.
### 5.1 -- Poisoning is vandalism
Poisoning only affects unauthorized copies made by the scraper bots that train the AIs used around the world. Your original content on your server remains untouched.
@ -141,19 +141,19 @@ Some will try to limit and reduce wasted time/effort tradeoff for cost effective
Every filter and dataset cleaning pass is an expensive tax you place on the lab. Paid deals (Reddit, AP, FT) happened because scraping became too costly. Make it more expensive for the lab to maintain its status quo of profiting from your content without permission.
### 5.4 -- It's a cat-and-mouse game you can't win.
Correct, but irrelevant. Polite mechanisms aren't winning either; they're not even playing. The real choice is between imposing cost on scrapers or imposing zero cost. This is a choice the individual makes, and the power of control remains in the creator's hands, not the bot's or the AI labs'. It's better to be the cat in this game; choose your character.
#### 5.4.a -- Argument in reality
A losing game where the other side pays for every move is far better than one where the content creator pays for everything and the scrapers and AI creators pay nothing in return. The bot operators and AI labs profit from the creative works of the victim. Creators have every right to protect and control who has access to their content.
### 5.5 -- What about the legality
Passively serving garbage content (tarpits, poisoned data) to a requesting visitor that is not permitted or has ignored the rules (unauthorized access) is not illegal in itself. You're only returning what they asked for. The legal risk is the same as serving a slow page.
#### 5.5.a -- Argument in reality
Active attacks like decompression bombs or parser exploits are riskier; individuals should proceed with caution and seek legal counsel. Bright line: serve garbage, don't attack; it's on the bot and lab to filter and clean their source content, not you.
## 6 -- What can you do over this weekend
The following is a quick and easy-to-accomplish set of protections covering both commercial closed-source and community-driven open-source projects. Used together, these solutions add layers of defense in the fight against scraper bots used to train AI models. Some solutions listed are paid services; the Church of Malware (CoM) is not associated with, nor directly endorsing, these solutions, but documents their usefulness in the fight against AI bots and the AI labs' continued abuse.
### 6.1 -- Commercial (Free Tier) Solutions
@ -174,7 +174,7 @@ The following represents quick and easy to accomplish set of protections which c
### 6.3 -- Recommended Free Stack
- Start with **Cloudflare Free** (easiest/commercial)
- Integrate with self-hosted **Anubis** + **Nepenthes**
- Incorporate web server rate limiting (nginx)
#### 6.3.a -- NGINX rate limiting example
```
@ -185,7 +185,7 @@ The following represents quick and easy to accomplish set of protections which c
```
### 6.4 -- If you run a small site / business
Deploy Anubis (or an equivalent PoW wall) in front of your origin. It's a simple nginx reverse-proxy setup. Additionally, add a Nepenthes tarpit behind a `Disallow` path in robots.txt; it has versatile deployment options (bare metal or Docker). Monitor the logs afterwards; most sites see a sharp drop in bot traffic and bandwidth costs.
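
The idea of placing a tarpit behind a `Disallow` path can be sketched as follows: any client that requests a path robots.txt told it to avoid has, by definition, ignored the rules. This is only an illustration using Python's standard `urllib.robotparser`; the `/trap/` path, the sample robots.txt, and the helper names are assumptions, and Nepenthes's own configuration is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "/trap/" is an assumed location for the tarpit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def ignored_robots(parser: RobotFileParser, user_agent: str, path: str) -> bool:
    """True when the client fetched a path that robots.txt disallowed for it."""
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for path in ("/blog/post-1", "/trap/page-9"):
        print(path, ignored_robots(rules, "ExampleBot", path))
```

Running a check like this over access logs gives a rough list of user agents that reached the trap, which is one way to judge whether the tarpit is catching anything before escalating further.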
### 6.5 -- If you're a content creator
Assume every image will be ingested, and make ingestion expensive for AI bots but accessible to the human consumer. For text, use distinctive phrasing and deliberate canary strings as watermarks. Keep a private log of what you published and when. Proof of ingestion is essential for any future legal remedy in case your content has been consumed and regenerated by some model.
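
One possible way to produce canary strings and the private publication log is sketched below. The token format, the use of HMAC-SHA256, and the names `make_canary`, `log_record`, and `appears_in` are all assumptions for illustration; any scheme that yields a unique, hard-to-guess string per piece would serve the same purpose.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def make_canary(secret: bytes, title: str) -> str:
    """Derive a unique token per piece; without the secret it cannot be recomputed."""
    digest = hmac.new(secret, title.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"canary-{digest[:16]}"


def log_record(title: str, canary: str, published_at: datetime) -> str:
    """One JSON line for the private log: what was published, when, with which token."""
    return json.dumps(
        {"title": title, "canary": canary, "published_at": published_at.isoformat()},
        sort_keys=True,
    )


def appears_in(canary: str, model_output: str) -> bool:
    """Check whether a token resurfaces verbatim in text produced by a model."""
    return canary in model_output


if __name__ == "__main__":
    # The secret and title are placeholder values.
    token = make_canary(b"keep-this-private", "My Essay")
    print(log_record("My Essay", token, datetime.now(timezone.utc)))
```

Embedding the token somewhere unobtrusive in the published text, and keeping the log offline, gives you a dated record to compare against if the string later shows up in a model's output.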
@ -195,13 +195,13 @@ Assume every image will be ingested and make ingestion expensive for these AI bo
- Run **Nightshade** on images you want strong protection against fine-tuning.
### 6.6 -- If you're a publisher, university, or cultural institution
Stop negotiating licensing from weakness. The major deals signed in 2024–2025 (Reddit, AP, News Corp, Axel Springer) happened only after scraping became expensive or legally risky. Make your content expensive to scrape first, then use licensing as leverage. Labs won't pay for what they can take for free; don't willingly participate as the victim.
## 7 -- The reframe is the whole point
Opt-out is begging: asking powerful actors with no incentive to comply to *please* listen. Poisoning is bargaining; you impose a real cost they must either pay or work around. Polite mechanisms failed because they assumed good faith from actors whose entire business model depends on its absence. The next decade of the open web depends on operators realizing the bargaining power they've always had is sitting in their own server config, ready to be used.
## 8 -- References
The intention of this section is to capture the references consumed and paraphrased in order to produce this publication, and to aid the reader with additional information and resources useful for academic research and study of the topics discussed in this document.
### 8.1 -- Section 1: Documented incidents
| Section | Claim | Source |