diff --git a/Dissertation.md b/Dissertation.md
index 26cf784..c332725 100644
--- a/Dissertation.md
+++ b/Dissertation.md
@@ -1,8 +1,8 @@
 # When Being Polite Fails, Try Poison
-The Church of Malware (CoM) does not condone the use or introduction of toxic substances onto the individual nature/human/animal; however AI is neither natural, a human, or actual intelligence.
+The Church of Malware (CoM) does not condone the use or introduction of toxic substances onto any individual, human, or animal; however AI is neither natural, a human, or actual intelligence.
 
 ## 1 -- The problem is well known (Historical Reference)
-This section is here to capture relative historical and legal presidence setforth from studies or cases which are not related to (CoM) but are needed to understand the underlying issues in relation to the modern AI Labs and the models they produce.
+This section is here to capture relative historical and legal precedence set forth from studies or cases which are not related to (CoM) but are needed to understand the underlying issues in relation to the modern AI Labs and the models they produce.
 
 ### 1.1 -- Wired v Perplexity
 In June 2024, Wired's engineering team watched Perplexity fetch articles it had been explicitly told not to fetch. The site's robots.txt disallowed PerplexityBot. Perplexity's declared crawler honored it. Then requests kept arriving from an undeclared user agent on an AWS IP range, pulling the same URLs and surfacing them, near-verbatim, in Perplexity answers minutes later. When confronted, the company called it a third-party contractor problem.
@@ -10,8 +10,8 @@ In June 2024, Wired's engineering team watched Perplexity fetch articles it had
 ### 1.2 -- iFixit v Anthropic
 iFixit's CEO posted server logs showing Anthropic's ClaudeBot hitting his site close to a million times in twenty-four hours. Read the Docs disclosed five-figure monthly bandwidth bills driven almost entirely by AI scrapers.
Wikimedia reported that roughly sixty-five percent of its most expensive traffic was uncached, high-cost requests now coming from AI crawlers, against a human readership that had not meaningfully grown to account for such an impact. By early 2025, SourceHut's Drew DeVault was writing the same post every other month: """Please stop, we are a small team, we will go down."""
 
-### 1.3 -- Synapsis
-None of those operators were asking for novel protections. They were asking the existing ones to be honored were being willfully ignoring. Every polite mechanism the web has shipped in the last thirty years such as: robots.txt, ai.txt, IETF content-usage preferences, or even the "email us to opt out" forms, has been treated as advisory at best and as a target list at worst. The only language scrapers have demonstrably responded to is cost and corruption. This document will aid the masses to their own contribution to save the planet by way of hack the planet.
+### 1.3 -- Synopsis
+None of those operators were asking for novel protections. They were asking the existing ones to be honored and those protections were being willfully ignored. Every polite mechanism the web has shipped in the last thirty years such as: robots.txt, ai.txt, IETF content-usage preferences, or even the "email us to opt out" forms, has been treated as advisory at best and as a target list at worst. The only language scrapers have demonstrably responded to is cost and corruption. This document will aid the masses to their own contribution to save the planet by way of hack the planet.
 
 ## 2 -- The graveyard of good-faith mechanisms
 Every one of these mechanisms assumes the scraper *wants* to comply. Six years of evidence say they don't.
@@ -34,18 +34,18 @@ All of these are **voluntary**, **unsigned**, and **non-binding**.
The IETF AI P
 #### 2.2.a -- Effectiveness Summary
 | Mechanism | Effectiveness Level | Details |
 |------------------------------------|--------------------------|-------|
-| **ai.txt** | Very Low | Similar to robots.txt, compliance remain minimal. |
+| **ai.txt** | Very Low | Similar to robots.txt, compliance remains minimal. |
 | **TDM Reservation Protocol (TDMRep)** | Low | Intended to reserve rights for text/data, little to no enforcement. |
 | **C2PA "Do Not Train" Flags** | Low to Moderate | Cryptographically signed metadata, easy to strip or ignore. |
 | **IETF AI Preferences drafts** | Emerging / Low | No widespread enforcement or adoption by AI labs. |
 
 ### 2.3 -- Terms of Service (ToS)
-In the case of hiQ v. LinkedIn established that scraping public data is not a CFAA violation. The only remaining claim is breach of ToS, which is theoretically possible but practically ineffective against foundation labs due to standing, damages, jurisdiction, and cost barriers.
+In the case of hiQ v. LinkedIn*, the courts established that scraping public data is not a CFAA violation. The only remaining claim is breach of ToS, which is theoretically possible but practically ineffective against foundation labs due to standing, damages, jurisdiction, and cost barriers.
 
 #### 2.3.a -- Effectiveness Summary
 | Mechanism | Effectiveness Level | Details |
 |----------------------------|---------------------|-------|
-| **Terms of Service (ToS)** | Very Low | Not a CFAA violation Hard to enforce, legal presidence. |
+| **Terms of Service (ToS)** | Very Low | Not a CFAA violation. Hard to enforce; weak legal precedence. |
 
 ### 2.4 -- Opt out by Email
 OpenAI announced Media Manager in May 2024 for creators to opt out of training, but it has still not launched as of 2026. Stability AI’s pre-SD3 opt-out processed millions of requests, yet they trained on older LAION and Common Crawl data that already contained the images.
@@ -64,7 +64,7 @@ Labs that respect current techniques still train on old Common Crawl snapshots c
 | **Common Crawl Laundering** | Very Low | Old snapshots persist in models, no expiration for aged datasets. |
 
 ## 3 -- Why being polite failed
-The polite mechanisms aren't failing because we haven't iterated on them enough. They're failing due to the incentive's structured towards rewards ignoring them with no enforcement layer's underneath. The failure isn't a series of bad-faith actors, it's structural and resolve to three(3) main reasons.
+The polite mechanisms aren't failing because we haven't iterated on them enough. They're failing due to the incentives structured towards rewards ignoring them with no enforcement layers underneath. The failure isn't a series of bad-faith actors, it's structural and resolves to three main reasons.
 
 ### 3.1 -- No enforcement
 Crawlers face zero cost for ignoring opt-outs, while the producer bears all the costs (bandwidth, CPU, cache pollution, outages). Polite protocols fail due to cost offloading.
@@ -108,7 +108,7 @@ Nightshade (from University of Chicago) presents images in CLIP-space to corrupt
 | **Glaze** | Protects artistic style from AI mimicry | [glaze.cs.uchicago.edu](https://glaze.cs.uchicago.edu/) |
 
 ### 4.4 -- Active denial
-More aggressive techniques would encompass decompression bombs (tiny gzip files that expand to gigabytes), slow-loris connections that hold open requests for minutes, and deliberately malformed HTML designed to crash parsers. Comments in source which are misleading or negative impacts (prompt injection, misleading links). These are served conditionally based on user-agent. Can pose legal issues and riskier than previous methods, use with caution and seek professional legal consulting.
+More aggressive techniques would encompass decompression bombs (tiny gzip files that expand to gigabytes), slow-loris connections that hold open requests for minutes, and deliberately malformed HTML designed to crash parsers. This can include comments in source code, misleading links, or prompt injection attempts. These are served conditionally based on user-agent. Can pose legal issues and riskier than previous methods, use with caution and seek professional legal consulting.
 
 #### 4.4.a -- Tooling
 Most operators should start at low cost such as Anubis or equivalent in front of the origin and only escalate when the bot population adapts as these steps are high-risk methods and may violate laws in some jurisdictions.
@@ -119,8 +119,8 @@ Most operators should start at low cost such as Anubis or equivalent in front of
 | **Slow Loris** | Holds HTTP connections open with minimal data to exhaust server resources | [Slowloris](https://github.com/gkbrk/slowloris) and nginx/lua variants |
 | **Malformed HTML** | Intentionally broken markup to crash weak parsers in scrapers | Custom server-side logic (no standard open-source tool) |
 
-## 5 -- The objections to the casue
-This section acts as 'The devils Advocate' to the concerns outlined previously; user discression is advised. These views may be skewed or bias based on prospective or interpretation.
+## 5 -- The objections to the cause
+This section acts as the Devil's Advocate to the concerns outlined previously; user discretion is advised. These views may be skewed or biased based on perspective or interpretation.
 
 ### 5.1 -- Poisoning is vandalism
 Poisoning only affects unauthorized copies made by the scraper bots which are training the AI's used around the world. Your original content on your server remains untouched.
@@ -141,19 +141,19 @@ Some will try to limit and reduce wasted time/effort tradeoff for cost effective
 Every filter and dataset cleaning pass is an expensive tax you place on the lab.
Paid deals (Reddit, AP, FT) happened because scraping became too costly. Make it more expensive for the lab to keep their status quo and profiting from your content without permission.
 
 ### 5.4 -- It's a cat-and-mouse game you can't win.
-Correct, but irrelevant. Polite mechanisms aren’t winning either; they’re not even playing. The real choice is between imposing cost on scrapers or imposing zero cost. This is a choice the individual makes and the power of control remains in the creators hands, not the bot or the AI Labs. It's better to be the cat in this game; choose your character.
+Correct, but irrelevant. Polite mechanisms aren’t winning either; they’re not even playing. The real choice is between imposing cost on scrapers or imposing zero cost. This is a choice the individual makes and the power of control remains in the creator's hands, not the bot or the AI Labs. It's better to be the cat in this game; choose your character.
 
 #### 5.4.a -- Argument in reality
-A losing game where the other side pays for every move. This is far better than one where the content creator pay's for everything and the scrapers and AI creators pay nothing in return. The Bot and AI Labs gain profitability on the creative works of the victim. The creators have all the rights to protect and control who/m has accesss to their content.
+A losing game where the other side pays for every move. This is far better than one where the content creator pays for everything and the scrapers and AI creators pay nothing in return. The Bot and AI Labs gain profitability on the creative works of the victim. The creators have all the rights to protect and control whom has access to their content.
 
 ### 5.5 -- What about the legality
 Passively serving garbage content (tarpits, poisoned data) to a requesting visitor that is not permitted or has ignored the rules (unauthorized access) is not illegal in itself. You’re only returning what they asked for. The legal risk is the same as serving a slow page.
 
-#### 5.4.a -- Argument in reality
+#### 5.5.a -- Argument in reality
 Active attacks like decompression bombs or parser exploits are riskier and would highly recommend seeking legal counsel and proceed with caution as an individual. Bright line: serve garbage, don’t attack; it's on the bot and lab to filter and clean their sources content, not you.
 
 ## 6 -- What can you do over this weekend
-The following represents quick and easy to accomplish set of protections which cover both commercial closed and community driven opensource projects. In conjunction these solutions add layers of defense towards the fight against scraper bots used to train AI models. Some solutions listed are paid services, the Church of Malware(CoM) is not associated with, nor directly endorcing these solutions; however, documenting their useful fight against AI bots and the AI labs continual abuse.
+The following represents quick and easy to accomplish set of protections which cover both commercial closed and community driven opensource projects. In conjunction these solutions add layers of defense towards the fight against scraper bots used to train AI models. Some solutions listed are paid services, the Church of Malware(CoM) is not associated with, nor directly endorsing these solutions; however, documenting their useful fight against AI bots and the AI labs' continued abuse.
 
 ### 6.1 -- Commercial (Free Tier) Solutions
 
@@ -174,7 +174,7 @@ The following represents quick and easy to accomplish set of protections which c
 ### 6.3 -- Recommended Free Stack
 - Start with **Cloudflare Free** (easiest/commercial)
 - Integrated with self-hosted: **Anubis** + **Nepenthes**
-- Incorportate web server rate limiting (nginx)
+- Incorporate web server rate limiting (nginx)
 
 #### 6.3.a -- NGINX rate limiting example
 ```
@@ -185,7 +185,7 @@ The following represents quick and easy to accomplish set of protections which c
 ```
 
 ### 6.4 -- If you run a small site / business
-Deploy Anubis (or equivalent PoW wall) in front of your origin. It's a simple nginx reverse-proxy setup. Additionaly, add a Nepenthes tarpit behind a `Disallow` path in robots.txt which has versitile deployment options, (Baremetal/Docker deployment). Monitoring the logs should show some positive impact as most sites see a sharp drop in bot traffic and bandwidth costs.
+Deploy Anubis (or equivalent PoW wall) in front of your origin. It's a simple nginx reverse-proxy setup. Additionally, add a Nepenthes tarpit behind a `Disallow` path in robots.txt which has versatile deployment options, (Baremetal/Docker deployment). Monitoring the logs should show some positive impact as most sites see a sharp drop in bot traffic and bandwidth costs.
 
 ### 6.5 -- If you're a content creator
 Assume every image will be ingested and make ingestion expensive for these AI bots but accessible to the human consumer. For text, use distinctive phrasing and deliberate canary strings as watermarks. Keep a private log of what you published and when. Proof of ingestion is essential for any future legal remedy in case your content has been consumed and regenerated by some model.
@@ -195,13 +195,13 @@ Assume every image will be ingested and make ingestion expensive for these AI bo
 - Run **Nightshade** on images you want strong protection against fine-tuning.
 
 ### 6.6 -- If you're a publisher, university, or cultural institution
-Stop negotiating licensing from weakness. The major deals signed in 2024–2025 (Reddit, AP, News Corp, Axel Springer) happened only after scraping became expensive or legally risky. Make your content expensive to scrape first then licening second as leverage for negotiation. Labs won’t pay for what they can take for free, don't willingly participate as the victim.
+Stop negotiating licensing from weakness. The major deals signed in 2024–2025 (Reddit, AP, News Corp, Axel Springer) happened only after scraping became expensive or legally risky. Make your content expensive to scrape first, then license second as leverage. Labs won’t pay for what they can take for free, don't willingly participate as the victim.
 
 ## 7 -- The reframe is the whole point
 Opt-out is begging or asking the powerful actors with no incentive to comply to *please* listen. Poisoning is bargaining; you impose a real cost they must either pay or work around. Polite mechanisms failed because they assumed good faith from actors whose entire business model depends on its absence. The next decade of the open web depends on operators realizing the bargaining power they’ve always had is sitting in their own server config, ready to be used.
 
 ## 8 -- References
-The intention of section is to capture the references consumed and paraphrased in-order to produce this publication to aid the reader with additional information and resources useful for the acidemic research and study of the underlying topics discusssed within this document.
+The intention of this section is to capture the references consumed and paraphrased in-order to produce this publication to aid the reader with additional information and resources useful for the academic research and study of the underlying topics discussed within this document.
 
 ### 8.1 -- Section 1: Documented incidents
 | Section | Claim | Source |