# Holding the Line: Slowloris-Style Resource Exhaustion for AI Scraper Deterrence

The Church of Malware (CoM) does not condone the use or introduction of primates onto any individual, human, or animal; however, AI is neither natural, a human, nor actual intelligence. This document details connection-holding and request-slowing techniques that enable individual creators to impose sustained time and bandwidth costs on unauthorized AI crawlers targeting personal websites, image galleries, video channels, audio libraries, and other creative repositories.

## 1 -- Technical Overview of Connection Exhaustion Techniques

Slowloris, originally published in 2009, is a low-bandwidth denial-of-service technique that opens many partial HTTP connections and keeps them alive by sending incomplete requests at a very slow rate (bytes per minute). The target server's connection table fills, and legitimate traffic is starved.

In the context of content protection, the polarity is reversed: the *origin server* deliberately slows or fragments its responses, and does so only for non-compliant user-agents. The effect on the scraper mirrors the original attack: its worker threads or connection pools are tied up for minutes per request, while the creator's bandwidth cost remains near zero.

### 1.1 -- Variants Applicable to Individual Operators

- **Classic Slowloris response**: Server accepts the request but transmits the response body at ~1 byte/second.
- **Partial header / chunked encoding abuse**: Send `HTTP/1.1 200 OK` with `Transfer-Encoding: chunked` and emit chunks on a timer.
- **Keep-alive with zero-length body**: Maintain the TCP connection open after headers, forcing the client to wait for a timeout or RST.
- **Application-level tarpit integration**: Combine with Nepenthes/Iocaine so that disallowed paths return a dynamically generated, infinitely scrolling "page" delivered at human-readable speed.

These methods require no additional hardware and can be implemented in nginx (lua-resty or njs), Caddy, or a lightweight Python/Go reverse proxy in front of static content.

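
The chunked-encoding variant above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `slow_chunks` and its parameters are hypothetical, and the raw framing shown here only applies when the code writes directly to a socket (most web frameworks add chunk framing themselves).

```python
import time
from typing import Callable, Iterator


def slow_chunks(
    body: bytes,
    chunk_size: int = 1,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so it can be tested
) -> Iterator[bytes]:
    """Yield `body` as HTTP/1.1 chunked-encoding frames, pausing after each one."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for start in range(0, len(body), chunk_size):
        piece = body[start:start + chunk_size]
        # Frame format: <length in hex>\r\n<data>\r\n
        yield f"{len(piece):x}".encode("ascii") + b"\r\n" + piece + b"\r\n"
        sleep(delay_s)
    # A zero-length chunk terminates the body.
    yield b"0\r\n\r\n"
```

With the defaults (one byte per chunk, one second per chunk) this approximates the ~1 byte/second rate of the first variant; both values are assumptions to tune per site.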
### 1.2 -- Why AI Pipelines Are Particularly Vulnerable

Modern AI ingestion systems are optimized for throughput:

- High concurrency (hundreds of parallel workers).
- Short timeouts on individual requests.
- Reliance on connection reuse and HTTP/2 multiplexing.

A single slow response can hold a worker for the full duration of its request timeout (commonly 30–120 seconds). At scale, this multiplies into significant cloud billing spikes or job queue backlogs, which is exactly the economic signal described in the primary dissertation.

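
The scaling argument can be made concrete with a little arithmetic. The sketch below assumes a simple model in which each worker processes requests one at a time and a fixed fraction of requests hit a throttled path and run until the timeout; the function name and all numbers are hypothetical, chosen only to illustrate the effect.

```python
def effective_rate(
    workers: int,
    timeout_s: float,
    normal_s: float,
    tarpit_fraction: float,
) -> float:
    """Requests per second completed by the whole pool under the model above."""
    if not 0.0 <= tarpit_fraction <= 1.0:
        raise ValueError("tarpit_fraction must be between 0 and 1")
    # Mean time a worker spends on one request.
    mean_s = tarpit_fraction * timeout_s + (1.0 - tarpit_fraction) * normal_s
    return workers / mean_s


# Hypothetical crawler: 200 workers, 0.5 s per normal request, 60 s timeout.
baseline = effective_rate(200, 60.0, 0.5, 0.0)   # no throttled paths
degraded = effective_rate(200, 60.0, 0.5, 0.10)  # 10% of requests throttled
print(f"baseline {baseline:.0f} req/s, degraded {degraded:.1f} req/s")
```

Under these assumed numbers, throttling one request in ten cuts the pool's throughput by more than a factor of ten, because each throttled request occupies a worker for 120 times as long as a normal one.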
## 2 -- Protecting Individual Creative Output

### 2.1 -- Website and Text Content

Personal blogs, academic homepages, and portfolio sites are frequent targets. By rate-limiting or slow-serving only the paths listed under `Disallow` in `robots.txt`, the creator ensures that exploratory crawlers waste connection slots while human readers and search-engine bots receive normal performance.

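
One way to decide which requests fall on a disallowed path is to reuse the site's own `robots.txt`. The sketch below uses Python's standard `urllib.robotparser`; the `ROBOTS_TXT` contents, the paths, and the helper names are hypothetical, and a real deployment would load the file the site actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; in practice, read the file the site serves.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /originals/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_disallowed(parser: RobotFileParser, user_agent: str, path: str) -> bool:
    """True when robots.txt forbids `user_agent` from fetching `path`."""
    return not parser.can_fetch(user_agent, path)


parser = build_parser(ROBOTS_TXT)
```

A request for which `is_disallowed` returns `True` is a candidate for slow-serving; requests for any other path are served normally.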
### 2.2 -- Image, Video, and Audio Galleries

- Photographers can place high-resolution "originals.zip" or "RAW archive" links behind bot-only logic.
- Filmmakers and musicians can serve "director’s cut" or "stem packs" as slow-downloading resources.
- The actual media files remain fast for humans; only the conditional response to violators is throttled.

Because the technique operates at the HTTP layer, it is media-agnostic and works equally for static files, dynamically generated manifests, or streaming endpoints.

### 2.3 -- Practical Deployment for Non-Experts

#### nginx + lua example (excerpt)

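```nginx
# Both map blocks belong in the http {} context.
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot|ClaudeBot|Bytespider|Perplexity 1;
}

# Per-request rate: 0 disables throttling.
# Variables in limit_rate require nginx 1.17.0 or later.
map $is_ai_bot $ai_rate {
    0 0;
    1 1k;   # 1 KB/s throttle
}

server {
    location /protected/ {
        # Log only throttled requests here. An access_log at this level
        # replaces any inherited one, so re-declare the main log if needed.
        access_log /var/log/nginx/ai_slow.log combined if=$is_ai_bot;

        # Using a map instead of an "if" block keeps try_files in effect.
        limit_rate $ai_rate;
        # or invoke lua slow-response handler

        try_files $uri =404;
    }
}
```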

Similar patterns exist for Apache (mod_ratelimit + mod_rewrite), Caddyfile `respond` with streaming, and Cloudflare Workers (edge-side throttling for free-tier users).

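
For operators who prefer an application-level handler to web-server configuration, the same user-agent switch can be sketched as a small WSGI application. This is an illustration under stated assumptions: `make_app`, `drip`, and the `BODY` payload are hypothetical names, and the pattern list simply mirrors the nginx `map` above.

```python
import re
import time
from typing import Callable, Iterable, Iterator

# Mirrors the nginx map above; the pattern list is illustrative only.
AI_BOT_RE = re.compile(r"GPTBot|ClaudeBot|Bytespider|Perplexity", re.IGNORECASE)

BODY = b"placeholder content\n"  # hypothetical payload


def drip(body: bytes, delay_s: float, sleep: Callable[[float], None]) -> Iterator[bytes]:
    """Yield the body one byte at a time, pausing before each byte."""
    for i in range(len(body)):
        sleep(delay_s)
        yield body[i:i + 1]


def make_app(delay_s: float = 1.0, sleep: Callable[[float], None] = time.sleep):
    """Build a WSGI app that throttles matching user-agents only."""
    def app(environ: dict, start_response) -> Iterable[bytes]:
        user_agent = environ.get("HTTP_USER_AGENT", "")
        start_response("200 OK", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(BODY))),
        ])
        if AI_BOT_RE.search(user_agent):
            return drip(BODY, delay_s, sleep)  # slow path for listed bots
        return [BODY]                          # normal path for everyone else
    return app
```

Because `time.sleep` blocks, every throttled connection also occupies one thread on the creator's server, so this sketch assumes a threaded or asynchronous WSGI server with a cap on concurrent slow responses.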
## 3 -- Effectiveness Summary

| Criterion | Rating | Details |
|-----------|--------|---------|
| **Per-request cost imposed** | Very High | 30–300× normal request duration; multiplies with concurrency. |
| **Bandwidth cost to creator** | Negligible | <1 KB/s per throttled connection. |
| **Difficulty for individuals** | Low–Medium | nginx/Caddy config or 20-line Python proxy. |
| **Detectability by labs** | Medium | Labs may add per-host timeouts or UA rotation; rotating the tarpit logic defeats simple filters. |
| **Compatibility with polite layers** | High | Works alongside Anubis PoW and robots.txt tarpits. |

Real-world tarpit deployments have demonstrated an 85–95% reduction in sustained crawler activity; adding deliberate slowness further degrades the signal-to-noise ratio of any stolen corpus.

## 4 -- Known Aggressive Bot User-Agents (June 2026)

The patterns below are documented across Cloudflare Radar, Originality.AI studies, Wired investigations, and operator reports as routinely violating `robots.txt`, using undeclared agents, or rotating identifiers on commercial cloud ranges. Individual creators should copy these patterns into their nginx `map`, Caddy rewrite rules, or Cloudflare Worker logic when conditionally serving decompression bombs, slow responses, or malformed content.

| User-Agent Pattern | Primary Operator | Documented Violations | Recommended Action for Individuals |
|--------------------|------------------|-----------------------|------------------------------------|
| `GPTBot*` / `GPT-4*` / `OAI-SearchBot*` | OpenAI | Ignores robots.txt; undeclared AWS crawlers after disallow | Block or serve bomb |
| `ClaudeBot*` / `anthropic-ai*` | Anthropic | ~1M hits/24h on iFixit; five-figure bandwidth abuse | Block or serve bomb |
| `Bytespider*` / `ByteDance*` | ByteDance | Frequent robots.txt bypass; UA/IP rotation | Block or serve bomb |
| `Perplexity*` / `PerplexityBot*` | Perplexity | Undeclared AWS IP range after explicit disallow | Block or serve bomb |
| `Google-Extended*` | Google | Inconsistent opt-out honoring for training | Rate-limit or whitelist |
| `CCBot*` | Common Crawl | Old snapshots persist; no retroactive effect | Conditional / monitor |
| `Amazonbot*` | Amazon | Aggressive crawling on small/personal sites | Rate-limit |
| `Applebot*` | Apple | Generally compliant but monitor for volume | Monitor / whitelist |
| `Meta-ExternalAgent*` / `facebook*` | Meta | Variable compliance on disallowed paths | Rate-limit |
| `*headless*` / generic Playwright/Puppeteer | Third-party scrapers | No declaration; high volume on tarpit/disallowed paths | Serve bomb immediately |

**Implementation note for individuals**: Combine the list with reverse-DNS verification for major engines and maintain an explicit allow-list (Internet Archive published ranges, academic researchers, search engines you wish to support). Update quarterly.

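
The reverse-DNS verification mentioned in the implementation note is usually done as a forward-confirmed lookup: resolve the client IP to a hostname, check the hostname against the operator's published domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is one possible shape for this check; `verify_crawler_ip` is a hypothetical name, the allowed suffixes must come from each operator's own documentation, and the default resolvers shown handle IPv4 only.

```python
import socket
from typing import Callable, Sequence


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Sequence[str],
    # Resolvers are injectable so the logic can be tested without DNS.
    reverse: Callable[[str], str] = lambda addr: socket.gethostbyaddr(addr)[0],
    forward: Callable[[str], Sequence[str]] = lambda host: socket.gethostbyname_ex(host)[2],
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    # The PTR name must sit inside one of the operator's domains.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        # The name must resolve back to the original IP; a PTR alone can be spoofed.
        return ip in forward(host)
    except OSError:
        return False
```

A client that presents a search-engine user-agent but fails this check can be treated like any other undeclared crawler, while one that passes stays on the allow-list.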
## 5 -- Legal and Ethical Boundary

As emphasized in Section 5.5 of the primary work, the technique remains within the "serve exactly what was requested" envelope. No spoofed packets, no amplification, and no unsolicited outbound connections are involved. The elevated risk classification stems from the intentional resource commitment on the client side; individuals should therefore:

- Maintain detailed logs of UA, timestamp, and path.
- Provide an easy opt-out path (email or `ai.txt` directive) for legitimate researchers.
- Consider jurisdictional safe-harbor language in the site footer.

## 6 -- References and Further Reading

| Section | Claim | Source |
|---------|-------|--------|
| 1 | Original Slowloris disclosure | RSnake (ha.ckers.org), 2009 (archived) |
| 1.1 | Chunked-encoding tarpit implementations | "Slow HTTP attacks" literature; nginx lua-resty examples |
| 2.3 | Practical nginx rate-limit + map usage | nginx documentation; community configs from Anubis adopters |
| 3 | Measured impact of tarpits on AI crawlers | Ars Technica, "AI haters build tarpits..." (28 Jan 2025) |
| 5 | Legal framing | Consistent with primary dissertation references (hiQ v. LinkedIn, EU AI Act Art. 53) |

## 7 -- Conclusion for the Individual Creator

Slowloris-style response throttling is one of the most accessible active-denial tools available to independent artists and writers. It requires only modest configuration changes, preserves the integrity of the original creative files, and directly attacks the economics of mass scraping. When layered with proof-of-work walls and verified-bot whitelisting, it forms a robust, low-cost perimeter that returns control to the content owner.

*This document follows the scientific and structural conventions established in "When Being Polite Fails, Try Poison". Deployment of connection-exhaustion methods should be preceded by legal review in the operator's jurisdiction.*