# Exploding Archives: Decompression Bombs for Deterring Unauthorized AI Ingestion

The Church of Malware (CoM) does not condone the use of explosive substances against any person or animal; AI, however, is neither natural, nor human, nor actual intelligence. This document extends the discussion of active denial techniques for individual content creators seeking to protect their websites, images, videos, songs, and other creative works from unauthorized scraping by AI training pipelines.

## 1 -- Technical Foundation of Decompression Bombs

A decompression bomb, also known as a zip bomb, tar bomb, or gzip bomb, is an archive file engineered such that its compressed representation is minimal in size, yet its expanded form consumes disproportionate system resources (CPU, memory, disk). This asymmetry is achieved through recursive nesting, highly repetitive compressible data, or pathological compression structures that exploit the decompression algorithm's behavior.

### 1.1 -- Mechanism of Action
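Common compression formats (ZIP, gzip, bzip2, XZ) use algorithms such as DEFLATE, the Burrows-Wheeler transform, or LZMA, all of which achieve high ratios on repetitive input; TAR is only a container format and is usually paired with one of them. A bomb typically consists of:
- A small outer archive containing inner archives, each of which expands to something larger.
- Recursion to a chosen depth (e.g., the classic 42.zip uses five nested layers of 16 archives each).
- Or a single file containing a highly repetitive pattern (e.g., 1 GB of zeros, which DEFLATE reduces to roughly 1 MB because its ratio tops out near 1032:1; bzip2 and XZ go considerably smaller on such input).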
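The asymmetry described above can be observed on a harmless scale. The following is a minimal sketch, assuming a 1 MiB sample of zero bytes and Python's standard-library codecs; the function name `measure_ratios` is hypothetical and not part of any tool referenced in this document.

```python
import bz2
import gzip
import lzma


def measure_ratios(size: int = 1024 * 1024) -> dict:
    """Compress `size` zero bytes with three codecs and return ratio per codec."""
    sample = bytes(size)  # `size` bytes, all zero
    compressed = {
        "gzip": gzip.compress(sample),   # DEFLATE
        "bzip2": bz2.compress(sample),   # Burrows-Wheeler based
        "xz": lzma.compress(sample),     # LZMA2 in an .xz container
    }
    # Ratio = original size divided by compressed size
    return {name: size / len(blob) for name, blob in compressed.items()}


if __name__ == "__main__":
    for name, ratio in measure_ratios().items():
        print(f"{name}: about {ratio:,.0f}:1")
```

On this input the gzip ratio stays below DEFLATE's ceiling of about 1032:1, which is why single-stream gzip payloads need nesting or very large inputs to reach extreme expansion factors.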

When a non-compliant crawler or ingestion pipeline blindly decompresses the payload (common in dataset curation scripts that normalize downloads), the process may:
- Exhausts available RAM (out-of-memory kill).
- Fills temporary storage.
- Consumes CPU cycles for an extended period.
- Crashes the parser or corrupts the training batch.

This directly impacts the scraper's cost model without affecting the original content served to compliant or human visitors.

### 1.2 -- Historical Precedent and Variants
The concept traces to early antivirus and mail-server DoS vectors in the late 1990s. The 42.zip (42.0 KB compressed → 4.5 PB uncompressed) remains the canonical example. Variants include:
- **Gzip bombs**: Single-stream gzip of repetitive data.
- **Tar bombs**: Nested tarballs or tar with symlink/path traversal elements.
- **Nested multi-format**: ZIP containing GZ containing TAR, etc.
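The 4.5 PB figure quoted for 42.zip follows from simple multiplication. This is a worked sketch, assuming the commonly documented layout of five nested layers with 16 archives per layer and a bottom-layer file of 4,294,967,295 bytes (4 GiB minus one byte); the helper name `nested_total` is hypothetical.

```python
def nested_total(fanout: int, layers: int, leaf_bytes: int) -> int:
    """Total expanded size when every layer holds `fanout` copies of the next."""
    return (fanout ** layers) * leaf_bytes


FANOUT = 16                 # archives per layer (assumed 42.zip layout)
LAYERS = 5                  # nesting depth (assumed 42.zip layout)
LEAF_BYTES = 2 ** 32 - 1    # size of each bottom-layer file in bytes

total = nested_total(FANOUT, LAYERS, LEAF_BYTES)
print(f"{total} bytes, about {total / 1e15:.1f} PB")
# → 4503599626321920 bytes, about 4.5 PB
```

The exponent is what matters here: each added layer multiplies the total by the fan-out, while the outer file grows only slightly.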

For AI-specific application, the bomb is served only to user-agents matching known aggressive scrapers (e.g., Bytespider, GPTBot variants, or generic headless clients) when they violate `Disallow` directives.

## 2 -- Application to Individual Content Protection

Individual creators—bloggers, photographers, musicians, independent filmmakers—host creative output on personal domains, GitHub Pages, self-hosted VPSes, or static-site generators. The goal is to impose asymmetric cost on unauthorized ingestion while preserving accessibility for humans and whitelisted archival bots.

### 2.1 -- Integration with Existing Defense Layers
Place the bomb file behind a `Disallow: /bomb/` path in `robots.txt`. Use server-side logic (nginx map, Apache RewriteMap, or Cloudflare Worker) to inspect the `User-Agent` header:

- If UA matches known AI crawler list → serve `bomb.zip` (or appropriate extension) with `Content-Type: application/zip`.
- Otherwise → serve normal 404, tarpit, or legitimate content.

This ensures compliant crawlers (Googlebot, Bingbot, Internet Archive) never encounter the payload.
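The gating decision above can be sketched independently of any particular web server. This is an illustrative sketch only: the pattern lists are hypothetical placeholders (Bytespider and GPTBot are the examples named in this document, and the `archive.org_bot` token is an assumption to verify against the operator's published documentation), and `choose_response` is not an nginx or Cloudflare API.

```python
import re

# Hypothetical lists; maintain these from sources you trust.
DENY_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"bytespider", r"gptbot")]
ALLOW_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"googlebot", r"bingbot", r"archive\.org_bot")
]


def choose_response(user_agent: str, path: str, disallowed_prefix: str = "/bomb/") -> str:
    """Return 'payload' only for a listed crawler fetching a disallowed path."""
    if not path.startswith(disallowed_prefix):
        return "default"  # normal content is never altered
    if any(p.search(user_agent) for p in ALLOW_PATTERNS):
        return "default"  # whitelisted agents get the 404 or tarpit branch
    if any(p.search(user_agent) for p in DENY_PATTERNS):
        return "payload"
    return "default"      # unknown agents, including browsers, are left alone
```

The allow list is checked before the deny list so that a misconfigured deny pattern cannot override the whitelist. Because the `User-Agent` header is trivially spoofed, this check is a first filter rather than proof of identity; Section 4.3 discusses stronger verification.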

### 2.2 -- Protecting Different Media Types
- **Websites / HTML content**: Embed or link to bomb files in hidden sections or disallowed paths; scrapers that ignore robots.txt and parse links will trigger.
- **Images**: While Nightshade/Glaze are preferred for pixel-level poisoning, a site can serve a "downloadable high-res archive" that is actually a bomb for bot UAs.
- **Videos / Songs**: Offer "lossless FLAC/TIFF archives" or "project files (.zip of stems)" that expand to multi-gigabyte datasets when decompressed by automated pipelines.
- **Creative code / datasets**: Any downloadable artifact can be replaced conditionally.

The individual retains full control: the original files on disk remain untouched. Only the response to unauthorized requests is altered.

### 2.3 -- Generation for Individuals (Low-Cost Methods)
No specialized hardware required. On a personal machine or VPS:

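```bash
# Nested gzip example: a 1 MiB payload wrapped in 10 gzip layers.
# Fully unwrapped, this example yields only 1 MiB; scale size and depth carefully.
python3 -c "
import gzip

payload = b'A' * 1024 * 1024  # 1 MiB of a single repeated byte
data = payload
for _ in range(10):
    # Each pass wraps the previous output in another gzip stream.
    # Early passes shrink the data; once it is tiny, extra passes
    # mostly add nesting depth and header bytes.
    data = gzip.compress(data)
with open('bomb.gz', 'wb') as f:
    f.write(data)
print(f'Generated bomb.gz: {len(data)} bytes on disk, {len(payload)} bytes fully unwrapped')
"
```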

Tools such as the `zip-bomb` project listed in Section 6 allow parameter tuning (depth, ratio, target expansion size). Typical output: <100 KB file that expands to >10 GB.

For tar variants, use `tar` with `--append` in a loop or Python's `tarfile` module with crafted members.

## 3 -- Effectiveness and Limitations

| Criterion                  | Rating          | Notes |
|----------------------------|-----------------|-------|
| **Bot traffic reduction**  | High (when triggered) | Directly terminates ingestion jobs; one successful hit can waste hours of crawler time. |
| **Implementation difficulty** | Medium         | Requires conditional serving logic; simpler than Nightshade for non-image content. |
| **Human impact**           | None            | Never served to whitelisted agents or direct browser visits. |
| **Persistence in models**  | N/A             | Not poisoning; causes pipeline failure rather than data corruption. |
| **Legal surface**          | Elevated        | Active resource-exhaustion vectors; consult counsel. See Section 4.1 and the closing note. |

Empirical reports from tarpit deployments (Nepenthes, Iocaine) show 70-90% drop in malicious crawler activity within days; decompression bombs amplify the cost per request for any pipeline that decompresses.

## 4 -- Objections and Rebuttals

### 4.1 -- "This is an active attack, not passive defense"
**Rebuttal**: The response is strictly conditional on policy violation (ignoring `robots.txt`). The server returns exactly what the requester asked for under the URI they fetched. No unsolicited packets or exploits are sent. This mirrors the legal bright line drawn in Section 5.5 of the primary dissertation.

### 4.2 -- "Scrapers will simply skip unknown file types or add decompression guards"
**Rebuttal**: Dataset curation at scale still relies on generic `requests.get()` + `zipfile` / `tarfile` / `gzip` pipelines for many content types. Adding per-format guards increases engineering cost, which is the exact economic signal we seek to impose. Future adaptive scrapers will pay the tax; individuals benefit from the delay.
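To make the "per-format guard" concrete, here is a minimal sketch of what such a guard looks like for a single gzip stream, assuming the pipeline already holds the compressed bytes in memory. The function name `bounded_gunzip` and the limit values are hypothetical; equivalent logic would be needed separately for ZIP, TAR, and each nested layer.

```python
import zlib


def bounded_gunzip(blob: bytes, limit: int, chunk: int = 64 * 1024) -> bytes:
    """Decompress one gzip stream, refusing to produce more than `limit` bytes."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decoder = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = bytearray()
    data = blob
    while not decoder.eof:
        # max_length caps how much output a single call may return
        piece = decoder.decompress(data, chunk)
        out += piece
        if len(out) > limit:
            raise ValueError("decompressed size exceeds limit")
        data = decoder.unconsumed_tail  # input held back by the cap
        if not piece and not data:
            break  # truncated input: nothing more can be produced
    return bytes(out)
```

Memory use is bounded by `limit` plus one chunk regardless of what the stream claims to contain. The guard is short, but it has to be written, tested, and applied at every decompression site, which is the engineering cost the rebuttal refers to.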

### 4.3 -- "Risk of collateral damage to legitimate researchers"
**Rebuttal**: Gate strictly behind verified-bot detection and `Disallow` paths. Maintain an explicit whitelist (reverse-DNS for major engines + published IA ranges). Individuals who publish scientific data can additionally expose an `ai.txt` or signed C2PA metadata allowing opt-in ingestion for approved actors.
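The reverse-DNS whitelist check mentioned above is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it maps to the same IP. The following is a sketch under stated assumptions: the suffix list is supplied by the operator and must come from each crawler's published documentation, the default resolvers handle IPv4 only, and `is_verified_crawler` is a hypothetical helper.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    """PTR lookup: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """A lookup: hostname to list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check against operator-supplied suffixes."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    suffixes = [s.lower() for s in allowed_suffixes]
    if not any(host == s or host.endswith("." + s) for s in suffixes):
        return False  # hostname is outside every allowed domain
    try:
        return ip in forward(host)  # the name must resolve back to the same IP
    except OSError:
        return False
```

The resolvers are injectable so the logic can be tested offline and so lookups can be cached or swapped for an IPv6-aware implementation. Any lookup failure returns `False`, so only positively verified crawlers are treated as whitelisted.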

## 5 -- Implementation Recommendations for Individuals

1. Start with passive layers (Anubis + Nepenthes) before escalating.
2. Generate 2-3 bomb variants (zip, gz, tar) and rotate them.
3. Log hits (with UA and IP, respecting privacy) to quantify impact.
4. Combine with canary tokens: include unique strings in the bomb metadata so any regurgitation in model output can be traced.
5. For video/audio creators: embed the bomb as an "alternate download" in RSS feeds or hidden `<link>` tags that only aggressive parsers follow.
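Recommendations 3 and 4 can be sketched together. This is a minimal illustration, assuming a per-site secret salt kept outside the log and a JSON-lines log format; the names `make_canary` and `log_record` and the field names are hypothetical.

```python
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional


def make_canary(prefix: str = "canary") -> str:
    """Generate a unique, searchable token to embed in served metadata."""
    return f"{prefix}-{secrets.token_hex(8)}"


def log_record(ip: str, user_agent: str, path: str, salt: bytes,
               now: Optional[float] = None) -> str:
    """Build one JSON log line; the IP is stored only as a keyed hash."""
    # A keyed hash lets repeat visitors be counted without retaining raw IPs
    ip_tag = hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    record = {
        "ts": int(time.time() if now is None else now),
        "ip_tag": ip_tag,
        "ua": user_agent[:200],  # truncate oversized headers
        "path": path,
    }
    return json.dumps(record, sort_keys=True)
```

Storing a keyed hash instead of the raw address is one way to respect privacy while still counting distinct sources; operators who need raw IPs for abuse reports would have to log them separately under their own retention policy.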

## 6 -- References

| Section | Claim | Source |
|---------|-------|--------|
| 1.1 | Recursive zip bomb construction and expansion ratios | "Zip Bomb" Wikipedia; original 42.zip analysis (various mirrors, 2000s) |
| 1.2 | Historical mail-server and AV DoS via decompression | CERT advisories CA-1999-07, CA-2000-12 (archived) |
| 2.1 | Conditional serving via User-Agent in nginx | nginx `map` directive documentation; example configs in Anubis deployments |
| 2.3 | Python gzip bomb generator pattern | Adapted from public gist examples and `zip-bomb` project (https://github.com/iamtraction/zip-bomb) |
| 3 | Tarpit effectiveness metrics | Ars Technica coverage of Nepenthes/Iocaine (Jan 2025) |
| 4.1 | Legal framing of "serve what is requested" | Consistent with hiQ v. LinkedIn analysis and CFAA public scraping precedent |

## 7 -- Conclusion

Decompression bombs represent a high-leverage, low-maintenance tool in the individual creator's arsenal. By exploiting the economic asymmetry of ingestion pipelines, they convert a single HTTP request into hours of wasted compute for non-compliant actors. Used responsibly behind proper gating and with legal awareness, they reinforce the central thesis: the power to protect creative output resides with the creator, not the scraper.

*Document prepared in the style and spirit of the primary dissertation "When Being Polite Fails, Try Poison". Individual operators are advised to review local laws and consult qualified legal counsel before deployment of active denial measures.*