Add techniques/canaries/technical_honey_canary.md

2026-06-04 16:34:29 +00:00
parent be48218690
commit 2ee8bf470d
1 changed files with 40 additions and 0 deletions
@@ -0,0 +1,40 @@
+# Canary Tokens & Honeytokens — Attribution and Early Warning for Content Theft
+
+**Canary tokens** and **honeytokens** are unique, high-entropy identifiers embedded in content or responses. When one later appears in model output or a public dataset, or triggers an external request, the creator gains strong, independently checkable evidence of ingestion. This technique turns passive content into active sensors that detect unauthorized use long after scraping occurs.
+
+## Why Canary Tokens Matter
+
+Most existing defenses (PoW walls, tarpits, rate limiting) are real-time. They protect the live site but provide no evidence once data has left the origin. Canary tokens solve the attribution problem. A single unique string published in 2026 can later appear in a model’s training corpus or generated output, giving the individual creator concrete proof that their work was consumed without permission.
+
+This directly supports future legal or regulatory claims under frameworks such as the EU AI Act Article 53 and strengthens the economic argument that scraping has real, traceable consequences.
+
+## How It Fits the Defense Stack
+
+1. **Anubis** (`anubis.md`) - First filter (computational challenge).
+2. **Nepenthes** (`nepenthes-tarpit.md`) - Second filter (resource trap).
+3. **Active denial** (bombs, malformed content, slowloris) - Third layer (immediate cost imposition).
+4. **Fail2Ban + rate limiting** - Enforcement layer.
+5. **Canary tokens & honeytokens** (this document) - Attribution and long-term evidence layer.
+
+Canary tokens are the only technique in the stack that continues to provide value even after the data has been stolen. They close the loop between real-time protection and post-ingestion accountability.
+
+## Key Benefits for Individuals
+
+- **Cryptographic-strength evidence** - A high-entropy string (for example, 128 random bits) is practically impossible to guess or to reproduce by chance, so finding it elsewhere is strong evidence that the content was copied.
+- **Zero maintenance after publication** - Once embedded, tokens require no further action until triggered.
+- **Multi-format coverage** - Works in HTML, images (steganography), audio metadata, video subtitles, PDFs, and JSON-LD.
+- **Early warning** - Hidden links and DNS-based tokens can alert the creator the moment a bot follows or resolves them.
+- **Legal leverage** - Provides concrete data points for DMCA notices, regulatory complaints, or future litigation.
+
+## Types of Canary Tokens
+
+- **Static strings** - Unique phrases or UUIDs embedded in text or metadata.
+- **Hidden links** - Invisible or low-contrast `<a>` tags that only aggressive parsers follow.
+- **DNS / web bug tokens** - Unique subdomains or URLs that fire an alert when a bot resolves or fetches them.
+- **Steganographic tokens** - Data hidden inside images or audio files.
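The first three token types above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `cnry-` prefix, the `https://example.org/c/` path, and the `canary.example.org` zone are placeholder assumptions, and the function names are hypothetical.

```python
import secrets


def new_token(nbytes: int = 16) -> str:
    """Return a random hex token; 16 bytes gives 128 bits of entropy."""
    return secrets.token_hex(nbytes)


def static_string(token: str) -> str:
    # Hypothetical "cnry-" prefix so the token is easy to search for later.
    return f"cnry-{token}"


def hidden_link(token: str, base_url: str = "https://example.org/c/") -> str:
    # Hidden from human visitors via inline CSS; a crawler that ignores
    # styling and follows every href will request the unique URL.
    return (
        f'<a href="{base_url}{token}" rel="nofollow" aria-hidden="true" '
        f'tabindex="-1" style="display:none">archive</a>'
    )


def dns_hostname(token: str, zone: str = "canary.example.org") -> str:
    # A 32-character hex token fits within the 63-character DNS label limit.
    return f"{token}.{zone}"


if __name__ == "__main__":
    token = new_token()
    print(static_string(token))
    print(hidden_link(token))
    print(dns_hostname(token))
```

Using `secrets` rather than `random` matters here: the token should be unpredictable, so that nobody can plausibly claim to have generated it independently.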
+
+## Recommended Starting Point
+
+Begin by adding unique canary strings to every high-value page, image sidecar, and audio metadata file. Combine this with the aggressive-bot conditional serving logic so that only violators receive the most sensitive tokens. Over time, maintain a private ledger of published canaries and their publication dates; this log becomes invaluable evidence.
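One way to keep the private ledger described above is an append-only JSON Lines file. The sketch below assumes a hypothetical file layout (one JSON object per line with `token`, `location`, and `published` fields); the function names are illustrative and not part of any existing tool.

```python
import json
import secrets
from datetime import datetime, timezone
from pathlib import Path


def record_canary(ledger: Path, location: str) -> str:
    """Generate a token, append it to the ledger, and return it."""
    token = secrets.token_hex(16)
    entry = {
        "token": token,
        "location": location,  # where the token was embedded
        "published": datetime.now(timezone.utc).isoformat(),
    }
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return token


def load_ledger(ledger: Path) -> list[dict]:
    """Read every ledger entry, skipping blank lines."""
    with ledger.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def find_canaries(ledger: Path, text: str) -> list[dict]:
    """Return the ledger entries whose token appears in the given text."""
    return [entry for entry in load_ledger(ledger) if entry["token"] in text]
```

Calling `find_canaries(ledger, suspect_text)` on model output or a dataset sample returns the matching entries together with the dates they were published. The ledger should stay private and be backed up, since a leaked ledger would let a scraper filter the tokens out.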
+
+*Canary tokens and honeytokens provide the missing attribution layer. They ensure that even successful scraping leaves a detectable signature that can be used for accountability.*