Give the agent recall of things said beyond the verbatim window, without
breaking the RAM-only philosophy — nothing is persisted to disk.
- MemoryIndex: a capped, in-memory pool of embedded messages with pure-Python
cosine search (no numpy). Retains far more than the rolling transcript so old
lines can be surfaced on demand; oldest evicted past the cap to bound RAM.
- OllamaEmbedder: local embeddings via nomic-embed-text, on by default and
independent of the chat provider (reuses the Ollama host when chat is Ollama).
- Bridge: captured room messages (live + backfilled) are embedded on a
background worker so a slow embedder can't stall frame draining. On a /ai
question the agent retrieves top-k relevant lines, drops weak (<min_score) and
windowed-duplicate hits, and prepends them as a clearly-fenced "recalled
context" preamble — kept at user role, never elevated to system, so untrusted
room text informs without instructing. Falls back to recency-only if the
embedder is unreachable.
- CLI: --no-rag, --embed-model, --embed-host, --rag-top-k.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Roadmap for deepening the /ai agent's conversational context while keeping
the RAM-only philosophy, plus Ollama latency wins. Marks Tier 1 (backfill,
token-budget window) and the perf tuning as in-scope now; RAG and in-RAM
compaction staged next. Grounded in public Anthropic docs, not leaked source.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>