Give the agent recall of things said beyond the verbatim window, without breaking the RAM-only philosophy — nothing is persisted to disk. - MemoryIndex: a capped, in-memory pool of embedded messages with pure-Python cosine search (no numpy). Retains far more than the rolling transcript so old lines can be surfaced on demand; oldest evicted past the cap to bound RAM. - OllamaEmbedder: local embeddings via nomic-embed-text, on by default and independent of the chat provider (reuses the Ollama host when chat is Ollama). - Bridge: captured room messages (live + backfilled) are embedded on a background worker so a slow embedder can't stall frame draining. On a /ai question the agent retrieves top-k relevant lines, drops weak (<min_score) and windowed-duplicate hits, and prepends them as a clearly-fenced "recalled context" preamble — kept at user role, never elevated to system, so untrusted room text informs without instructing. Falls back to recency-only if the embedder is unreachable. - CLI: --no-rag, --embed-model, --embed-host, --rag-top-k. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3.8 KiB
3.8 KiB
AI agent: context & local-performance plan
How the /ai agent gets conversational context, how to deepen it without
breaking the RAM-only philosophy, and how to make the local (Ollama) path
faster. Everything here stays in process memory — no disk persistence of
conversation data, no embeddings on disk. Context dies with the agent process,
exactly like the room itself.
Current state (baseline)
AgentBridge.transcript: list[Msg]— one flat in-RAM list (bridge.py).- Passive capture: every non-addressed line is appended as
Msg("user", "sender: text"), trimmed tocontext_window * 2(24) messages. - On
/ai: sendssystem_prompt + transcript[-context_window:](last 12). - Sandbox: same last-12 window + the task.
OllamaProvider.completepostsstream=False, nooptions, nokeep_alive.
Limitations
- Recency-only window — no relevance; old context is dropped forever.
- Join amnesia — the agent only knows messages seen since connecting,
even though the server already sends it the full backlog and the bridge
throws it away (
inithandler reads onlyusers). - Message-count budget, not token budget — fragile on small models.
- Flat, untyped transcript — all senders flattened to role
user. stream=False+ cold model — high perceived latency.
Key enabling fact
The server keeps the last 1000 (encrypted) messages in RAM (MessageStore,
stores.py) and ships them all in the init frame (helpers.send_state).
That is a RAM-only history the agent can backfill from on join at zero new cost.
Plan
Tier 1 — context foundation (this branch)
- Backfill on join. Consume
init.messages: decrypt withroom_fernet, drop control frames ({"_…) and our own lines, append totranscript, trim to budget. Pure RAM, ephemeral. (implementing) - Token-budget windowing. Replace the fixed
[-12:]slices with a tail-by-token-budget selector (char/4 estimate), capped by a max message count. Used by both the answer and sandbox paths. (implementing)
Tier 1.5 — local performance (this branch)
- Pin the model in VRAM via Ollama
keep_aliveto kill cold-reload stalls. - Tune Ollama
options— explicitnum_ctx(so the larger window in #1/#2 is actually honored) and boundednum_predict. (implementing)
Tier 2 — deeper context
- In-RAM semantic retrieval (RAG, no disk). (done) Each captured message
is embedded with the already-present
nomic-embed-textand held in a capped in-memoryMemoryIndex(pure-Python cosine, no numpy). On a/aiquestion the agent embeds the query, retrieves top-k, drops weak/duplicate hits, and prepends them as a clearly-fenced "recalled context" preamble (never system role — keeps untrusted text from instructing). Embedding runs on a background worker so it can't stall the recv loop; if the embedder is unreachable it degrades to recency-only. Toggle with--no-rag/--rag-top-k. - In-RAM hierarchical compaction. (staged) When over budget, summarize the oldest
chunk into a single rolling
Msg("system", "earlier: …")instead of dropping it — the Claude Code auto-compaction pattern, kept in RAM.
Tier 3 — latency & throughput (next branch)
- Token streaming to the room (incremental chat frames) so replies appear as they generate.
- Stable prompt prefix (system + summary + retrieved block in fixed order) for Ollama KV-cache reuse across turns.
- Single-flight queue so concurrent
/aicalls don't pile threads onto one Ollama instance.
Notes on provenance
All patterns above are grounded in Anthropic's public documentation (context compaction, prompt caching, token-budgeted assembly) and the open Agent SDK — no leaked/proprietary source was used.