Roadmap for deepening the /ai agent's conversational context while keeping the RAM-only philosophy, plus Ollama latency wins. Marks Tier 1 (backfill, token-budget window) and the perf tuning as in-scope now; RAG and in-RAM compaction staged next. Grounded in public Anthropic docs, not leaked source. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3.4 KiB
3.4 KiB
AI agent: context & local-performance plan
How the /ai agent gets conversational context, how to deepen it without
breaking the RAM-only philosophy, and how to make the local (Ollama) path
faster. Everything here stays in process memory — no disk persistence of
conversation data, no embeddings on disk. Context dies with the agent process,
exactly like the room itself.
Current state (baseline)
AgentBridge.transcript: list[Msg]— one flat in-RAM list (bridge.py).- Passive capture: every non-addressed line is appended as
Msg("user", "sender: text"), trimmed tocontext_window * 2(24) messages. - On
/ai: sendssystem_prompt + transcript[-context_window:](last 12). - Sandbox: same last-12 window + the task.
OllamaProvider.completepostsstream=False, nooptions, nokeep_alive.
Limitations
- Recency-only window — no relevance; old context is dropped forever.
- Join amnesia — the agent only knows messages seen since connecting,
even though the server already sends it the full backlog and the bridge
throws it away (
inithandler reads onlyusers). - Message-count budget, not token budget — fragile on small models.
- Flat, untyped transcript — all senders flattened to role
user. stream=False+ cold model — high perceived latency.
Key enabling fact
The server keeps the last 1000 (encrypted) messages in RAM (MessageStore,
stores.py) and ships them all in the init frame (helpers.send_state).
That is a RAM-only history the agent can backfill from on join at zero new cost.
Plan
Tier 1 — context foundation (this branch)
- Backfill on join. Consume
init.messages: decrypt withroom_fernet, drop control frames ({"_…) and our own lines, append totranscript, trim to budget. Pure RAM, ephemeral. (implementing) - Token-budget windowing. Replace the fixed
[-12:]slices with a tail-by-token-budget selector (char/4 estimate), capped by a max message count. Used by both the answer and sandbox paths. (implementing)
Tier 1.5 — local performance (this branch)
- Pin the model in VRAM via Ollama
keep_aliveto kill cold-reload stalls. - Tune Ollama
options— explicitnum_ctx(so the larger window in #1/#2 is actually honored) and boundednum_predict. (implementing)
Tier 2 — deeper context (next branch)
- In-RAM semantic retrieval (RAG, no disk). Embed each captured message
with the already-present
nomic-embed-text, hold vectors in a numpy array in memory; on a/aiquestion retrieve top-k by cosine and prepend to the recency window. Fully ephemeral. - In-RAM hierarchical compaction. When over budget, summarize the oldest
chunk into a single rolling
Msg("system", "earlier: …")instead of dropping it — the Claude Code auto-compaction pattern, kept in RAM.
Tier 3 — latency & throughput (next branch)
- Token streaming to the room (incremental chat frames) so replies appear as they generate.
- Stable prompt prefix (system + summary + retrieved block in fixed order) for Ollama KV-cache reuse across turns.
- Single-flight queue so concurrent
/aicalls don't pile threads onto one Ollama instance.
Notes on provenance
All patterns above are grounded in Anthropic's public documentation (context compaction, prompt caching, token-budgeted assembly) and the open Agent SDK — no leaked/proprietary source was used.