From bbb9e824252f16c98bb2aa26da6b6d4032507bfa Mon Sep 17 00:00:00 2001
From: leetcrypt <leetcrypt@users.noreply.github.com>
Date: Tue, 2 Jun 2026 17:43:02 -0700
Subject: [PATCH] docs: plan for AI agent context + local-perf improvements

Roadmap for deepening the /ai agent's conversational context while keeping
the RAM-only philosophy, plus Ollama latency wins. Marks Tier 1 (backfill,
token-budget window) and the perf tuning as in-scope now; RAG and in-RAM
compaction staged next. Grounded in public Anthropic docs, not leaked source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 docs/ai-context-plan.md | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 docs/ai-context-plan.md

diff --git a/docs/ai-context-plan.md b/docs/ai-context-plan.md
new file mode 100644
index 0000000..8a5b3f2
--- /dev/null
+++ b/docs/ai-context-plan.md
@@ -0,0 +1,66 @@
+# AI agent: context & local-performance plan
+
+How the `/ai` agent gets conversational context, how to deepen it **without
+breaking the RAM-only philosophy**, and how to make the local (Ollama) path
+faster. Everything here stays in process memory — no disk persistence of
+conversation data, no embeddings on disk. Context dies with the agent process,
+exactly like the room itself.
+
+## Current state (baseline)
+
+- `AgentBridge.transcript: list[Msg]` — one flat in-RAM list (`bridge.py`).
+- Passive capture: every non-addressed line is appended as `Msg("user", "sender: text")`, trimmed to `context_window * 2` (24) messages.
+- On `/ai`: sends `system_prompt + transcript[-context_window:]` (last 12).
+- Sandbox: same last-12 window + the task.
+- `OllamaProvider.complete` posts `stream=False`, no `options`, no `keep_alive`.
+
+### Limitations
+1. **Recency-only window** — no relevance; old context is dropped forever.
+2. **Join amnesia** — the agent only knows messages seen since connecting,
+   **even though the server already sends it the full backlog** and the bridge
+   throws it away (`init` handler reads only `users`).
+3. **Message-count budget, not token budget** — fragile on small models.
+4. **Flat, untyped transcript** — all senders flattened to role `user`.
+5. **`stream=False` + cold model** — high perceived latency.
+
+### Key enabling fact
+The server keeps the last 1000 (encrypted) messages in RAM (`MessageStore`,
+`stores.py`) and ships them all in the `init` frame (`helpers.send_state`).
+That is a RAM-only history the agent can backfill from on join at zero new cost.
+
+## Plan
+
+### Tier 1 — context foundation (this branch)
+1. **Backfill on join.** Consume `init.messages`: decrypt with `room_fernet`,
+   drop control frames (`{"_…`) and our own lines, append to `transcript`,
+   trim to budget. Pure RAM, ephemeral. *(implementing)*
+2. **Token-budget windowing.** Replace the fixed `[-12:]` slices with a
+   tail-by-token-budget selector (char/4 estimate), capped by a max message
+   count. Used by both the answer and sandbox paths. *(implementing)*
+
+### Tier 1.5 — local performance (this branch)
+6. **Pin the model in VRAM** via Ollama `keep_alive` to kill cold-reload stalls.
+8. **Tune Ollama `options`** — explicit `num_ctx` (so the larger window in #1/#2
+   is actually honored) and bounded `num_predict`. *(implementing)*
+
+### Tier 2 — deeper context (next branch)
+3. **In-RAM semantic retrieval (RAG, no disk).** Embed each captured message
+   with the already-present `nomic-embed-text`, hold vectors in a numpy array in
+   memory; on a `/ai` question retrieve top-k by cosine and prepend to the
+   recency window. Fully ephemeral.
+4. **In-RAM hierarchical compaction.** When over budget, summarize the oldest
+   chunk into a single rolling `Msg("system", "earlier: …")` instead of dropping
+   it — the Claude Code auto-compaction pattern, kept in RAM.
+
+### Tier 3 — latency & throughput (next branch)
+5. **Token streaming** to the room (incremental chat frames) so replies appear
+   as they generate.
+7. **Stable prompt prefix** (system + summary + retrieved block in fixed order)
+   for Ollama KV-cache reuse across turns.
+9. **Single-flight queue** so concurrent `/ai` calls don't pile threads onto one
+   Ollama instance.
+
+## Notes on provenance
+All patterns above are grounded in Anthropic's **public** documentation (context
+compaction, prompt caching, token-budgeted assembly) and the open Agent SDK —
+no leaked/proprietary source was used.