hack-house/docs/ai-context-plan.md
leetcrypt bbb9e82425 docs: plan for AI agent context + local-perf improvements
Roadmap for deepening the /ai agent's conversational context while keeping
the RAM-only philosophy, plus Ollama latency wins. Marks Tier 1 (backfill,
token-budget window) and the perf tuning as in-scope now; RAG and in-RAM
compaction staged next. Grounded in public Anthropic docs, not leaked source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 17:43:02 -07:00

AI agent: context & local-performance plan

This plan covers how the /ai agent gets conversational context, how to deepen that context without breaking the RAM-only philosophy, and how to make the local (Ollama) path faster. Everything here stays in process memory: no disk persistence of conversation data, no embeddings on disk. Context dies with the agent process, exactly like the room itself.

Current state (baseline)

  • AgentBridge.transcript: list[Msg] — one flat in-RAM list (bridge.py).
  • Passive capture: every non-addressed line is appended as Msg("user", "sender: text"), trimmed to context_window * 2 (24) messages.
  • On /ai: sends system_prompt + transcript[-context_window:] (last 12).
  • Sandbox: same last-12 window + the task.
  • OllamaProvider.complete posts stream=False, no options, no keep_alive.

Limitations

  1. Recency-only window — no relevance; old context is dropped forever.
  2. Join amnesia. The agent knows only the messages seen since it connected: the server already sends the full backlog in the init frame, but the bridge discards it (the init handler reads only users).
  3. Message-count budget, not token budget — fragile on small models.
  4. Flat, untyped transcript — all senders flattened to role user.
  5. stream=False + cold model — high perceived latency.

Key enabling fact

The server keeps the last 1000 (encrypted) messages in RAM (MessageStore, stores.py) and ships them all in the init frame (helpers.send_state). Because that history already lives in RAM and already reaches the agent, backfilling from it on join needs no new storage and no server change.
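A minimal sketch of what consuming that backlog on join might look like, using the filters the plan lists under Tier 1 (drop control frames and the agent's own lines, then trim). The frame shape (`sender` and `text` keys), the `decrypt` callable standing in for room_fernet, and the `backfill` function itself are assumptions for illustration, not the bridge's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Msg:
    role: str
    content: str


def backfill(
    transcript: list[Msg],
    backlog: Iterable[dict],
    decrypt: Callable[[str], str],
    own_name: str,
    max_messages: int = 24,
) -> list[Msg]:
    """Append decrypted backlog lines to the transcript, keeping the newest max_messages."""
    for frame in backlog:
        # Assumed frame shape: {"sender": str, "text": <ciphertext>}.
        try:
            text = decrypt(frame["text"])
        except Exception:
            # Skip anything that fails to decrypt rather than aborting the join.
            continue
        if text.startswith('{"_'):
            continue  # control frame, not conversation
        if frame["sender"] == own_name:
            continue  # the agent's own earlier lines
        transcript.append(Msg("user", f"{frame['sender']}: {text}"))
    # Trim in place so only the newest messages stay in RAM.
    if len(transcript) > max_messages:
        del transcript[: len(transcript) - max_messages]
    return transcript
```

Nothing is written to disk: the backlog is folded into the same in-RAM list that passive capture already uses, so it disappears with the process.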

Plan

Tier 1 — context foundation (this branch)

  1. Backfill on join. Consume init.messages: decrypt with room_fernet, drop control frames ({"_…) and our own lines, append to transcript, trim to budget. Pure RAM, ephemeral. (implementing)
  2. Token-budget windowing. Replace the fixed [-12:] slices with a tail-by-token-budget selector (char/4 estimate), capped by a max message count. Used by both the answer and sandbox paths. (implementing)
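The token-budget selector in step 2 could be sketched as below. It walks the transcript from newest to oldest and stops at whichever limit is hit first, the token budget or the message cap. The names `estimate_tokens` and `tail_by_budget` are hypothetical, and the char/4 figure is only the rough estimate the plan mentions, not a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    content: str


def estimate_tokens(text: str) -> int:
    """Rough char/4 estimate; every message costs at least one token."""
    return max(1, len(text) // 4)


def tail_by_budget(
    transcript: list[Msg], token_budget: int, max_messages: int
) -> list[Msg]:
    """Return the newest messages that fit the token budget and the message cap."""
    picked: list[Msg] = []
    used = 0
    for msg in reversed(transcript):
        cost = estimate_tokens(msg.content)
        if len(picked) >= max_messages or used + cost > token_budget:
            break
        picked.append(msg)
        used += cost
    picked.reverse()  # restore chronological order
    return picked
```

Because both the answer path and the sandbox path would call the same selector, the two windows cannot drift apart the way two hard-coded `[-12:]` slices can.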

Tier 1.5 — local performance (this branch)

  1. Keep the model loaded between requests via Ollama keep_alive (in VRAM on a GPU host) to avoid cold-reload stalls.
  2. Tune Ollama options: an explicit num_ctx (so the larger window from Tier 1 is actually honored) and a bounded num_predict. (implementing)
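As an illustration of both tuning steps, a request body for Ollama's chat endpoint might be assembled like this. The `keep_alive`, `options.num_ctx`, and `options.num_predict` fields are part of Ollama's request format; the helper name `build_chat_payload`, the model name, and the specific values are assumptions, and the source does not say which endpoint OllamaProvider.complete posts to.

```python
def build_chat_payload(
    model: str,
    messages: list[dict],
    *,
    num_ctx: int = 8192,      # illustrative value: context size the model should honor
    num_predict: int = 512,   # illustrative value: upper bound on generated tokens
    keep_alive: str = "30m",  # illustrative value: how long the model stays loaded
) -> dict:
    """Build a JSON-serializable body for a non-streaming Ollama chat request."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "keep_alive": keep_alive,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }
```

Without an explicit num_ctx, a prompt longer than the model's default context can be truncated, which would undo the larger window that token-budget selection produces.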

Tier 2 — deeper context (next branch)

  1. In-RAM semantic retrieval (RAG, no disk). Embed each captured message with the already-present nomic-embed-text and hold the vectors in an in-memory numpy array; on a /ai question, retrieve the top-k messages by cosine similarity and prepend them to the recency window. Fully ephemeral.
  2. In-RAM hierarchical compaction. When over budget, summarize the oldest chunk into a single rolling Msg("system", "earlier: …") instead of dropping it — the Claude Code auto-compaction pattern, kept in RAM.
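The retrieval step in item 1 can be sketched with nothing but in-memory lists. This version uses plain Python instead of the numpy array the plan calls for, purely to stay dependency-free; the `RamIndex` class is hypothetical, and `embed` is an injected callable that in practice would wrap whatever produces the nomic-embed-text vectors.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RamIndex:
    """Holds texts and their vectors in process memory only."""

    def __init__(self, embed: Callable[[str], Sequence[float]]) -> None:
        self._embed = embed
        self._texts: list[str] = []
        self._vectors: list[Sequence[float]] = []

    def add(self, text: str) -> None:
        self._texts.append(text)
        self._vectors.append(self._embed(text))

    def top_k(self, query: str, k: int = 3) -> list[str]:
        """Return up to k stored texts, most similar to the query first."""
        q = self._embed(query)
        scored = sorted(
            zip(self._texts, self._vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]
```

The retrieved texts would then be prepended to the recency window before the request is built. Dropping the index object drops every vector with it, which keeps the design ephemeral.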

Tier 3 — latency & throughput (next branch)

  1. Token streaming to the room (incremental chat frames) so replies appear as they generate.
  2. Stable prompt prefix (system + summary + retrieved block in fixed order) for Ollama KV-cache reuse across turns.
  3. Single-flight queue so concurrent /ai calls don't pile threads onto one Ollama instance.
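One simple way to get the single-flight behavior in item 3 is a one-worker executor: callers still get a future back immediately, but the calls themselves run one at a time in submission order. The `SingleFlight` wrapper is a hypothetical sketch, not the planned implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class SingleFlight:
    """Run submitted calls one at a time, in the order they were submitted."""

    def __init__(self) -> None:
        # A single worker thread means at most one call is in flight.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, fn, *args, **kwargs) -> Future:
        """Queue a call; the returned future resolves when its turn has run."""
        return self._pool.submit(fn, *args, **kwargs)

    def close(self) -> None:
        """Wait for queued calls to finish, then stop the worker."""
        self._pool.shutdown(wait=True)
```

Each /ai handler would submit its provider call here instead of spawning its own thread, so a burst of requests queues up behind one Ollama call rather than contending for the same instance.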

Notes on provenance

All patterns above are grounded in Anthropic's public documentation (context compaction, prompt caching, token-budgeted assembly) and the open Agent SDK — no leaked/proprietary source was used.