leetcrypt bbb9e82425 docs: plan for AI agent context + local-perf improvements

Roadmap for deepening the /ai agent's conversational context while keeping
the RAM-only philosophy, plus Ollama latency wins. Marks Tier 1 (backfill,
token-budget window) and the perf tuning as in-scope now; RAG and in-RAM
compaction staged next. Grounded in public Anthropic docs, not leaked source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-06-02 17:43:02 -07:00

3.4 KiB

Raw Blame History

AI agent: context & local-performance plan

How the /ai agent gets conversational context, how to deepen it without breaking the RAM-only philosophy, and how to make the local (Ollama) path faster. Everything here stays in process memory — no disk persistence of conversation data, no embeddings on disk. Context dies with the agent process, exactly like the room itself.

Current state (baseline)

AgentBridge.transcript: list[Msg] — one flat in-RAM list (bridge.py).
Passive capture: every non-addressed line is appended as Msg("user", "sender: text"), trimmed to context_window * 2 (24) messages.
On /ai: sends system_prompt + transcript[-context_window:] (last 12).
Sandbox: same last-12 window + the task.
OllamaProvider.complete posts stream=False, no options, no keep_alive.

Limitations

Recency-only window — no relevance; old context is dropped forever.
Join amnesia — the agent only knows messages seen since connecting, even though the server already sends it the full backlog and the bridge throws it away (init handler reads only users).
Message-count budget, not token budget — fragile on small models.
Flat, untyped transcript — all senders flattened to role user.
stream=False + cold model — high perceived latency.

Key enabling fact

The server keeps the last 1000 (encrypted) messages in RAM (MessageStore, stores.py) and ships them all in the init frame (helpers.send_state). That is a RAM-only history the agent can backfill from on join at zero new cost.

Plan

Tier 1 — context foundation (this branch)

Backfill on join. Consume init.messages: decrypt with room_fernet, drop control frames ({"_…) and our own lines, append to transcript, trim to budget. Pure RAM, ephemeral. (implementing)
Token-budget windowing. Replace the fixed [-12:] slices with a tail-by-token-budget selector (char/4 estimate), capped by a max message count. Used by both the answer and sandbox paths. (implementing)

Tier 1.5 — local performance (this branch)

Pin the model in VRAM via Ollama keep_alive to kill cold-reload stalls.
Tune Ollama options — explicit num_ctx (so the larger window in #1/#2 is actually honored) and bounded num_predict. (implementing)

Tier 2 — deeper context (next branch)

In-RAM semantic retrieval (RAG, no disk). Embed each captured message with the already-present nomic-embed-text, hold vectors in a numpy array in memory; on a /ai question retrieve top-k by cosine and prepend to the recency window. Fully ephemeral.
In-RAM hierarchical compaction. When over budget, summarize the oldest chunk into a single rolling Msg("system", "earlier: …") instead of dropping it — the Claude Code auto-compaction pattern, kept in RAM.

Tier 3 — latency & throughput (next branch)

Token streaming to the room (incremental chat frames) so replies appear as they generate.
Stable prompt prefix (system + summary + retrieved block in fixed order) for Ollama KV-cache reuse across turns.
Single-flight queue so concurrent /ai calls don't pile threads onto one Ollama instance.

Notes on provenance

All patterns above are grounded in Anthropic's public documentation (context compaction, prompt caching, token-budgeted assembly) and the open Agent SDK — no leaked/proprietary source was used.

3.4 KiB Raw Blame History