leetcrypt e5e1ad8dee feat(ai): in-RAM semantic recall (RAG) for conversation context

Give the agent recall of things said beyond the verbatim window, without
breaking the RAM-only philosophy — nothing is persisted to disk.

- MemoryIndex: a capped, in-memory pool of embedded messages with pure-Python
  cosine search (no numpy). Retains far more than the rolling transcript so old
  lines can be surfaced on demand; oldest evicted past the cap to bound RAM.
- OllamaEmbedder: local embeddings via nomic-embed-text, on by default and
  independent of the chat provider (reuses the Ollama host when chat is Ollama).
- Bridge: captured room messages (live + backfilled) are embedded on a
  background worker so a slow embedder can't stall frame draining. On a /ai
  question the agent retrieves top-k relevant lines, drops weak (<min_score) and
  windowed-duplicate hits, and prepends them as a clearly-fenced "recalled
  context" preamble — kept at user role, never elevated to system, so untrusted
  room text informs without instructing. Falls back to recency-only if the
  embedder is unreachable.
- CLI: --no-rag, --embed-model, --embed-host, --rag-top-k.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-06-02 17:59:01 -07:00

3.8 KiB

Raw Blame History

AI agent: context & local-performance plan

How the /ai agent gets conversational context, how to deepen it without breaking the RAM-only philosophy, and how to make the local (Ollama) path faster. Everything here stays in process memory — no disk persistence of conversation data, no embeddings on disk. Context dies with the agent process, exactly like the room itself.

Current state (baseline)

AgentBridge.transcript: list[Msg] — one flat in-RAM list (bridge.py).
Passive capture: every non-addressed line is appended as Msg("user", "sender: text"), trimmed to context_window * 2 (24) messages.
On /ai: sends system_prompt + transcript[-context_window:] (last 12).
Sandbox: same last-12 window + the task.
OllamaProvider.complete posts stream=False, no options, no keep_alive.

Limitations

Recency-only window — no relevance; old context is dropped forever.
Join amnesia — the agent only knows messages seen since connecting, even though the server already sends it the full backlog and the bridge throws it away (init handler reads only users).
Message-count budget, not token budget — fragile on small models.
Flat, untyped transcript — all senders flattened to role user.
stream=False + cold model — high perceived latency.

Key enabling fact

The server keeps the last 1000 (encrypted) messages in RAM (MessageStore, stores.py) and ships them all in the init frame (helpers.send_state). That is a RAM-only history the agent can backfill from on join at zero new cost.

Plan

Tier 1 — context foundation (this branch)

Backfill on join. Consume init.messages: decrypt with room_fernet, drop control frames ({"_…) and our own lines, append to transcript, trim to budget. Pure RAM, ephemeral. (implementing)
Token-budget windowing. Replace the fixed [-12:] slices with a tail-by-token-budget selector (char/4 estimate), capped by a max message count. Used by both the answer and sandbox paths. (implementing)

Tier 1.5 — local performance (this branch)

Pin the model in VRAM via Ollama keep_alive to kill cold-reload stalls.
Tune Ollama options — explicit num_ctx (so the larger window in #1/#2 is actually honored) and bounded num_predict. (implementing)

Tier 2 — deeper context

In-RAM semantic retrieval (RAG, no disk). (done) Each captured message is embedded with the already-present nomic-embed-text and held in a capped in-memory MemoryIndex (pure-Python cosine, no numpy). On a /ai question the agent embeds the query, retrieves top-k, drops weak/duplicate hits, and prepends them as a clearly-fenced "recalled context" preamble (never system role — keeps untrusted text from instructing). Embedding runs on a background worker so it can't stall the recv loop; if the embedder is unreachable it degrades to recency-only. Toggle with --no-rag / --rag-top-k.
In-RAM hierarchical compaction. (staged) When over budget, summarize the oldest chunk into a single rolling Msg("system", "earlier: …") instead of dropping it — the Claude Code auto-compaction pattern, kept in RAM.

Tier 3 — latency & throughput (next branch)

Token streaming to the room (incremental chat frames) so replies appear as they generate.
Stable prompt prefix (system + summary + retrieved block in fixed order) for Ollama KV-cache reuse across turns.
Single-flight queue so concurrent /ai calls don't pile threads onto one Ollama instance.

Notes on provenance

All patterns above are grounded in Anthropic's public documentation (context compaction, prompt caching, token-budgeted assembly) and the open Agent SDK — no leaked/proprietary source was used.

3.8 KiB Raw Blame History