Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and
tokens/sec, not VRAM):
- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps
the general model. Auto-selected when chat is Ollama and a coder build is
present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default
that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx,
--num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint
(provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small
model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim)
for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side
env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Give the agent recall of things said beyond the verbatim window, without
breaking the RAM-only philosophy — nothing is persisted to disk.
- MemoryIndex: a capped, in-memory pool of embedded messages with pure-Python
cosine search (no numpy). Retains far more than the rolling transcript so old
lines can be surfaced on demand; oldest evicted past the cap to bound RAM.
- OllamaEmbedder: local embeddings via nomic-embed-text, on by default and
independent of the chat provider (reuses the Ollama host when chat is Ollama).
- Bridge: captured room messages (live + backfilled) are embedded on a
background worker so a slow embedder can't stall frame draining. On a /ai
question the agent retrieves top-k relevant lines, drops weak (<min_score) and
windowed-duplicate hits, and prepends them as a clearly-fenced "recalled
context" preamble — kept at user role, never elevated to system, so untrusted
room text informs without instructing. Falls back to recency-only if the
embedder is unreachable.
- CLI: --no-rag, --embed-model, --embed-host, --rag-top-k.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OllamaProvider now sends keep_alive (default 30m) so the model stays resident
in VRAM between /ai calls instead of cold-reloading, and sets explicit options
(num_ctx 8192, num_predict 512) — Ollama otherwise caps context at 2048, which
would silently truncate the larger backfilled window.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make connecting any model a config step, not a code change:
- models.toml named profiles (api_key_env names an env var, never the key)
- providers gain available_models(); add preflight + --list-models/--check
- /ai list and /ai models in-room; client probes local Ollama for
/ai models when no agent is running, and /ai list hints to summon one
- docs/providers.md provider guide + examples/echo_provider.py
- README: command table, AI section, layout updated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cmd_chat/agent: a headless client that joins a room via SRP, decrypts
broadcasts, and answers /ai <question> through a pluggable model provider
(ollama default + anthropic + openai-compatible + module:Class). Server and
zero-knowledge guarantees unchanged; the agent is just another encrypted client.
Also pin the lets-hack demo to a detached worktree of main (default) so running
it from dev still demos stable main without touching the working checkout.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>