Closes the cross-language half of token streaming (perf-plan A3). On the CPU-only box perceived latency is time-to-first-token, so showing the reply as it generates makes a slow model feel live.

- Agent: OllamaProvider.stream() runs on a worker thread; bridge relays cumulative previews as throttled (~5/sec) `_ai:"stream"` control frames, then a `done` frame clears the preview as the final persisted chat message is posted. Providers without stream() fall back to blocking complete().
- Rust client: new Net::AiStream variant + parse_ai branch; App.ai_stream map holds the in-progress text per agent; draw_chat renders it as a dim, italic preview bubble below history. Cleared on done and on agent leave.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AI agent: CPU-only performance & code-quality plan
Hardware reality: the box serving local models is CPU-only (Intel i5-8350U, 4c/8t, no GPU, 62 GB RAM), Ollama 0.3.9. So we optimize time-to-first-token (prefill is O(context)) and tokens/sec, not VRAM. GPU knobs (flash attention, KV-cache quant) are no-ops here.
Status
Tier A — high impact / low effort
- `qwen2.5-coder` for the sandbox/code path. (done) `qwen2.5-coder:1.5b` pulled and wired as a separate code provider used only by `!task`; chat keeps the general model. Same speed, better shell/code. Auto-selected when the chat provider is Ollama and the coder model is present; override with `--code-model`.
- Lower `num_ctx` to 4096 + expose `num_thread`. (done) OllamaProvider default `num_ctx` 8192→4096 (8192 was a GPU-mindset default that inflated TTFT on CPU); `token_budget` default 3000→2000 to fit. `--num-ctx`, `--num-thread`, `--num-predict` flags added. `num_thread` defaults to Ollama's own (= physical cores, 4 here); benchmark 4/6/8.
- Token streaming. (done) `OllamaProvider.stream()` yields deltas from Ollama's `stream=True` chat endpoint; the agent relays them as throttled (~5/sec) cumulative `_ai:"stream"` frames off a worker thread, and the Rust client renders a dim in-progress preview bubble (cleared by a `done` frame when the final, persisted message lands). On CPU, perceived latency is TTFT, so this makes a slow reply feel live.
- Keep model warm + single-flight. (partial) `keep_alive` already 30m (prevents mid-session reload). `OLLAMA_NUM_PARALLEL=1` is a server-side env read by `ollama serve`, not settable from the agent; set it where Ollama is launched (documented below).
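
The throttled cumulative relay described in the token-streaming item can be sketched roughly as follows. This is a minimal illustration, not the agent's actual bridge code: `relay_stream`, the `send` callback, the `text` field, and the exact shape of the `done` frame are assumptions made for the example; only the `_ai:"stream"` marker, the `done` frame, and the ~5/sec rate come from the plan.

```python
import time
from typing import Callable, Iterable


def relay_stream(
    deltas: Iterable[str],
    send: Callable[[dict], None],
    min_interval: float = 0.2,  # 0.2 s between frames is roughly 5 frames/sec
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Relay token deltas as throttled cumulative previews.

    Returns the full text so the caller can post the final, persisted message.
    """
    text = ""
    last_sent = None
    for delta in deltas:
        text += delta
        now = clock()
        # Send the first delta immediately (this is what improves perceived
        # TTFT), then at most one preview per interval.
        if last_sent is None or now - last_sent >= min_interval:
            send({"_ai": "stream", "text": text})  # "text" is a hypothetical field
            last_sent = now
    # Assumed frame shape: tells the client to clear its preview bubble.
    send({"_ai": "done"})
    return text
```

Because each preview is cumulative, a dropped or skipped frame costs nothing: the next frame carries the whole text so far. In the agent this loop would presumably run on the worker thread, with `send` handing frames to the bridge.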
Tier B — code-generation quality
- Few-shot in `SANDBOX_SYSTEM`. (done) 1–2 request→shell exemplars to anchor the small model's output format.
- GBNF constrained output. (blocked on #7) Ollama 0.3.9 only supports `format: json`, not custom grammars for fenced shell. Needs the upgrade; the existing `_extract_commands` parser + few-shot cover the gap meanwhile.
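
To make the few-shot plus parser combination concrete, here is a small sketch of how request→shell exemplars and a fenced-block parser can work together. It is not the project's `_extract_commands` or its real `SANDBOX_SYSTEM` text: `FEW_SHOT`, `extract_commands`, and the exemplar wording are hypothetical, and the fence string is built indirectly only so the example can sit inside a fenced block itself.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly on purpose

# Hypothetical exemplars: each request is answered with exactly one fenced
# shell block, which anchors the small model's output format.
FEW_SHOT = (
    "Request: list files by size\n"
    f"{FENCE}sh\nls -lS\n{FENCE}\n"
    "Request: count lines in every Python file\n"
    f"{FENCE}sh\nfind . -name '*.py' | xargs wc -l\n{FENCE}\n"
)

# Matches a fence with an optional sh/bash/shell tag and captures its body.
_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"(?:sh|bash|shell)?[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_commands(reply: str) -> list[str]:
    """Return the non-empty, non-comment lines of every fenced shell block."""
    commands = []
    for body in _BLOCK_RE.findall(reply):
        for line in body.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                commands.append(line)
    return commands
```

A tolerant parser like this only recovers commands when the model follows the exemplar format, which is the gap a grammar-constrained output would close.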
Tier C — infra / housekeeping
- Upgrade Ollama 0.3.9 → current. (manual, user-run) System-wide action that restarts the daemon other projects share, so it is not run automatically. Buys current coder builds, structured-output/grammar support (unblocks #6), bugfixes. CPU speed gains are incremental. Suggested: `curl -fsSL https://ollama.com/install.sh | sh`.
- Matryoshka embedding truncation. (done) nomic-embed-text is MRL-trained; truncate vectors to 256-dim (`--embed-dim`) for faster pure-Python cosine and less RAM. Query + stored use the same dim, so cosine stays correct.
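
The truncation idea can be illustrated with a short pure-Python sketch. The function names are hypothetical and this is not the agent's code; re-normalizing after truncation is an assumption made here for convenience (cosine similarity is scale-invariant, so it is not strictly required).

```python
import math


def truncate_embedding(vec: list[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)


def cosine(a: list[float], b: list[float]) -> float:
    """Pure-Python cosine similarity; both vectors must share one dimension."""
    if len(a) != len(b):
        raise ValueError("query and stored vectors must use the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

The dimension check reflects the point made in the list above: the comparison is only meaningful when the query and the stored vectors were truncated to the same length. The cost of each comparison scales with the vector length, which is where the speed and RAM savings come from.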
Server-side env (set where `ollama serve` runs, e.g. systemd unit or shell)
OLLAMA_NUM_PARALLEL=1 # single interactive user → all cores to one request
OLLAMA_KEEP_ALIVE=30m # or -1 to pin forever (62 GB RAM is plenty)
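
In contrast to the server-side variables above, the Tier A knobs (`num_ctx`, `num_thread`, `num_predict`, `keep_alive`) can travel with each request to Ollama's chat endpoint. The sketch below shows one way such a request body might be assembled. `build_chat_payload` is a hypothetical helper, not the project's OllamaProvider; the defaults mirror the values stated in this plan.

```python
from typing import Optional


def build_chat_payload(
    model: str,
    messages: list[dict],
    num_ctx: int = 4096,                # plan default after the 8192 -> 4096 change
    num_thread: Optional[int] = None,   # None leaves Ollama's own default in place
    num_predict: Optional[int] = None,  # None leaves the output length uncapped here
    keep_alive: str = "30m",
    stream: bool = True,
) -> dict:
    """Assemble a JSON-serializable body for a chat request to Ollama."""
    options: dict = {"num_ctx": num_ctx}
    # Only include optional knobs that were set explicitly, so the server
    # default applies otherwise.
    if num_thread is not None:
        options["num_thread"] = num_thread
    if num_predict is not None:
        options["num_predict"] = num_predict
    return {
        "model": model,
        "messages": messages,
        "stream": stream,
        "keep_alive": keep_alive,
        "options": options,
    }
```

Leaving `num_thread` out of the body unless it is set makes the 4/6/8 benchmark easy to run: each run passes one explicit value, and the baseline passes none.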
Notes
All grounded in public sources + the Obsidian vault (research/2026-06-02-*):
Q4_K_M is the CPU speed sweet spot, small num_ctx beats "context rot", and
qwen2.5-coder beats the general model at equal size for code.