Closes the cross-language half of token streaming (perf-plan A3). On the
CPU-only box perceived latency is time-to-first-token, so showing the reply
as it generates makes a slow model feel live.
- Agent: OllamaProvider.stream() runs on a worker thread; bridge relays
cumulative previews as throttled (~5/sec) `_ai:"stream"` control frames,
then a `done` frame clears the preview as the final persisted chat message
is posted. Providers without stream() fall back to blocking complete().
- Rust client: new Net::AiStream variant + parse_ai branch; App.ai_stream
map holds the in-progress text per agent; draw_chat renders it as a dim,
italic preview bubble below history. Cleared on done and on agent leave.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and
tokens/sec, not VRAM):
- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps
the general model. Auto-selected when chat is Ollama and a coder build is
present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default
that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx,
--num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint
(provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small
model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim)
for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side
env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>