Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and tokens/sec, not VRAM):

- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps the general model. Auto-selected when chat is Ollama and a coder build is present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx, --num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint (provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim) for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3.1 KiB
AI agent: CPU-only performance & code-quality plan
Hardware reality: the box serving local models is CPU-only (Intel i5-8350U, 4c/8t, no GPU, 62 GB RAM), Ollama 0.3.9. So we optimize time-to-first-token (prefill is O(context)) and tokens/sec, not VRAM. GPU knobs (flash attention, KV-cache quant) are no-ops here.
Status
Tier A — high impact / low effort
1. `qwen2.5-coder` for the sandbox/code path. (done) `qwen2.5-coder:1.5b` pulled and wired as a separate code provider used only by `!task`; chat keeps the general model. Same speed, better shell/code. Auto-selected when the chat provider is Ollama and the coder model is present; override with `--code-model`.
2. Lower `num_ctx` to 4096 + expose `num_thread`. (done) OllamaProvider default `num_ctx` 8192→4096 (8192 was a GPU-mindset default that inflated TTFT on CPU); `token_budget` default 3000→2000 to fit. `--num-ctx`, `--num-thread`, `--num-predict` flags added. `num_thread` defaults to Ollama's own (= physical cores, 4 here); benchmark 4/6/8.
3. Token streaming. (partial: provider half done) `OllamaProvider.stream()` now yields deltas from Ollama's `stream=True` chat endpoint. Still TODO (commit 2): have the agent emit `_ai:"stream"` delta frames and the Rust client render an in-progress bubble. On CPU, perceived latency is TTFT, so this will make a slow reply feel live.
4. Keep model warm + single-flight. (partial) `keep_alive` already 30m (prevents mid-session reload). `OLLAMA_NUM_PARALLEL=1` is a server-side env read by `ollama serve`, not settable from the agent; set it where Ollama is launched (documented below).
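The `num_ctx`/`num_thread` options and the streaming item above can be sketched as a request body plus a delta parser. This is a minimal sketch, not the project's `OllamaProvider`: `build_chat_payload` and `iter_deltas` are hypothetical names. It assumes Ollama's `/api/chat` endpoint takes `options.num_ctx`, `options.num_thread`, `options.num_predict` and `keep_alive`, and that with `stream: true` it answers with newline-delimited JSON objects carrying `message.content` and a final `done: true`.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    num_ctx: int = 4096,
    num_thread: Optional[int] = None,
    num_predict: Optional[int] = None,
    keep_alive: str = "30m",
    stream: bool = True,
) -> Dict[str, Any]:
    """Build the JSON body for POST /api/chat (hypothetical helper)."""
    options: Dict[str, Any] = {"num_ctx": num_ctx}
    # Leave num_thread / num_predict out unless set, so Ollama's own
    # defaults apply.
    if num_thread is not None:
        options["num_thread"] = num_thread
    if num_predict is not None:
        options["num_predict"] = num_predict
    return {
        "model": model,
        "messages": messages,
        "stream": stream,
        "keep_alive": keep_alive,
        "options": options,
    }


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from an NDJSON chat stream; stop at done=true."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        delta = chunk.get("message", {}).get("content", "")
        if delta:
            yield delta
        if chunk.get("done"):
            break
```

Against a live server the lines would come from the HTTP response body, for example by iterating the response of a POST to `/api/chat`; that call is left out so the sketch runs offline. Yielding each delta as soon as its line arrives is what lets the first token reach the UI at TTFT instead of at end of generation.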
Tier B — code-generation quality
5. Few-shot in `SANDBOX_SYSTEM`. (done) 1–2 request→shell exemplars to anchor the small model's output format.
6. GBNF constrained output. (blocked on #7) Ollama 0.3.9 only supports `format: json`, not custom grammars for fenced shell. Needs the upgrade; the existing `_extract_commands` parser + few-shot cover the gap meanwhile.
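The few-shot plus parser fallback can be illustrated roughly as follows. This is a sketch under assumptions: the exemplars, `build_system_prompt` and `extract_fenced_commands` are all hypothetical, since neither the real `SANDBOX_SYSTEM` text nor the `_extract_commands` implementation appears in this plan. The fence string is built with `"`" * 3` only to keep literal triple backticks out of the example.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # a Markdown code fence

# Hypothetical request -> shell exemplars (not the project's actual ones).
FEW_SHOT: List[Tuple[str, str]] = [
    ("list the five largest files here", "du -ah . | sort -rh | head -n 5"),
    ("count lines in every Python file", "find . -name '*.py' | xargs wc -l"),
]


def build_system_prompt(base: str) -> str:
    """Append the exemplars to a base system prompt, one fenced block each."""
    parts = [base.rstrip(), "", "Examples:"]
    for request, command in FEW_SHOT:
        parts.append(f"Request: {request}")
        parts.append(f"{FENCE}sh\n{command}\n{FENCE}")
    return "\n".join(parts)


# Matches a fenced block with an optional sh/bash/shell language tag.
_FENCED = re.compile(
    FENCE + r"(?:sh|bash|shell)?[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_fenced_commands(reply: str) -> List[str]:
    """Return non-empty, non-comment lines found inside fenced blocks."""
    commands: List[str] = []
    for body in _FENCED.findall(reply):
        for line in body.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                commands.append(line)
    return commands
```

Showing the model the exact fenced shape it should produce means a tolerant regex parser is usually enough, which is the gap-filling role the plan gives to few-shot until grammar-constrained output is available.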
Tier C — infra / housekeeping
7. Upgrade Ollama 0.3.9 → current. (manual, user-run) System-wide action that restarts the daemon other projects share, so it is not run automatically. Buys current coder builds, structured-output/grammar support (unblocks #6), bugfixes. CPU speed gains are incremental. Suggested: `curl -fsSL https://ollama.com/install.sh | sh`.
8. Matryoshka embedding truncation. (done) nomic-embed-text is MRL-trained; truncate vectors to 256-dim (`--embed-dim`) for faster pure-Python cosine and less RAM. Query + stored use the same dim, so cosine stays correct.
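The truncation step might look like the sketch below. It is a plain slice-and-renormalize under assumptions: `truncate_embedding` and `cosine` are hypothetical names, and whether `OllamaEmbedder` renormalizes after slicing is not stated in this plan (cosine is scale-invariant, so the renormalization here is a convenience, not a requirement).

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Pure-Python cosine similarity; both vectors must share a dimension."""
    if len(a) != len(b):
        raise ValueError("query and stored vectors must use the same dim")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)
```

The length check makes the "same dim" requirement explicit: a vector stored at one dimension cannot be compared with a query truncated to another, so changing the dimension means re-truncating or re-embedding the stored vectors.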
Server-side env (set where ollama serve runs, e.g. systemd unit or shell)
OLLAMA_NUM_PARALLEL=1 # single interactive user → all cores to one request
OLLAMA_KEEP_ALIVE=30m # or -1 to pin forever (62 GB RAM is plenty)
Notes
All items are grounded in public sources plus the Obsidian vault (research/2026-06-02-*):
Q4_K_M is the CPU speed sweet spot, a small num_ctx beats "context rot", and
qwen2.5-coder beats the general model at equal size for code.