hack-house/docs/ai-perf-plan.md
leetcrypt 26c651e9ac perf(ai): CPU-tuned local inference + qwen2.5-coder sandbox path
Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and
tokens/sec, not VRAM):

- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps
  the general model. Auto-selected when chat is Ollama and a coder build is
  present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default
  that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx,
  --num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint
  (provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small
  model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim)
  for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side
  env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 22:37:59 -07:00


AI agent: CPU-only performance & code-quality plan

Hardware reality: the box serving local models is CPU-only (Intel i5-8350U, 4c/8t, no GPU, 62 GB RAM) and runs Ollama 0.3.9. So we optimize time-to-first-token (prefill cost grows at least linearly with context length) and tokens/sec, not VRAM. GPU knobs (flash attention, KV-cache quant) are no-ops here.

Status

Tier A — high impact / low effort

  1. qwen2.5-coder for the sandbox/code path. (done) qwen2.5-coder:1.5b is pulled and wired as a separate code provider used only by !task; chat keeps the general model. Speed is about the same as a general model of equal size, with better shell/code output. Auto-selected when the chat provider is Ollama and the coder model is present; override with --code-model.
  2. Lower num_ctx to 4096 + expose num_thread. (done) OllamaProvider default num_ctx 8192→4096 (8192 was a GPU-mindset default that inflated prefill time, and therefore TTFT, on CPU); token_budget default 3000→2000 so prompts fit the smaller window. --num-ctx, --num-thread, --num-predict flags added. num_thread is left at Ollama's own default (the physical core count, 4 here); benchmark 4, 6, and 8 threads to find the best setting.
  3. Token streaming. (partial: provider half done) OllamaProvider.stream() now yields deltas from Ollama's stream=True chat endpoint. Still TODO (commit 2): have the agent emit _ai:"stream" delta frames and the Rust client render an in-progress bubble. Without streaming the user waits for the whole reply; with it, perceived latency drops to TTFT, so a slow CPU reply feels live.
  4. Keep model warm + single-flight. (partial) keep_alive already 30m (prevents mid-session reload). OLLAMA_NUM_PARALLEL=1 is a server-side env read by ollama serve, not settable from the agent — set it where Ollama is launched (documented below).
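
The request-side knobs and the streaming path above can be sketched as follows. This is a minimal, stdlib-only illustration, not the actual OllamaProvider code: the helper names `build_chat_payload`, `iter_deltas`, and `stream_chat` are hypothetical, and the base URL assumes Ollama's default local port. It shows how num_ctx, num_thread, num_predict, and keep_alive travel in the `/api/chat` request body, and how deltas can be read from the newline-delimited JSON that Ollama returns when stream is true.

```python
import json
import urllib.request
from typing import Any, Dict, Iterable, Iterator, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    num_ctx: int = 4096,
    num_thread: Optional[int] = None,
    num_predict: Optional[int] = None,
    keep_alive: str = "30m",
    stream: bool = True,
) -> Dict[str, Any]:
    """Build a JSON body for POST /api/chat (hypothetical helper)."""
    options: Dict[str, Any] = {"num_ctx": num_ctx}
    # Unset values are omitted so Ollama applies its own defaults.
    if num_thread is not None:
        options["num_thread"] = num_thread
    if num_predict is not None:
        options["num_predict"] = num_predict
    return {
        "model": model,
        "messages": messages,
        "stream": stream,
        "keep_alive": keep_alive,
        "options": options,
    }


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield content deltas from a newline-delimited JSON chat stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        delta = chunk.get("message", {}).get("content", "")
        if delta:
            yield delta
        if chunk.get("done"):
            break  # final chunk carries stats, no further text


def stream_chat(payload: Dict[str, Any],
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Send the payload and stream deltas (assumes a local Ollama server)."""
    req = urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response object iterates line by line, one JSON object per line.
        yield from iter_deltas(resp)
```

Keeping the parser separate from the HTTP call means the delta logic can be exercised with canned byte lines, without a running server.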

Tier B — code-generation quality

  5. Few-shot in SANDBOX_SYSTEM. (done) 12 request→shell exemplars to anchor the small model's output format.
  6. GBNF constrained output. (blocked on #7) Ollama 0.3.9 only supports format: json, not custom grammars for fenced shell. Needs the upgrade; the existing _extract_commands parser + few-shot cover the gap meanwhile.
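
To illustrate the fallback described above, here is a rough sketch of pulling commands out of fenced shell blocks in a model reply. It is an assumption about what such a parser might look like, not the project's `_extract_commands`: the function name `extract_commands`, the accepted fence tags, and the handling of `$ ` prompts and comment lines are all illustrative choices.

```python
from typing import List

FENCE = "`" * 3  # the three-backtick fence marker
SHELL_TAGS = {"", "sh", "bash", "shell"}  # assumed set of accepted fence tags


def extract_commands(reply: str) -> List[str]:
    """Collect command lines from shell-tagged or untagged fenced blocks."""
    commands: List[str] = []
    in_block = False  # currently inside any fenced block
    keep = False      # the current block is one we want commands from
    for line in reply.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_block:
                tag = stripped[len(FENCE):].strip().lower()
                in_block, keep = True, tag in SHELL_TAGS
            else:
                in_block = False
            continue
        if in_block and keep and stripped and not stripped.startswith("#"):
            # Drop a leading "$ " prompt if the model copied one.
            if stripped.startswith("$ "):
                stripped = stripped[2:].lstrip()
            commands.append(stripped)
    return commands
```

A line-by-line scan is used instead of a single regex so that blocks tagged with another language are skipped cleanly rather than partially matched.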

Tier C — infra / housekeeping

  7. Upgrade Ollama 0.3.9 → current. (manual, user-run) System-wide action that restarts the daemon other projects share, so it is not run automatically. Buys current coder builds, structured-output/grammar support (unblocks #6), bugfixes. CPU speed gains are incremental. Suggested: curl -fsSL https://ollama.com/install.sh | sh.
  8. Matryoshka embedding truncation. (done) nomic-embed-text is MRL-trained; truncate vectors to 256-dim (--embed-dim) for faster pure-Python cosine and less RAM. Query + stored use the same dim, so cosine stays correct.
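
The truncation idea above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the OllamaEmbedder implementation: `truncate_embedding` and `cosine` are hypothetical names, and re-normalizing after the cut is an assumption made here so that truncated vectors stay unit length.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to normalize
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; both vectors must share the same dimension."""
    if len(a) != len(b):
        raise ValueError("query and stored vectors must use the same dim")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)
```

The loop in `cosine` runs over 256 values instead of the full-width vector, which is where the pure-Python speedup comes from. The dimension check makes a mismatch between query and stored vectors fail loudly instead of silently comparing prefixes.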

Server-side env (set where ollama serve runs, e.g. systemd unit or shell)

OLLAMA_NUM_PARALLEL=1      # single interactive user → all cores to one request
OLLAMA_KEEP_ALIVE=30m      # or -1 to pin forever (62 GB RAM is plenty)

Notes

All items are grounded in public sources + the Obsidian vault (research/2026-06-02-*): Q4_K_M is the CPU speed sweet spot, a small num_ctx avoids "context rot", and qwen2.5-coder beats the general model at equal size for code.