hack-house/docs/ai-perf-plan.md
leetcrypt 26c651e9ac perf(ai): CPU-tuned local inference + qwen2.5-coder sandbox path
Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and
tokens/sec, not VRAM):

- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps
  the general model. Auto-selected when chat is Ollama and a coder build is
  present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default
  that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx,
  --num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint
  (provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small
  model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim)
  for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side
  env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 22:37:59 -07:00


# AI agent: CPU-only performance & code-quality plan
Hardware reality: the box serving local models is **CPU-only** (Intel i5-8350U,
4c/8t, no GPU, 62 GB RAM) and runs Ollama 0.3.9. So we optimize **time-to-first-token**
(TTFT; prefill cost grows with context length) and **tokens/sec**, not VRAM. GPU knobs
(flash attention, KV-cache quant) are no-ops here.
## Status
### Tier A — high impact / low effort
1. **`qwen2.5-coder` for the sandbox/code path.** *(done)* `qwen2.5-coder:1.5b`
pulled and wired as a separate code provider used only by `!task`; chat keeps
the general model. Same speed as the general model, with better shell/code output.
Auto-selected when the chat provider is Ollama and the coder model is present;
override with `--code-model`.
2. **Lower `num_ctx` to 4096 + expose `num_thread`.** *(done)* OllamaProvider
default `num_ctx` 8192→4096 (8192 was a GPU-mindset default that inflated prefill
time and TTFT on CPU); `token_budget` default 3000→2000 to fit the smaller window.
`--num-ctx`, `--num-thread`, and `--num-predict` flags added. When unset, `num_thread`
falls back to Ollama's own default (= physical cores, 4 here); benchmark 4, 6, and 8 threads.
3. **Token streaming.** *(partial — provider half done)* `OllamaProvider.stream()`
now yields deltas from Ollama's `stream=True` chat endpoint. Still TODO (commit 2):
have the agent emit `_ai:"stream"` delta frames and the Rust client render an
in-progress bubble. On CPU, perceived latency is dominated by TTFT, so streaming
will make a slow reply feel live.
4. **Keep model warm + single-flight.** *(partial)* `keep_alive` is already set to 30m
(prevents a mid-session model reload). `OLLAMA_NUM_PARALLEL=1` is a **server-side env**
read by `ollama serve` and cannot be set from the agent; set it where Ollama is
launched (documented below).
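
Items 2 to 4 meet in the request the provider sends. The following is a minimal sketch, assuming Ollama's `/api/chat` request shape (`options.num_ctx`, `options.num_thread`, `options.num_predict`, top-level `keep_alive`) and its newline-delimited JSON stream. `build_chat_payload` and `iter_deltas` are hypothetical names for illustration, not the actual `OllamaProvider` code.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    num_ctx: int = 4096,
    num_thread: Optional[int] = None,
    num_predict: Optional[int] = None,
    keep_alive: str = "30m",
    stream: bool = True,
) -> Dict[str, Any]:
    """Build a chat request body with CPU-oriented defaults (hypothetical helper)."""
    options: Dict[str, Any] = {"num_ctx": num_ctx}
    # Only send the knobs the caller set, so Ollama keeps its own defaults otherwise.
    if num_thread is not None:
        options["num_thread"] = num_thread
    if num_predict is not None:
        options["num_predict"] = num_predict
    return {
        "model": model,
        "messages": messages,
        "stream": stream,
        "options": options,
        "keep_alive": keep_alive,
    }


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from a stream of NDJSON chat chunks (hypothetical helper)."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        delta = chunk.get("message", {}).get("content", "")
        if delta:
            yield delta
        if chunk.get("done"):
            break  # the final chunk carries done=true
```

In practice `lines` would be the line iterator of the streaming HTTP response. Keeping the parser separate from the transport makes it testable without a running Ollama.
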
### Tier B — code-generation quality
5. **Few-shot in `SANDBOX_SYSTEM`.** *(done)* 12 request→shell exemplars to anchor
the small model's output format.
6. **GBNF constrained output.** *(blocked on #7)* Ollama 0.3.9 only supports
`format: json`, not custom grammars for fenced shell. Needs the upgrade; the
existing `_extract_commands` parser + few-shot cover the gap meanwhile.
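
The few-shot plus parser combination from items 5 and 6 can be sketched as below. This is an illustration under stated assumptions: `render_few_shot` and `extract_commands` are hypothetical stand-ins (the real `_extract_commands` is not shown in this document), and the two exemplars are made up rather than taken from the 12 in `SANDBOX_SYSTEM`.

```python
import re
from typing import List, Sequence, Tuple

FENCE = "`" * 3  # built at runtime so this example nests cleanly inside Markdown

# Illustrative exemplars only (assumption); not the real SANDBOX_SYSTEM contents.
EXEMPLARS: Sequence[Tuple[str, str]] = (
    ("show disk usage", "df -h"),
    ("list files including hidden ones", "ls -la"),
)

# Matches an untagged or bash/sh/shell-tagged fenced block and captures its body.
_BLOCK_RE = re.compile(
    FENCE + r"(?:bash|sh|shell)?[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def render_few_shot(pairs: Sequence[Tuple[str, str]]) -> str:
    """Format request/command pairs in the fenced shape the model should imitate."""
    parts = []
    for request, command in pairs:
        parts.append(f"Request: {request}\n{FENCE}bash\n{command}\n{FENCE}")
    return "\n\n".join(parts)


def extract_commands(reply: str) -> List[str]:
    """Collect non-empty, non-comment lines from every fenced shell block."""
    commands: List[str] = []
    for body in _BLOCK_RE.findall(reply):
        for line in body.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                commands.append(line)
    return commands
```

Because the exemplars use the same fence shape the parser expects, the rendered few-shot text round-trips through `extract_commands`, which is a cheap sanity check that prompt and parser agree on the format.
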
### Tier C — infra / housekeeping
7. **Upgrade Ollama 0.3.9 → current.** *(manual, user-run)* System-wide action that
restarts the daemon other projects share — not run automatically. Buys current
coder builds, structured-output/grammar support (unblocks #6), bugfixes. CPU
speed gains are incremental. Suggested: `curl -fsSL https://ollama.com/install.sh | sh`.
8. **Matryoshka embedding truncation.** *(done)* nomic-embed-text is MRL-trained;
truncate vectors to 256-dim (`--embed-dim`) for faster pure-Python cosine and
less RAM. Query + stored use the same dim, so cosine stays correct.
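
The truncation in item 8 can be sketched in pure Python as follows. This is a minimal sketch, not the `OllamaEmbedder` code: `truncate_embedding` and `cosine` are hypothetical names, and re-normalizing after truncation is an assumption added here for convenience.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; both vectors must share the same (truncated) dim."""
    if len(a) != len(b):
        raise ValueError("query and stored vectors must use the same dim")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The length check makes the "same dim" requirement explicit: a 256-dim query compared against a full-length stored vector fails loudly instead of silently scoring on mismatched prefixes.
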
## Server-side env (set where `ollama serve` runs, e.g. systemd unit or shell)
```
OLLAMA_NUM_PARALLEL=1 # single interactive user → all cores to one request
OLLAMA_KEEP_ALIVE=30m # or -1 to pin forever (62 GB RAM is plenty)
```
## Notes
All items are grounded in public sources plus the Obsidian vault (`research/2026-06-02-*`):
Q4_K_M is the CPU speed sweet spot, a small `num_ctx` avoids "context rot", and
qwen2.5-coder beats the general model at equal size for code.