hack-house/docs/ai-perf-plan.md
leetcrypt 69bce5ead8 feat(ai): stream agent replies token-by-token to the room
Closes the cross-language half of token streaming (perf-plan A3). On the
CPU-only box perceived latency is time-to-first-token, so showing the reply
as it generates makes a slow model feel live.

- Agent: OllamaProvider.stream() runs on a worker thread; bridge relays
  cumulative previews as throttled (~5/sec) `_ai:"stream"` control frames,
  then a `done` frame clears the preview as the final persisted chat message
  is posted. Providers without stream() fall back to blocking complete().
- Rust client: new Net::AiStream variant + parse_ai branch; App.ai_stream
  map holds the in-progress text per agent; draw_chat renders it as a dim,
  italic preview bubble below history. Cleared on done and on agent leave.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 22:42:08 -07:00

# AI agent: CPU-only performance & code-quality plan
Hardware reality: the box serving local models is **CPU-only** (Intel i5-8350U,
4c/8t, no GPU, 62 GB RAM), Ollama 0.3.9. So we optimize **time-to-first-token**
(prefill is O(context)) and **tokens/sec**, not VRAM. GPU knobs (flash attention,
KV-cache quant) are no-ops here.
## Status
### Tier A — high impact / low effort
1. **`qwen2.5-coder` for the sandbox/code path.** *(done)* `qwen2.5-coder:1.5b`
is pulled and wired as a separate code provider used only by `!task`; chat keeps
the general model. It runs at the same speed and produces better shell and code
output. It is auto-selected when the chat provider is Ollama and the coder model
is present; override with `--code-model`.
2. **Lower `num_ctx` to 4096 + expose `num_thread`.** *(done)* The OllamaProvider
default `num_ctx` dropped 8192→4096 (8192 was a GPU-mindset default that inflated
TTFT on CPU), and the `token_budget` default dropped 3000→2000 to fit. The
`--num-ctx`, `--num-thread`, and `--num-predict` flags were added. `num_thread`
defaults to Ollama's own choice (= physical cores, 4 here); benchmark 4, 6, and 8
threads to pick a value.
3. **Token streaming.** *(done)* `OllamaProvider.stream()` yields deltas from
Ollama's `stream=True` chat endpoint; the agent relays them as throttled
(~5/sec) cumulative `_ai:"stream"` frames off a worker thread, and the Rust
client renders a dim in-progress preview bubble (cleared by a `done` frame
when the final, persisted message lands). On CPU, perceived latency is TTFT —
this makes a slow reply feel live.
4. **Keep model warm + single-flight.** *(partial)* `keep_alive` is already 30m,
which prevents a mid-session reload. `OLLAMA_NUM_PARALLEL=1` is a **server-side
env** variable read by `ollama serve` and cannot be set from the agent, so set it
where Ollama is launched (documented below).
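
The throttled cumulative relay in item 3 can be sketched roughly as below. This is an illustration, not the agent's actual bridge code: the function name `throttled_previews`, the `"text"` key, and the exact shape of the `done` frame are assumptions; only the `_ai:"stream"` marker and the ~5/sec rate come from the plan.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled_previews(
    deltas: Iterable[str],
    min_interval: float = 0.2,  # ~5 frames/sec, the rate named in the plan
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[dict]:
    """Turn token deltas into cumulative preview frames, at most one per interval.

    Hypothetical frame shape: {"_ai": "stream", "text": ...} for previews and
    {"_ai": "done", "text": ...} for the final frame.
    """
    text = ""
    last_sent = None
    for delta in deltas:
        text += delta
        now = clock()
        # Always send the first frame, then only when the interval has elapsed.
        if last_sent is None or now - last_sent >= min_interval:
            last_sent = now
            yield {"_ai": "stream", "text": text}
    # The final frame carries the complete reply so the client can drop the preview.
    yield {"_ai": "done", "text": text}
```

Because every preview frame is cumulative, skipping intermediate frames loses nothing: the client simply replaces the preview text with the newest frame it receives.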
### Tier B — code-generation quality
5. **Few-shot in `SANDBOX_SYSTEM`.** *(done)* 12 request→shell exemplars to anchor
the small model's output format.
6. **GBNF constrained output.** *(blocked on #7)* Ollama 0.3.9 only supports
`format: json`, not custom grammars for fenced shell. Needs the upgrade; the
existing `_extract_commands` parser + few-shot cover the gap meanwhile.
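
For context on item 6, a fallback parser of this kind might look like the sketch below. It is a hypothetical stand-in, not the project's `_extract_commands`: the function name, the accepted fence tags, and the handling of `#` comments and `$ ` prompts are all assumptions.

```python
FENCE = "`" * 3  # built programmatically so the example nests cleanly in markdown
SHELL_TAGS = {"", "sh", "bash", "shell"}  # assumed set of accepted fence tags


def extract_fenced_commands(reply: str) -> list[str]:
    """Collect command lines from shell-tagged (or untagged) fenced blocks."""
    commands = []
    in_block = False  # currently inside any fenced block
    keep = False      # the current block carries a shell-like tag
    for raw in reply.splitlines():
        line = raw.strip()
        if line.startswith(FENCE):
            if not in_block:
                tag = line[len(FENCE):].strip().lower()
                in_block, keep = True, tag in SHELL_TAGS
            else:
                in_block = False
            continue
        # Skip prose, non-shell blocks, blank lines, and shell comments.
        if not (in_block and keep) or not line or line.startswith("#"):
            continue
        # Drop a leading "$ " prompt if the model echoed one.
        commands.append(line[2:] if line.startswith("$ ") else line)
    return commands
```

A line-based scan like this tolerates extra prose around the fences, which matters when a small model does not follow the few-shot format exactly.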
### Tier C — infra / housekeeping
7. **Upgrade Ollama 0.3.9 → current.** *(manual, user-run)* System-wide action that
restarts the daemon other projects share — not run automatically. Buys current
coder builds, structured-output/grammar support (unblocks #6), bugfixes. CPU
speed gains are incremental. Suggested: `curl -fsSL https://ollama.com/install.sh | sh`.
8. **Matryoshka embedding truncation.** *(done)* nomic-embed-text is MRL-trained;
truncate vectors to 256-dim (`--embed-dim`) for faster pure-Python cosine and
less RAM. Query + stored use the same dim, so cosine stays correct.
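
The truncation in item 8 can be illustrated with a small pure-Python sketch. The helper names `truncate_embedding` and `cosine` are hypothetical, and re-normalizing after truncation is an assumption made for convenience (cosine itself is scale-invariant, so it stays correct either way as long as both vectors use the same dimension).

```python
import math
from typing import Sequence


def truncate_embedding(vec: Sequence[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to normalize
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError("query and stored vectors must use the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)
```

Truncating both the query and the stored vectors with the same helper keeps comparisons consistent, and each cosine call then iterates over 256 components instead of the full vector.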
## Server-side env (set where `ollama serve` runs, e.g. systemd unit or shell)
```
OLLAMA_NUM_PARALLEL=1 # single interactive user → all cores to one request
OLLAMA_KEEP_ALIVE=30m # or -1 to pin forever (62 GB RAM is plenty)
```
## Notes
All of the above is grounded in public sources and the Obsidian vault
(`research/2026-06-02-*`): Q4_K_M is the CPU speed sweet spot, a small `num_ctx`
avoids "context rot", and qwen2.5-coder beats the general model at equal size
for code.