hack-house

Trilltechnician/hack-house

Fork 0

Commit Graph

Author	SHA1	Message	Date
leetcrypt	69bce5ead8	feat(ai): stream agent replies token-by-token to the room Closes the cross-language half of token streaming (perf-plan A3). On the CPU-only box perceived latency is time-to-first-token, so showing the reply as it generates makes a slow model feel live. - Agent: OllamaProvider.stream() runs on a worker thread; bridge relays cumulative previews as throttled (~5/sec) `_ai:"stream"` control frames, then a `done` frame clears the preview as the final persisted chat message is posted. Providers without stream() fall back to blocking complete(). - Rust client: new Net::AiStream variant + parse_ai branch; App.ai_stream map holds the in-progress text per agent; draw_chat renders it as a dim, italic preview bubble below history. Cleared on done and on agent leave. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-02 22:42:08 -07:00
leetcrypt	26c651e9ac	perf(ai): CPU-tuned local inference + qwen2.5-coder sandbox path Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and tokens/sec, not VRAM): - Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps the general model. Auto-selected when chat is Ollama and a coder build is present, override with --code-model. - OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx, --num-thread, --num-predict. token_budget default 3000→2000 to fit. - OllamaProvider.stream() generator over Ollama's stream=True chat endpoint (provider half of token streaming; agent/Rust rendering is a follow-up). - Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small model's fenced-command output. - Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim) for faster pure-Python cosine and less RAM; query+stored share the dim. - docs/ai-perf-plan.md records all 8 items with status and the server-side env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-02 22:37:59 -07:00

Author

SHA1

Message

Date

leetcrypt

69bce5ead8

feat(ai): stream agent replies token-by-token to the room

Closes the cross-language half of token streaming (perf-plan A3). On the
CPU-only box perceived latency is time-to-first-token, so showing the reply
as it generates makes a slow model feel live.

- Agent: OllamaProvider.stream() runs on a worker thread; bridge relays
  cumulative previews as throttled (~5/sec) `_ai:"stream"` control frames,
  then a `done` frame clears the preview as the final persisted chat message
  is posted. Providers without stream() fall back to blocking complete().
- Rust client: new Net::AiStream variant + parse_ai branch; App.ai_stream
  map holds the in-progress text per agent; draw_chat renders it as a dim,
  italic preview bubble below history. Cleared on done and on agent leave.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-06-02 22:42:08 -07:00

leetcrypt

26c651e9ac

perf(ai): CPU-tuned local inference + qwen2.5-coder sandbox path

Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and
tokens/sec, not VRAM):

- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps
  the general model. Auto-selected when chat is Ollama and a coder build is
  present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default
  that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx,
  --num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint
  (provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small
  model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim)
  for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side
  env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-06-02 22:37:59 -07:00

2 Commits