Add the VirtualBox sandbox design spec (headless 4th backend + share-an-
appliance GUI mode with detect-first install), the crypto pay-to-join gate
design, and the save/load PoC writeup with its demo/film driver scripts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes the cross-language half of token streaming (perf-plan A3). On the
CPU-only box perceived latency is time-to-first-token, so showing the reply
as it generates makes a slow model feel live.
- Agent: OllamaProvider.stream() runs on a worker thread; bridge relays
cumulative previews as throttled (~5/sec) `_ai:"stream"` control frames,
then a `done` frame clears the preview as the final persisted chat message
is posted. Providers without stream() fall back to blocking complete().
- Rust client: new Net::AiStream variant + parse_ai branch; App.ai_stream
map holds the in-progress text per agent; draw_chat renders it as a dim,
italic preview bubble below history. Cleared on done and on agent leave.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tier A/B/C wins for the CPU-only Ollama box (no GPU → optimize TTFT and
tokens/sec, not VRAM):
- Separate qwen2.5-coder provider for the sandbox `!task` path; chat keeps
the general model. Auto-selected when chat is Ollama and a coder build is
present, override with --code-model.
- OllamaProvider num_ctx default 8192→4096 (8192 was a GPU-mindset default
that inflates prefill/TTFT on CPU); expose num_thread; add --num-ctx,
--num-thread, --num-predict. token_budget default 3000→2000 to fit.
- OllamaProvider.stream() generator over Ollama's stream=True chat endpoint
(provider half of token streaming; agent/Rust rendering is a follow-up).
- Few-shot request→shell exemplars in SANDBOX_SYSTEM to anchor the small
model's fenced-command output.
- Matryoshka embedding truncation: OllamaEmbedder truncate_dim=256 (--embed-dim)
for faster pure-Python cosine and less RAM; query+stored share the dim.
- docs/ai-perf-plan.md records all 8 items with status and the server-side
env (OLLAMA_NUM_PARALLEL=1, keep_alive) that must be set where ollama serve runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Give the agent recall of things said beyond the verbatim window, without
breaking the RAM-only philosophy — nothing is persisted to disk.
- MemoryIndex: a capped, in-memory pool of embedded messages with pure-Python
cosine search (no numpy). Retains far more than the rolling transcript so old
lines can be surfaced on demand; oldest evicted past the cap to bound RAM.
- OllamaEmbedder: local embeddings via nomic-embed-text, on by default and
independent of the chat provider (reuses the Ollama host when chat is Ollama).
- Bridge: captured room messages (live + backfilled) are embedded on a
background worker so a slow embedder can't stall frame draining. On a /ai
question the agent retrieves top-k relevant lines, drops weak (<min_score) and
windowed-duplicate hits, and prepends them as a clearly-fenced "recalled
context" preamble — kept at user role, never elevated to system, so untrusted
room text informs without instructing. Falls back to recency-only if the
embedder is unreachable.
- CLI: --no-rag, --embed-model, --embed-host, --rag-top-k.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Roadmap for deepening the /ai agent's conversational context while keeping
the RAM-only philosophy, plus Ollama latency wins. Marks Tier 1 (backfill,
token-budget window) and the perf tuning as in-scope now; RAG and in-RAM
compaction staged next. Grounded in public Anthropic docs, not leaked source.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bump from 960px/12fps to 1280px/15fps with floyd-steinberg dithering
for crisper, retina-legible terminal text — 7.4MB, under GitHub's 10MB
inline-render limit. Exceeds the upstream example.gif (800px/15fps).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4.7MB looping GIF rendered from the latest demo capture (alice+bob
sharing a multipass box: summon, drive, per-user sudo).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make connecting any model a config step, not a code change:
- models.toml named profiles (api_key_env names an env var, never the key)
- providers gain available_models(); add preflight + --list-models/--check
- /ai list and /ai models in-room; client probes local Ollama for
/ai models when no agent is running, and /ai list hints to summon one
- docs/providers.md provider guide + examples/echo_provider.py
- README: command table, AI section, layout updated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Begin the coven evolution of cmd-chat (see docs/spec-collaborative-sandbox.md):
a Rust/ratatui client for the unchanged Python Sanic server, plus the
multi-user + zero-knowledge groundwork.
P0 — crypto parity (the spec's #1 risk), proven three ways:
- Hand-rolled SRP-6a (NG_2048, SHA-256, rfc5054 padding) matching pysrp
byte-for-byte, incl. the fixed b"chat" SRP identity and minimal-vs-256B
width quirks. Golden-vector unit test + offline selftest.
- Live handshake against the running server (H_AMK verified).
- Cross-language E2E: Python client decrypts a Rust-encrypted Fernet message.
P2 — multi-user coven (server):
- CMD_CHAT_MAX_USERS capacity cap (default 4, infra-for-more).
- Authoritative roster + user_joined broadcasts.
- Free the slot/username on ws disconnect (was held until 1h stale sweep).
Also: fix requirements.txt (was UTF-16, unparseable by pip).
coven/ : Rust crate (crypto.rs proven; main.rs spike CLI: selftest/handshake/srpm)
docs/ : full feature spec for the 6 requested features.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>