# Ideas TODO This file is now prioritized inline. Priority scale: - P0: do next; directly improves current feedback loop - P1: important after the current loop is usable - P2: useful, but only after basics are stable - P3: defer or maybe reject ## P0: Make Integration Tests Easy To Run Status: implemented. Implemented command: ```powershell python -m nightshift.cli integ-test --template tutorial-deaddrop --task TASK-001 ``` It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command. Running integration tests is still too manual. Current process: - install the current version of NightShift - run `python -m nightshift.cli integ-run --template tutorial-deaddrop --setup` - copy the activation line from the output and run it - `cd` into the generated directory - run the task there, because running from the repo root does not find `nightshift.yaml` Recommendation: implement a wrapper command, not just a loose script. Target command: ```powershell python -m nightshift.cli integ-test --template tutorial-deaddrop --task TASK-001 ``` It should: 1. create the integration run 2. set up the venv 3. install NightShift from the current checkout 4. run `nightshift validate` 5. run the selected task from the generated project directory 6. print final status and artifact path Useful variants: ```powershell python -m nightshift.cli integ-test --template tutorial-deaddrop --all python -m nightshift.cli integ-test --template tutorial-deaddrop --task TASK-002 --keep 3 ``` The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly. ## P0/P1: Remove Multi-Candidate Workflow From Default DeadDrop Status: implemented for the default DeadDrop template and tutorial example. Original idea: - The multi-candidate workflow does not add as much as expected. - Keep it as an example, maybe `example-multiagent`. Recommendation: yes. Remove it from the default DeadDrop tutorial. Reason: - DeadDrop is becoming the reliability harness. - Multi-candidate fallback makes artifacts harder to reason about. - It adds model variability while we are still debugging pipeline behavior. Better split: ```text tutorial-deaddrop tutorial-deaddrop-multiagent ``` or: ```text examples/templates/multiagent-fallback ``` Default DeadDrop should be boring: ```text planner -> semantic_context -> context -> implement -> validate -> test -> review ``` Use one strong implementer first. Add fallback only in a separate experiment template. ## P1: Add A Qwen3 / 30B DeadDrop Variant Status: implemented as the default DeadDrop model path using `qwen3-coder:30b`. Original idea: - Use a non-coder model for planner roles. - Try `qwen3.6:27b` for planning. - Use `qwen3-coder:30b` for implementer and code-heavy roles. Recommendation: viable, but make this a variant, not the default. kass reply- No lets make this the default. the qwen3-coder:30b is fast now for me for some reason. Suggested template/config: ```text tutorial-deaddrop-qwen3 ``` Possible role split: - planner: `qwen3.6:27b` - reviewer/debugger: `qwen3.6:27b` - implementer: `qwen3-coder:30b` or exact local 30B coder model name Important: confirm exact model names with: ```powershell ollama list ``` i did its `qwen3-coder:30b` Use 30B where it pays: - first implementation for hard tasks - repair after concrete test failure - schema/database changes - multi-file changes Do not blindly make every stage 30B if it is slow. reply: Its not slow now!`qwen3-coder:30b` ## P2: Expose More Model Parameters Status: implemented for the practical first set. Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to existing `temperature`. Original question: - What else besides temperature is available? - Are any worth optimizing? Likely useful for Ollama: - `temperature` - `num_ctx` - `num_predict` - `seed` - `stop` - maybe `top_p`, `top_k`, `repeat_penalty` Recommendation: add only a small practical set first. Useful config shape: ```yaml temperature: 0.1 num_ctx: 8192 num_predict: 4096 seed: 1 ``` Most useful: - `num_ctx`: larger repo/task context - `num_predict`: caps runaway output - `seed`: reproducibility, if supported consistently - `temperature`: already useful; keep low for code - `stop`: could help enforce file-block or diff-only contracts Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it. reply: yup lets put this in the nightshift.yaml (optional parameters, if they arent in there that's fine, but we should offer them.) ## P1: Add Test Governance For Generated Tests Original idea: - Have a test governance layer for when agents write tests. - A reviewer validates alignment with acceptance criteria. Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default DeadDrop yet. The previous failures proved test-writing agents will: - edit app code - import nonexistent modules - require undeclared dependencies - inspect implementation internals - write tests for future behavior Governance should be deterministic first, model-reviewed second. Deterministic checks: - test-writing stage may only touch `tests/` - tests compile - tests import only allowed public interfaces - tests do not import undeclared dependencies - tests do not define Flask routes or app implementation - test names match current task id or current artifact - no future-task keywords unless accepted by current task AC Then optional model reviewer checks acceptance-criteria alignment. ## P2: Add A Test Analyzer Agent For TDD Original idea: - Analyze tests. - Translate them into direct instructions for the implementer. - Maybe implement using agent YAML definitions without new NightShift features. Recommendation: viable, but defer until generated tests are stable. Possible pipeline: ```text write_tests -> validate_tests -> analyze_tests -> implement ``` Analyzer output should be concrete: ```text Implementation requirements: - create_app(database_path) must return a Flask app. - POST /snippets must return 201 and JSON id. - GET /snippets/ must return persisted fields. Do not modify: - tests/test_task001.py ``` This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all DeadDrop tasks. ## P2/P3: Add A Test Planner Original idea: - A test planner understands acceptance criteria and code. - Provides input to the next stage about constraints and code, especially for non-TDD. Recommendation: maybe, but defer. This overlaps with: - planner - test analyzer - test governance Too many planning-ish stages can make the pipeline bloated and contradictory. If implemented later, keep it focused: ```text test_planner -> write_tests -> test_governance -> implement ``` For now, fold this idea into the future test governance/analyzer work. ## P1: Add Fixed Tests For All DeadDrop Tasks Status: mostly implemented in the template. Current fixed tests: ```text tests/test_task001.py tests/test_task002.py tests/test_task003.py tests/test_task004.py tests/test_task005.py ``` Important design: ```yaml python -m pytest -q tests/test_{task_id_compact}.py ``` This lets all future task tests exist without breaking earlier tasks. Next step: validate these through integration runs, one task at a time. ## P1: Add `nightshift integ-report` Status: implemented as a first-pass artifact summarizer. New idea. Summarize latest integration run across tasks: ```text TASK-001 complete in 1 retry TASK-002 failed at validate_patch Root cause: protected tests modified Artifacts: ... ``` Right now we inspect artifacts manually. NightShift should do more of that. Possible command: ```powershell python -m nightshift.cli integ-report --latest ``` ## P1: Add Task-Test Preflight To `validate` Status: implemented. `nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing. Partially implemented at run time. Current behavior: - task command placeholders can render paths like `tests/test_task002.py` - `run_task` preflight fails before invoking agents if the task-specific test file is missing Better behavior: ```powershell nightshift validate ``` should warn or fail: ```text TASK-003 expects tests/test_task003.py and it exists. TASK-004 expects tests/test_task004.py and it exists. ``` This catches missing fixed tests earlier. ## P2: Add Run Comparison New idea. Useful once comparing 14B vs 30B: ```powershell nightshift compare-runs --latest 5 ``` Show: - model - task - retries - failure stage - final reason - runtime - token estimate This should come after `integ-test` and `integ-report`.