mirror of
https://github.com/khodges42/nightShift.git
synced 2026-06-14 18:18:36 +00:00
- Add `integ-test` to create, set up, validate, and run integration template tasks
- Add `integ-report` to summarize latest integration run artifacts
- Switch default pastebin template from model fallback to single `qwen3-coder:30b`
- Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
- Add `nightshift validate` preflight for task-specific test files
- Update pastebin docs, config reference, and ideas tracking
- Add tests for integration helpers, task-test validation, config parsing, and template expectations
# Ideas TODO

This file is now prioritized inline. Priority scale:

- P0: do next; directly improves current feedback loop
- P1: important after the current loop is usable
- P2: useful, but only after basics are stable
- P3: defer or maybe reject

## P0: Make Integration Tests Easy To Run

Status: implemented.

Implemented command:

```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command.

Original problem: running integration tests was too manual.

Process before the wrapper:

- install the current version of NightShift
- run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
- copy the activation line from the output and run it
- `cd` into the generated directory
- run the task there, because running from the repo root does not find `nightshift.yaml`

Recommendation: implement a wrapper command, not just a loose script.

Target command:

```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

It should:

1. create the integration run
2. set up the venv
3. install NightShift from the current checkout
4. run `nightshift validate`
5. run the selected task from the generated project directory
6. print final status and artifact path

Useful variants:

```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --all
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
```

The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly.

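The "run steps in order from the generated project directory" idea can be sketched roughly as below. This is only an illustration: `run_steps` and its `(name, argv)` step shape are hypothetical and not NightShift's actual implementation. The point it shows is passing `cwd=` to each subprocess so that config lookup starts where `nightshift.yaml` lives, and stopping at the first failing step.

```python
import subprocess
from pathlib import Path


def run_steps(steps, project_dir):
    """Run (name, argv) steps in order from project_dir; stop at the first failure.

    Returns a list of (name, returncode) for the steps that actually ran.
    `steps` and this function are hypothetical names used for illustration.
    """
    project_dir = Path(project_dir)
    results = []
    for name, argv in steps:
        # cwd= is the important part: each command starts in the generated
        # project directory instead of wherever the wrapper was launched.
        completed = subprocess.run(argv, cwd=project_dir)
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            # Later steps (for example the task run) are skipped on failure.
            break
    return results
```

A real wrapper would fill `steps` with the venv setup, `nightshift validate`, and the task command; the last returned entry then gives the final status to print.
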
## P0/P1: Remove Multi-Candidate Workflow From Default Pastebin

Status: implemented for the default pastebin template and tutorial example.

Original idea:

- The multi-candidate workflow does not add as much as expected.
- Keep it as an example, maybe `example-multiagent`.

Recommendation: yes. Remove it from the default pastebin tutorial.

Reason:

- Pastebin is becoming the reliability harness.
- Multi-candidate fallback makes artifacts harder to reason about.
- It adds model variability while we are still debugging pipeline behavior.

Better split:

```text
tutorial-pastebin
tutorial-pastebin-multiagent
```

or:

```text
examples/templates/multiagent-fallback
```

Default pastebin should be boring:

```text
planner -> semantic_context -> context -> implement -> validate -> test -> review
```

Use one strong implementer first. Add fallback only in a separate experiment template.

## P1: Add A Qwen3 / 30B Pastebin Variant

Status: implemented as the default pastebin model path using `qwen3-coder:30b`.

Original idea:

- Use a non-coder model for planner roles.
- Try `qwen3.6:27b` for planning.
- Use `qwen3-coder:30b` for implementer and code-heavy roles.

Recommendation: viable, but make this a variant, not the default.

kass reply: No, let's make this the default. `qwen3-coder:30b` is fast for me now, for some reason.

Suggested template/config:

```text
tutorial-pastebin-qwen3
```

Possible role split:

- planner: `qwen3.6:27b`
- reviewer/debugger: `qwen3.6:27b`
- implementer: `qwen3-coder:30b` or exact local 30B coder model name

Important: confirm exact model names with:

```powershell
ollama list
```

reply: I did. It's `qwen3-coder:30b`.

Use 30B where it pays:

- first implementation for hard tasks
- repair after concrete test failure
- schema/database changes
- multi-file changes

Do not blindly make every stage 30B if it is slow.

reply: It's not slow now! `qwen3-coder:30b`

## P2: Expose More Model Parameters

Status: implemented for the practical first set.

Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to the existing `temperature`.

Original question:

- What else besides temperature is available?
- Are any worth optimizing?

Likely useful for Ollama:

- `temperature`
- `num_ctx`
- `num_predict`
- `seed`
- `stop`
- maybe `top_p`, `top_k`, `repeat_penalty`

Recommendation: add only a small practical set first.

Useful config shape:

```yaml
temperature: 0.1
num_ctx: 8192
num_predict: 4096
seed: 1
```

Most useful:

- `num_ctx`: larger repo/task context
- `num_predict`: caps runaway output
- `seed`: reproducibility, if supported consistently
- `temperature`: already useful; keep low for code
- `stop`: could help enforce file-block or diff-only contracts

Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it.

reply: Yup, let's put this in `nightshift.yaml` (optional parameters; if they aren't in there that's fine, but we should offer them).

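A minimal sketch of the "optional, fine if absent" behavior, assuming the model settings arrive as a plain mapping parsed from `nightshift.yaml`. The function name `build_ollama_options` and the mapping shape are hypothetical, not NightShift's actual code. The only outside fact relied on is that Ollama accepts these field names inside the `options` object of a generate or chat request.

```python
from typing import Any, Mapping

# The practical first set named in this section; other keys are ignored here.
OPTIONAL_FIELDS = ("temperature", "num_ctx", "num_predict", "seed", "stop")


def build_ollama_options(model_cfg: Mapping[str, Any]) -> dict[str, Any]:
    """Copy only the optional fields that are present and not None.

    Fields missing from the config are simply left out, so Ollama falls back
    to its own defaults for them.
    """
    options: dict[str, Any] = {}
    for name in OPTIONAL_FIELDS:
        value = model_cfg.get(name)
        if value is not None:
            options[name] = value
    return options
```

With this shape, a config that sets only `temperature` and `num_ctx` produces an options dict with just those two keys, and a config with none of them produces an empty dict.
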
## P1: Add Test Governance For Generated Tests

Original idea:

- Have a test governance layer for when agents write tests.
- A reviewer validates alignment with acceptance criteria.

Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.

The previous failures proved test-writing agents will:

- edit app code
- import nonexistent modules
- require undeclared dependencies
- inspect implementation internals
- write tests for future behavior

Governance should be deterministic first, model-reviewed second.

Deterministic checks:

- test-writing stage may only touch `tests/`
- tests compile
- tests import only allowed public interfaces
- tests do not import undeclared dependencies
- tests do not define Flask routes or app implementation
- test names match current task id or current artifact
- no future-task keywords unless accepted by current task AC

Then optional model reviewer checks acceptance-criteria alignment.

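Three of the deterministic checks above (only touch `tests/`, tests compile, imports stay within an allowlist) could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, changed paths are assumed to be repo-relative POSIX-style strings that are already normalized, and the allowlist is whatever the template declares as permitted top-level modules.

```python
import ast
from pathlib import PurePosixPath


def paths_outside_tests(changed_paths):
    """Return the changed paths that are not under the top-level tests/ directory."""
    return [p for p in changed_paths if PurePosixPath(p).parts[:1] != ("tests",)]


def disallowed_imports(source, allowed_roots):
    """Return top-level module names imported by `source` that are not allowed.

    ast.parse raises SyntaxError when the test file does not compile, which
    covers the "tests compile" check as a side effect.
    """
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as `from . import helpers`.
            found.add(node.module.split(".")[0])
    return sorted(found - set(allowed_roots))
```

Both functions return an empty list when the generated tests pass the check, so a governance stage can fail the run as soon as either list is non-empty.
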
## P2: Add A Test Analyzer Agent For TDD

Original idea:

- Analyze tests.
- Translate them into direct instructions for the implementer.
- Maybe implement using agent YAML definitions without new NightShift features.

Recommendation: viable, but defer until generated tests are stable.

Possible pipeline:

```text
write_tests -> validate_tests -> analyze_tests -> implement
```

Analyzer output should be concrete:

```text
Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.

Do not modify:
- tests/test_task001.py
```

This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.

## P2/P3: Add A Test Planner

Original idea:

- A test planner understands acceptance criteria and code.
- Provides input to the next stage about constraints and code, especially for non-TDD.

Recommendation: maybe, but defer.

This overlaps with:

- planner
- test analyzer
- test governance

Too many planning-ish stages can make the pipeline bloated and contradictory.

If implemented later, keep it focused:

```text
test_planner -> write_tests -> test_governance -> implement
```

For now, fold this idea into the future test governance/analyzer work.

## P1: Add Fixed Tests For All Pastebin Tasks

Status: mostly implemented in the template.

Current fixed tests:

```text
tests/test_task001.py
tests/test_task002.py
tests/test_task003.py
tests/test_task004.py
tests/test_task005.py
```

Important design:

```yaml
python -m pytest -q tests/test_{task_id_compact}.py
```

This lets all future task tests exist without breaking earlier tasks.

Next step: validate these through integration runs, one task at a time.

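How the `{task_id_compact}` placeholder resolves can be sketched as follows. The compaction rule (drop the hyphen, lowercase) is an assumption inferred from `TASK-001` mapping to `tests/test_task001.py`; the function names are hypothetical and NightShift's real rendering may differ.

```python
def compact_task_id(task_id: str) -> str:
    """Assumed rule: 'TASK-001' -> 'task001' (remove hyphens, lowercase)."""
    return task_id.replace("-", "").lower()


def render_test_command(template: str, task_id: str) -> str:
    """Fill the {task_id_compact} placeholder in a test command template."""
    return template.format(task_id_compact=compact_task_id(task_id))
```

Because each task renders to its own file, running `TASK-002` only collects `tests/test_task002.py`, and the tests for later tasks can sit in the tree without failing earlier runs.
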
## P1: Add `nightshift integ-report`

Status: implemented as a first-pass artifact summarizer.

New idea.

Summarize latest integration run across tasks:

```text
TASK-001 complete in 1 retry
TASK-002 failed at validate_patch
Root cause: protected tests modified
Artifacts: ...
```

Right now we inspect artifacts manually. NightShift should do more of that.

Possible command:

```powershell
python -m nightshift.cli integ-report --latest
```

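The summary format above could be produced from per-task records along these lines. This is a sketch only: `TaskResult` and its fields are hypothetical stand-ins for whatever the run artifacts actually store, and the artifact path line is left out because its shape is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    # Hypothetical record; field names are illustrative, not NightShift's schema.
    task_id: str
    status: str  # "complete" or "failed"
    retries: int = 0
    failed_stage: Optional[str] = None
    root_cause: Optional[str] = None


def format_report(results):
    """Render one or two summary lines per task, in the order given."""
    lines = []
    for r in results:
        if r.status == "complete":
            noun = "retry" if r.retries == 1 else "retries"
            lines.append(f"{r.task_id} complete in {r.retries} {noun}")
        else:
            lines.append(f"{r.task_id} failed at {r.failed_stage}")
            if r.root_cause:
                lines.append(f"Root cause: {r.root_cause}")
    return "\n".join(lines)
```

Keeping the formatter separate from artifact loading makes it easy to unit test the report text without a real integration run on disk.
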
## P1: Add Task-Test Preflight To `validate`

Status: implemented.

`nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing.

Original note: partially implemented at run time.

Current behavior:

- task command placeholders can render paths like `tests/test_task002.py`
- `run_task` preflight fails before invoking agents if the task-specific test file is missing

Better behavior:

```powershell
nightshift validate
```

should warn or fail:

```text
TASK-003 expects tests/test_task003.py and it exists.
TASK-004 expects tests/test_task004.py and it exists.
```

This catches missing fixed tests earlier.

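The preflight itself reduces to an existence check over the rendered paths. A minimal sketch, assuming the expected test path per task has already been rendered from the task command; `missing_task_tests` and its argument shape are hypothetical, not the actual `validate` code.

```python
from pathlib import Path


def missing_task_tests(project_dir, expected_tests):
    """Return {task_id: relative_path} for expected test files that do not exist.

    `expected_tests` maps a task id to the test path rendered from its command,
    relative to the project directory.
    """
    root = Path(project_dir)
    return {
        task_id: rel_path
        for task_id, rel_path in expected_tests.items()
        if not (root / rel_path).is_file()
    }
```

An empty result means every configured task has its fixed test in place; a non-empty result gives `validate` the exact task ids and paths to report before any agent is invoked.
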
## P2: Add Run Comparison

New idea.

Useful once we are comparing 14B vs 30B runs:

```powershell
nightshift compare-runs --latest 5
```

Show:

- model
- task
- retries
- failure stage
- final reason
- runtime
- token estimate

This should come after `integ-test` and `integ-report`.

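One possible shape for the comparison output is a plain aligned table with one row per run. The sketch below assumes each run is available as a dict; the key names in `COLUMNS` are hypothetical labels for the fields listed above, and missing values are shown as `-`.

```python
# Hypothetical column keys, one per field in the "Show" list above.
COLUMNS = (
    "model", "task", "retries", "failure_stage",
    "final_reason", "runtime_s", "token_estimate",
)


def format_comparison(runs):
    """Render runs (a list of dicts) as a left-aligned text table with a header row."""
    rows = [[str(run.get(col, "-")) for col in COLUMNS] for run in runs]
    # Each column is as wide as its longest cell, header included.
    widths = [
        max([len(col)] + [len(row[i]) for row in rows])
        for i, col in enumerate(COLUMNS)
    ]

    def fmt(cells):
        return "  ".join(cell.ljust(w) for cell, w in zip(cells, widths)).rstrip()

    return "\n".join([fmt(COLUMNS)] + [fmt(row) for row in rows])
```

A plain-text table keeps the output easy to paste into notes like this file; a machine-readable format could be added later if the comparison needs to feed other tooling.
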