- Add `integ-test` to create, set up, validate, and run integration template tasks
- Add `integ-report` to summarize latest integration run artifacts
- Switch default pastebin template from model fallback to single `qwen3-coder:30b`
- Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
- Add `nightshift validate` preflight for task-specific test files
- Update pastebin docs, config reference, and ideas tracking
- Add tests for integration helpers, task-test validation, config parsing, and template expectations
Ideas TODO
This file is now prioritized inline. Priority scale:
- P0: do next; directly improves current feedback loop
- P1: important after the current loop is usable
- P2: useful, but only after basics are stable
- P3: defer or maybe reject
P0: Make Integration Tests Easy To Run
Status: implemented.
Implemented command:
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use --dry-run to preview the setup and task command.
Original problem: running integration tests was too manual.
Process before the wrapper:
- install the current version of NightShift
- run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
- copy the activation line from the output and run it
- `cd` into the generated directory
- run the task there, because running from the repo root does not find `nightshift.yaml`
Recommendation: implement a wrapper command, not just a loose script.
Target command:
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
It should:
- create the integration run
- set up the venv
- install NightShift from the current checkout
- run `nightshift validate`
- run the selected task from the generated project directory
- print final status and artifact path
Useful variants:
python -m nightshift.cli integ-test --template tutorial-pastebin --all
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle cwd correctly.
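A minimal sketch of the cwd handling, assuming the wrapper knows (or can locate) the generated project directory. The helper names `find_project_root` and `run_in_project` are hypothetical and are not NightShift APIs; the only detail taken from the notes above is that the task has to run from the directory containing `nightshift.yaml`.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def find_project_root(start: Path, marker: str = "nightshift.yaml") -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # no config found anywhere above `start`


def run_in_project(argv: Sequence[str], project_dir: Path, dry_run: bool = False) -> int:
    """Run `argv` with the generated project as cwd instead of the repo root."""
    if dry_run:
        # Preview only: show where the command would run, execute nothing.
        print(f"[dry-run] cwd={project_dir} cmd={' '.join(argv)}")
        return 0
    # cwd= is what removes the need to `cd` by hand before running the task.
    return subprocess.run(list(argv), cwd=project_dir, check=False).returncode
```

Passing `cwd=` to the subprocess keeps the caller's shell where it is, so the wrapper can be started from the repo root without the config lookup failing.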
P0/P1: Remove Multi-Candidate Workflow From Default Pastebin
Status: implemented for the default pastebin template and tutorial example.
Original idea:
- The multi-candidate workflow does not add as much as expected.
- Keep it as an example, maybe `example-multiagent`.
Recommendation: yes. Remove it from the default pastebin tutorial.
Reason:
- Pastebin is becoming the reliability harness.
- Multi-candidate fallback makes artifacts harder to reason about.
- It adds model variability while we are still debugging pipeline behavior.
Better split:
tutorial-pastebin
tutorial-pastebin-multiagent
or:
examples/templates/multiagent-fallback
Default pastebin should be boring:
planner -> semantic_context -> context -> implement -> validate -> test -> review
Use one strong implementer first. Add fallback only in a separate experiment template.
P1: Add A Qwen3 / 30B Pastebin Variant
Status: implemented as the default pastebin model path using qwen3-coder:30b.
Original idea:
- Use a non-coder model for planner roles.
- Try `qwen3.6:27b` for planning.
- Use `qwen3-coder:30b` for implementer and code-heavy roles.
Recommendation: viable, but make this a variant, not the default.
kass reply: No, let's make this the default. qwen3-coder:30b is fast now for me for some reason.
Suggested template/config:
tutorial-pastebin-qwen3
Possible role split:
- planner: `qwen3.6:27b`
- reviewer/debugger: `qwen3.6:27b`
- implementer: `qwen3-coder:30b` or exact local 30B coder model name
Important: confirm exact model names with:
ollama list
reply: I did; it's qwen3-coder:30b.
Use 30B where it pays:
- first implementation for hard tasks
- repair after concrete test failure
- schema/database changes
- multi-file changes
Do not blindly make every stage 30B if it is slow.
reply: It's not slow now! qwen3-coder:30b
P2: Expose More Model Parameters
Status: implemented for the practical first set.
Supported optional Ollama fields now include num_ctx, num_predict, seed, and stop, in addition to existing temperature.
Original question:
- What else besides temperature is available?
- Are any worth optimizing?
Likely useful for Ollama:
- `temperature`
- `num_ctx`
- `num_predict`
- `seed`
- `stop`
- maybe `top_p`, `top_k`, `repeat_penalty`
Recommendation: add only a small practical set first.
Useful config shape:
temperature: 0.1
num_ctx: 8192
num_predict: 4096
seed: 1
Most useful:
- `num_ctx`: larger repo/task context
- `num_predict`: caps runaway output
- `seed`: reproducibility, if supported consistently
- `temperature`: already useful; keep low for code
- `stop`: could help enforce file-block or diff-only contracts
Defer tuning top_p, top_k, and repeat_penalty unless a specific model needs it.
reply: Yup, let's put this in nightshift.yaml as optional parameters. If they aren't in there, that's fine, but we should offer them.
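A minimal sketch of how optional fields could be forwarded, assuming each role's settings arrive as a plain mapping parsed from nightshift.yaml. The function name `build_ollama_options` and the mapping shape are assumptions for illustration, not NightShift's actual config code.

```python
from typing import Any, Dict, Mapping

# Optional fields named in the notes above; anything else is ignored here.
OPTIONAL_OLLAMA_FIELDS = ("temperature", "num_ctx", "num_predict", "seed", "stop")


def build_ollama_options(role_config: Mapping[str, Any]) -> Dict[str, Any]:
    """Copy only the optional fields that are actually set in the config."""
    options: Dict[str, Any] = {}
    for field in OPTIONAL_OLLAMA_FIELDS:
        value = role_config.get(field)
        if value is None:
            continue  # absent fields are left to the model's own defaults
        if field == "stop" and isinstance(value, str):
            value = [value]  # accept a single stop string as a one-item list
        options[field] = value
    return options
```

Leaving unset fields out of the result, rather than filling in defaults, keeps existing configs that only set temperature behaving as before.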
P1: Add Test Governance For Generated Tests
Original idea:
- Have a test governance layer for when agents write tests.
- A reviewer validates alignment with acceptance criteria.
Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.
The previous failures proved test-writing agents will:
- edit app code
- import nonexistent modules
- require undeclared dependencies
- inspect implementation internals
- write tests for future behavior
Governance should be deterministic first, model-reviewed second.
Deterministic checks:
- test-writing stage may only touch `tests/`
- tests compile
- tests import only allowed public interfaces
- tests do not import undeclared dependencies
- tests do not define Flask routes or app implementation
- test names match current task id or current artifact
- no future-task keywords unless accepted by current task AC
Then optional model reviewer checks acceptance-criteria alignment.
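Several of the deterministic checks above can be sketched with the standard library alone. This is an illustration under assumptions: the function names, the allow-list of imports, and the problem messages are hypothetical, and only the path, compile, import, and route checks are covered.

```python
import ast
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def paths_outside_tests(changed_paths: Iterable[str], allowed_dir: str = "tests") -> List[str]:
    """Return changed paths that are not inside `allowed_dir`."""
    offenders = []
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Reject anything not rooted at tests/, and any ".." escape attempt.
        if parts[:1] != (allowed_dir,) or ".." in parts:
            offenders.append(path)
    return offenders


def check_test_source(source: str, allowed_imports: Set[str]) -> List[str]:
    """Return a list of problems found in one generated test file."""
    try:
        tree = ast.parse(source)  # "tests compile"
    except SyntaxError as exc:
        return [f"does not compile: {exc.msg} (line {exc.lineno})"]

    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or "<relative>"]
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # A decorator such as @app.route(...) means the test defines app code.
            for deco in node.decorator_list:
                target = deco.func if isinstance(deco, ast.Call) else deco
                if isinstance(target, ast.Attribute) and target.attr == "route":
                    problems.append(f"defines a route: {node.name}")
            continue
        else:
            continue
        for module in modules:
            if module.split(".")[0] not in allowed_imports:
                problems.append(f"disallowed import: {module}")
    return problems
```

Because these checks never call a model, they can run before the optional reviewer and reject a bad test file cheaply.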
P2: Add A Test Analyzer Agent For TDD
Original idea:
- Analyze tests.
- Translate them into direct instructions for the implementer.
- Maybe implement using agent YAML definitions without new NightShift features.
Recommendation: viable, but defer until generated tests are stable.
Possible pipeline:
write_tests -> validate_tests -> analyze_tests -> implement
Analyzer output should be concrete:
Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.
Do not modify:
- tests/test_task001.py
This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.
P2/P3: Add A Test Planner
Original idea:
- A test planner understands acceptance criteria and code.
- Provides input to the next stage about constraints and code, especially for non-TDD.
Recommendation: maybe, but defer.
This overlaps with:
- planner
- test analyzer
- test governance
Too many planning-ish stages can make the pipeline bloated and contradictory.
If implemented later, keep it focused:
test_planner -> write_tests -> test_governance -> implement
For now, fold this idea into the future test governance/analyzer work.
P1: Add Fixed Tests For All Pastebin Tasks
Status: mostly implemented in the template.
Current fixed tests:
tests/test_task001.py
tests/test_task002.py
tests/test_task003.py
tests/test_task004.py
tests/test_task005.py
Important design:
python -m pytest -q tests/test_{task_id_compact}.py
This lets all future task tests exist without breaking earlier tasks.
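A minimal sketch of the placeholder rendering, assuming `task_id_compact` is the task id lowercased with the hyphen removed (inferred from TASK-001 mapping to tests/test_task001.py). The helper names are hypothetical.

```python
def compact_task_id(task_id: str) -> str:
    """Turn an id like TASK-001 into task001 (assumed rule: drop '-', lowercase)."""
    return task_id.replace("-", "").lower()


def render_task_command(template: str, task_id: str) -> str:
    """Fill the task placeholders in a configured command template."""
    return template.format(
        task_id=task_id,
        task_id_compact=compact_task_id(task_id),
    )
```

Each task then runs only its own test file, so the files for later tasks can sit in tests/ without being collected early.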
Next step: validate these through integration runs, one task at a time.
P1: Add nightshift integ-report
Status: implemented as a first-pass artifact summarizer.
New idea.
Summarize latest integration run across tasks:
TASK-001 complete in 1 retry
TASK-002 failed at validate_patch
Root cause: protected tests modified
Artifacts: ...
Right now we inspect artifacts manually. NightShift should do more of that.
Possible command:
python -m nightshift.cli integ-report --latest
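A rough sketch of the summary formatting, assuming each task's artifacts have already been reduced to a small record. The `TaskResult` fields and the `summarize` helper are hypothetical; how NightShift actually stores run artifacts is not described here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class TaskResult:
    """Hypothetical per-task record distilled from run artifacts."""
    task_id: str
    status: str                        # "complete" or "failed"
    retries: int = 0
    failed_stage: Optional[str] = None
    root_cause: Optional[str] = None
    artifact_dir: Optional[str] = None


def summarize(results: Iterable[TaskResult]) -> List[str]:
    """Render one short block of lines per task, in the style shown above."""
    lines: List[str] = []
    for result in results:
        if result.status == "complete":
            noun = "retry" if result.retries == 1 else "retries"
            lines.append(f"{result.task_id} complete in {result.retries} {noun}")
        else:
            stage = result.failed_stage or "unknown stage"
            lines.append(f"{result.task_id} failed at {stage}")
            if result.root_cause:
                lines.append(f"Root cause: {result.root_cause}")
        if result.artifact_dir:
            lines.append(f"Artifacts: {result.artifact_dir}")
    return lines
```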
P1: Add Task-Test Preflight To validate
Status: implemented.
nightshift validate now renders task command placeholders for every task and fails early if a configured tests/test_*.py path is missing.
Original note: this was partially implemented at run time.
Current behavior:
- task command placeholders can render paths like `tests/test_task002.py`
- `run_task` preflight fails before invoking agents if the task-specific test file is missing
Better behavior:
nightshift validate
should report each expected test file, and warn or fail when one is missing:
TASK-003 expects tests/test_task003.py and it exists.
TASK-004 expects tests/test_task004.py and it exists.
This catches missing fixed tests earlier.
P2: Add Run Comparison
New idea.
Useful once comparing 14B vs 30B:
nightshift compare-runs --latest 5
Show:
- model
- task
- retries
- failure stage
- final reason
- runtime
- token estimate
This should come after integ-test and integ-report.
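A small sketch of the comparison output, assuming each run has already been reduced to one row with the fields listed above. `RunRow`, its field names and units, and `format_comparison` are hypothetical.

```python
from dataclasses import astuple, dataclass, fields
from typing import Iterable


@dataclass
class RunRow:
    """Hypothetical one-line summary of a single task run."""
    model: str
    task: str
    retries: int
    failure_stage: str
    final_reason: str
    runtime_s: float
    token_estimate: int


def format_comparison(rows: Iterable[RunRow]) -> str:
    """Render rows as a plain-text table with left-aligned, padded columns."""
    header = [f.name for f in fields(RunRow)]
    table = [header] + [[str(value) for value in astuple(row)] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(line[i]) for line in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in table
    )
```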