nightshift/docs/ideas.md
K. Hodges f7fed4535b Add tutorial integration workflow helpers
- Add `integ-test` to create, set up, validate, and run integration template tasks
  - Add `integ-report` to summarize latest integration run artifacts
  - Switch default pastebin template from model fallback to single `qwen3-coder:30b`
  - Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
  - Add `nightshift validate` preflight for task-specific test files
  - Update pastebin docs, config reference, and ideas tracking
  - Add tests for integration helpers, task-test validation, config parsing, and template expectations
2026-05-21 03:46:27 -07:00

8.8 KiB

Ideas TODO

This file is now prioritized inline. Priority scale:

  • P0: do next; directly improves current feedback loop
  • P1: important after the current loop is usable
  • P2: useful, but only after basics are stable
  • P3: defer or maybe reject

P0: Make Integration Tests Easy To Run

Status: implemented.

Implemented command:

python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001

It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use --dry-run to preview the setup and task command.

Original problem: running integration tests was too manual.

Previous process:

  • install the current version of NightShift
  • run python -m nightshift.cli integ-run --template tutorial-pastebin --setup
  • copy the activation line from the output and run it
  • cd into the generated directory
  • run the task there, because running from the repo root does not find nightshift.yaml

Recommendation: implement a wrapper command, not just a loose script.

Target command:

python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001

It should:

  1. create the integration run
  2. set up the venv
  3. install NightShift from the current checkout
  4. run nightshift validate
  5. run the selected task from the generated project directory
  6. print final status and artifact path

Useful variants:

python -m nightshift.cli integ-test --template tutorial-pastebin --all
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3

The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle cwd correctly.
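A minimal sketch of the cwd handling, assuming the wrapper shells out with `subprocess`. The helper name `run_in_project` is hypothetical and is not taken from the NightShift source.

```python
import subprocess
from pathlib import Path


def run_in_project(project_dir, argv):
    """Run argv with the working directory set to the generated project.

    Hypothetical helper: passing cwd means a tool that looks for
    nightshift.yaml in the working directory finds it without the
    caller having to cd first.
    """
    project_dir = Path(project_dir).resolve()
    if not project_dir.is_dir():
        raise FileNotFoundError(f"project directory not found: {project_dir}")
    # check=False so the caller can report the exit code instead of crashing.
    return subprocess.run(
        argv,
        cwd=project_dir,
        capture_output=True,
        text=True,
        check=False,
    )
```

The caller can then print `returncode` and the artifact path as the final status, regardless of where the wrapper itself was started.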

P0/P1: Remove Multi-Candidate Workflow From Default Pastebin

Status: implemented for the default pastebin template and tutorial example.

Original idea:

  • The multi-candidate workflow does not add as much as expected.
  • Keep it as an example, maybe example-multiagent.

Recommendation: yes. Remove it from the default pastebin tutorial.

Reason:

  • Pastebin is becoming the reliability harness.
  • Multi-candidate fallback makes artifacts harder to reason about.
  • It adds model variability while we are still debugging pipeline behavior.

Better split:

tutorial-pastebin
tutorial-pastebin-multiagent

or:

examples/templates/multiagent-fallback

Default pastebin should be boring:

planner -> semantic_context -> context -> implement -> validate -> test -> review

Use one strong implementer first. Add fallback only in a separate experiment template.
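To illustrate why the linear pipeline is easier to debug, here is a rough sketch of a sequential runner that stops at the first failing stage. The stage names come from the pipeline above; `run_pipeline` and the `handlers` mapping are hypothetical and do not describe how NightShift actually dispatches stages.

```python
DEFAULT_STAGES = [
    "planner",
    "semantic_context",
    "context",
    "implement",
    "validate",
    "test",
    "review",
]


def run_pipeline(stages, handlers):
    """Run stages in order and stop at the first failure.

    handlers maps a stage name to a zero-argument callable that returns
    True on success. Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in stages:
        if not handlers[stage]():
            return completed, stage
        completed.append(stage)
    return completed, None
```

With no fallback branches, every run has at most one failure stage, which keeps the artifacts easy to attribute.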

P1: Add A Qwen3 / 30B Pastebin Variant

Status: implemented as the default pastebin model path using qwen3-coder:30b.

Original idea:

  • Use a non-coder model for planner roles.
  • Try qwen3.6:27b for planning.
  • Use qwen3-coder:30b for implementer and code-heavy roles.

Recommendation: viable, but make this a variant, not the default.

kass reply: No, let's make this the default. qwen3-coder:30b is fast for me now, for some reason.

Suggested template/config:

tutorial-pastebin-qwen3

Possible role split:

  • planner: qwen3.6:27b
  • reviewer/debugger: qwen3.6:27b
  • implementer: qwen3-coder:30b or exact local 30B coder model name

Important: confirm exact model names with:

ollama list

reply: I did. It's qwen3-coder:30b.

Use 30B where it pays:

  • first implementation for hard tasks
  • repair after concrete test failure
  • schema/database changes
  • multi-file changes

Do not blindly make every stage 30B if it is slow.

reply: It's not slow now with qwen3-coder:30b!

P2: Expose More Model Parameters

Status: implemented for the practical first set.

Supported optional Ollama fields now include num_ctx, num_predict, seed, and stop, in addition to existing temperature.

Original question:

  • What else besides temperature is available?
  • Are any worth optimizing?

Likely useful for Ollama:

  • temperature
  • num_ctx
  • num_predict
  • seed
  • stop
  • maybe top_p, top_k, repeat_penalty

Recommendation: add only a small practical set first.

Useful config shape:

temperature: 0.1
num_ctx: 8192
num_predict: 4096
seed: 1

Most useful:

  • num_ctx: larger repo/task context
  • num_predict: caps runaway output
  • seed: reproducibility, if supported consistently
  • temperature: already useful; keep low for code
  • stop: could help enforce file-block or diff-only contracts

Defer tuning top_p, top_k, and repeat_penalty unless a specific model needs it.
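One way to keep these fields optional is to forward only the ones that are actually set. The sketch below assumes each role's config is available as a plain dict and that the result is sent as the Ollama request's `options` object. The function name `build_ollama_options` is hypothetical.

```python
OPTIONAL_OLLAMA_FIELDS = ("temperature", "num_ctx", "num_predict", "seed", "stop")


def build_ollama_options(role_config):
    """Copy only the optional fields that are present and not None.

    Anything else in role_config (for example the model name) is ignored,
    so a config with none of these fields yields an empty dict.
    """
    options = {}
    for name in OPTIONAL_OLLAMA_FIELDS:
        value = role_config.get(name)
        if value is not None:
            options[name] = value
    return options
```

Omitted fields never reach the request, so the model's own defaults apply unless the config overrides them.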

reply: Yup, let's put this in nightshift.yaml as optional parameters. If they aren't in there, that's fine, but we should offer them.

P1: Add Test Governance For Generated Tests

Original idea:

  • Have a test governance layer for when agents write tests.
  • A reviewer validates alignment with acceptance criteria.

Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.

The previous failures proved test-writing agents will:

  • edit app code
  • import nonexistent modules
  • require undeclared dependencies
  • inspect implementation internals
  • write tests for future behavior

Governance should be deterministic first, model-reviewed second.

Deterministic checks:

  • test-writing stage may only touch tests/
  • tests compile
  • tests import only allowed public interfaces
  • tests do not import undeclared dependencies
  • tests do not define Flask routes or app implementation
  • test names match current task id or current artifact
  • no future-task keywords unless accepted by current task AC

Then optional model reviewer checks acceptance-criteria alignment.
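Two of the deterministic checks above (tests/ only, and allowed imports only) can be sketched with the standard library. This is an illustration under assumptions, not the planned implementation: the function names are hypothetical, and changed paths are assumed to be POSIX-style paths relative to the project root.

```python
import ast
from pathlib import PurePosixPath


def paths_outside_tests(changed_paths):
    """Return the changed paths that are not under the top-level tests/ dir."""
    return [p for p in changed_paths if PurePosixPath(p).parts[:1] != ("tests",)]


def disallowed_imports(test_source, allowed_modules):
    """Return top-level modules imported by the test that are not allowed.

    ast.parse raises SyntaxError when the source does not compile, which
    covers the "tests compile" check as a side effect.
    """
    tree = ast.parse(test_source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped in this sketch.
            found.add(node.module.split(".")[0])
    return sorted(found - set(allowed_modules))
```

Both checks need no model call, so they can run before any reviewer stage and fail fast with a concrete reason.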

P2: Add A Test Analyzer Agent For TDD

Original idea:

  • Analyze tests.
  • Translate them into direct instructions for the implementer.
  • Maybe implement using agent YAML definitions without new NightShift features.

Recommendation: viable, but defer until generated tests are stable.

Possible pipeline:

write_tests -> validate_tests -> analyze_tests -> implement

Analyzer output should be concrete:

Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.

Do not modify:
- tests/test_task001.py

This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.

P2/P3: Add A Test Planner

Original idea:

  • A test planner understands acceptance criteria and code.
  • Provides input to the next stage about constraints and code, especially for non-TDD.

Recommendation: maybe, but defer.

This overlaps with:

  • planner
  • test analyzer
  • test governance

Too many planning-ish stages can make the pipeline bloated and contradictory.

If implemented later, keep it focused:

test_planner -> write_tests -> test_governance -> implement

For now, fold this idea into the future test governance/analyzer work.

P1: Add Fixed Tests For All Pastebin Tasks

Status: mostly implemented in the template.

Current fixed tests:

tests/test_task001.py
tests/test_task002.py
tests/test_task003.py
tests/test_task004.py
tests/test_task005.py

Important design:

python -m pytest -q tests/test_{task_id_compact}.py

This lets all future task tests exist without breaking earlier tasks.
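A small sketch of how the placeholder might be rendered. The compaction rule (lowercase, drop non-alphanumeric characters, so TASK-001 becomes task001) is an assumption inferred from the file names above, and both function names are hypothetical.

```python
def compact_task_id(task_id):
    """Assumed rule: lowercase and drop non-alphanumerics (TASK-001 -> task001)."""
    return "".join(ch for ch in task_id.lower() if ch.isalnum())


def render_test_command(template, task_id):
    """Fill the {task_id_compact} placeholder in a task command template."""
    return template.format(task_id_compact=compact_task_id(task_id))
```

Because each task resolves to exactly one test file, running TASK-001 never collects the tests written for later tasks.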

Next step: validate these through integration runs, one task at a time.

P1: Add nightshift integ-report

Status: implemented as a first-pass artifact summarizer.

New idea.

Summarize latest integration run across tasks:

TASK-001 complete in 1 retry
TASK-002 failed at validate_patch
Root cause: protected tests modified
Artifacts: ...

Right now we inspect artifacts manually. NightShift should do more of that.

Possible command:

python -m nightshift.cli integ-report --latest
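As a rough illustration of the summary format, the sketch below turns per-task records into the lines shown above. The `TaskResult` fields are hypothetical stand-ins for whatever the run artifacts actually record.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record shape; real artifacts may store this differently.
    task_id: str
    status: str  # "complete" or "failed"
    retries: int = 0
    failed_stage: str = ""
    root_cause: str = ""


def summarize(results):
    """Render one or two summary lines per task."""
    lines = []
    for r in results:
        if r.status == "complete":
            noun = "retry" if r.retries == 1 else "retries"
            lines.append(f"{r.task_id} complete in {r.retries} {noun}")
        else:
            lines.append(f"{r.task_id} failed at {r.failed_stage}")
            if r.root_cause:
                lines.append(f"Root cause: {r.root_cause}")
    return "\n".join(lines)
```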

P1: Add Task-Test Preflight To validate

Status: implemented.

nightshift validate now renders task command placeholders for every task and fails early if a configured tests/test_*.py path is missing.

Before this change, the check was only partially implemented at run time.

Current behavior:

  • task command placeholders can render paths like tests/test_task002.py
  • run_task preflight fails before invoking agents if the task-specific test file is missing

Better behavior:

nightshift validate

should report each expected test file, and warn or fail when one is missing:

TASK-003 expects tests/test_task003.py and it exists.
TASK-004 expects tests/test_task004.py and it exists.

This catches missing fixed tests earlier.

P2: Add Run Comparison

New idea.

Useful once comparing 14B vs 30B:

nightshift compare-runs --latest 5

Show:

  • model
  • task
  • retries
  • failure stage
  • final reason
  • runtime
  • token estimate

This should come after integ-test and integ-report.
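The comparison output could be a plain aligned table over the fields listed above. This sketch assumes each run is a dict keyed by those field names. The key spellings and `format_runs` are hypothetical.

```python
COLUMNS = (
    "model",
    "task",
    "retries",
    "failure_stage",
    "final_reason",
    "runtime",
    "token_estimate",
)


def format_runs(runs):
    """Render runs as a left-aligned text table; missing fields are blank."""
    rows = [COLUMNS] + [
        tuple(str(run.get(col, "")) for col in COLUMNS) for run in runs
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(len(COLUMNS))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )
```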