mirror of https://github.com/khodges42/nightShift.git
synced 2026-06-14 10:08:37 +00:00

Add tutorial integration workflow helpers

- Add `integ-test` to create, set up, validate, and run integration template tasks
- Add `integ-report` to summarize latest integration run artifacts
- Switch default pastebin template from model fallback to single `qwen3-coder:30b`
- Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
- Add `nightshift validate` preflight for task-specific test files
- Update pastebin docs, config reference, and ideas tracking
- Add tests for integration helpers, task-test validation, config parsing, and template expectations

Parent: e3679296fd
Commit: f7fed4535b

@@ -1,195 +0,0 @@

# Bugfix TODO

## Some issues with run --all

`reason=Stage 'review' requested unknown next stage 'None'`. It does not happen every time. I think something in the pattern is out of place here. Maybe it is related to whether the last task succeeded? Or to the last run?

## Going from individual tasks to --all fails

If you run `nightshift run --task TASK-001`, let it complete, and then run `nightshift run --all`, the run fails with `blocked by missing dependencies: TASK-001`. I think this is because the tasks get reset at the top of the run, but something that marks TASK-001 as complete then requires a manual reset.

`run --all` should start at the first task that is not done (it seems like it does).
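
A minimal sketch of the intended `run --all` selection, assuming a hypothetical `Task` record with `id`, `completed`, and `depends_on` fields (not NightShift's real task model). It treats a dependency as satisfied whenever that task is already marked complete, which is what the `run --task TASK-001` then `run --all` sequence seems to expect:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical task record; NightShift's real task model may differ.
    id: str
    completed: bool = False
    depends_on: list[str] = field(default_factory=list)


def first_runnable_task(tasks: list[Task]) -> Task | None:
    """Return the first incomplete task whose dependencies are all complete."""
    done = {task.id for task in tasks if task.completed}
    for task in tasks:
        if task.completed:
            continue  # already finished, e.g. by an earlier `run --task`
        if all(dep in done for dep in task.depends_on):
            return task
    return None  # everything is done, or every remaining task is blocked


tasks = [
    Task("TASK-001", completed=True),
    Task("TASK-002", depends_on=["TASK-001"]),
    Task("TASK-003", depends_on=["TASK-002"]),
]
print(first_runnable_task(tasks).id)  # TASK-002
```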

## Some kind of tool install feature

Runs keep failing on `flask_sqlalchemy` until I install it manually.

## Tutorial needs to include the `.` directory for imageboard

## Git status artifacts are noisy for non-git repositories

Observed artifact:

```text
# Git Status before

Available: false
Exit code: 128

fatal: not a git repository (or any of the parent directories): .git
```

Current behavior:

- NightShift continues when `require_clean_worktree: false`.
- `git-status-before.txt`, `git-status-after.txt`, and `diff.patch` may contain git errors.
- This is technically safe, but confusing for users running quickstart/demo projects outside git.

Desired behavior:

- Detect non-git repositories explicitly.
- Write a clearer artifact message such as:

```text
Git repository: false
Clean-worktree enforcement: skipped because require_clean_worktree is false
Diff artifact: unavailable because project is not a git repository
```

- Avoid treating non-git as a scary-looking failure when clean worktree is not required.

Acceptance criteria:

- Non-git projects produce readable git artifacts without fatal-looking output.
- `require_clean_worktree: true` still fails safely in non-git projects.
- Reports mention that git metadata/diff is unavailable because the project is not a git repo.
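
One way to produce the clearer artifact text, sketched as a hypothetical pure function that receives the exit code and stderr of a `git status` call the caller already made. Matching on the `not a git repository` substring is an assumption based on the observed artifact; the sketch keeps raw output for any other git failure and still fails safely when `require_clean_worktree` is true:

```python
from __future__ import annotations


def describe_git_availability(exit_code: int, stderr: str, require_clean_worktree: bool) -> str:
    """Return readable git-status artifact text (hypothetical helper)."""
    if exit_code == 0:
        return "Git repository: true\n"
    if "not a git repository" not in stderr.lower():
        # Some other git failure: keep the raw details for debugging.
        return f"Git repository: unknown\nExit code: {exit_code}\n\n{stderr.strip()}\n"
    if require_clean_worktree:
        # Enforcement cannot be honored without git, so fail safely.
        raise RuntimeError("require_clean_worktree is true, but the project is not a git repository")
    return (
        "Git repository: false\n"
        "Clean-worktree enforcement: skipped because require_clean_worktree is false\n"
        "Diff artifact: unavailable because project is not a git repository\n"
    )


stderr = "fatal: not a git repository (or any of the parent directories): .git"
print(describe_git_availability(128, stderr, require_clean_worktree=False), end="")
```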

## Git safe.directory / ownership conflicts on Windows

Observed context:

- Git can report dubious ownership or safe-directory errors when a repo was created or managed by a different Windows user identity.
- This may happen when using GitHub Desktop, WSL, admin shells, or multiple Windows accounts.

Current behavior:

- NightShift records the raw git error in artifacts.
- If `require_clean_worktree: true`, NightShift blocks execution.
- If `require_clean_worktree: false`, NightShift continues, but git status/diff artifacts can look like hard failures.

Desired behavior:

- Detect common `dubious ownership` / `safe.directory` messages.
- Write a clearer explanation in artifacts and reports.
- Suggest the exact remediation outside NightShift, for example:

```powershell
git config --global --add safe.directory <project-root>
```

Acceptance criteria:

- Safe-directory failures are classified separately from ordinary git failures.
- Users get actionable guidance.
- NightShift does not attempt to change global git config automatically.
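
A sketch of classifying safe-directory failures separately, assuming the caller has git's exit code and stderr. Recent git versions report this case as `fatal: detected dubious ownership in repository at '...'` and mention `safe.directory`, but wording may vary by version, so the match is a loose substring check. The enum and function names are hypothetical, and the hint only prints the command; nothing here touches git config:

```python
from __future__ import annotations

from enum import Enum


class GitFailure(Enum):
    NONE = "none"
    NOT_A_REPO = "not_a_repo"
    UNSAFE_DIRECTORY = "unsafe_directory"  # dubious ownership / safe.directory
    OTHER = "other"


def classify_git_failure(exit_code: int, stderr: str) -> GitFailure:
    if exit_code == 0:
        return GitFailure.NONE
    text = stderr.lower()
    if "dubious ownership" in text or "safe.directory" in text:
        return GitFailure.UNSAFE_DIRECTORY
    if "not a git repository" in text:
        return GitFailure.NOT_A_REPO
    return GitFailure.OTHER


def remediation_hint(failure: GitFailure, project_root: str) -> str | None:
    """Explain the fix for the user to run; never modifies global git config."""
    if failure is not GitFailure.UNSAFE_DIRECTORY:
        return None
    return (
        "Git refused to read this repository because of its ownership (safe.directory) check.\n"
        "If you trust this directory, run this outside NightShift:\n"
        f"  git config --global --add safe.directory {project_root}"
    )


message = "fatal: detected dubious ownership in repository at 'C:/work/demo'"
failure = classify_git_failure(128, message)
print(failure.name)
print(remediation_hint(failure, "C:/work/demo"))
```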

## Clarify docs around git requirements

Add to `QUICKSTART.md` and troubleshooting:

- Git is optional when `require_clean_worktree: false`.
- Git is required for clean-worktree enforcement and useful diffs.
- Non-git projects can still run pipelines.
- Git ownership/safe-directory errors affect git artifacts, not core task execution, unless clean-worktree enforcement is enabled.

## Console appears idle during long agent calls

Current behavior:

- Long Ollama calls can make `nightshift run` look frozen.
- Progress is only visible by inspecting `.nightshift/` artifacts or `ollama ps`.

Desired behavior:

- Print stage start/finish messages to the console.
- Include agent id, stage id, task id, and artifact path when available.
- Do not stream model output yet; just show lifecycle progress.

Acceptance criteria:

- Users can tell which stage is running.
- Long-running model calls no longer look like a hung process.
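
Lifecycle progress can be sketched as a context manager wrapped around each stage call. This is an illustration with hypothetical names (`stage_progress`, the example artifact path), not NightShift's logger API. It prints a start line, then a finished or failed line with elapsed time, and never streams model output:

```python
from __future__ import annotations

import io
import sys
import time
from contextlib import contextmanager
from typing import Iterator, TextIO


@contextmanager
def stage_progress(
    task_id: str,
    stage_id: str,
    agent_id: str | None = None,
    artifact_path: str | None = None,
    out: TextIO | None = None,
) -> Iterator[None]:
    """Print start/finish lines around a stage; model output is not streamed."""
    stream = sys.stdout if out is None else out
    label = f"[{task_id}] stage {stage_id}"
    if agent_id:
        label += f" (agent {agent_id})"
    print(f"{label}: started", file=stream, flush=True)
    started = time.monotonic()
    status = "failed"
    try:
        yield
        status = "finished"
    finally:
        # Runs on success and on exceptions, so a crash is also reported.
        elapsed = time.monotonic() - started
        suffix = f" -> {artifact_path}" if artifact_path else ""
        print(f"{label}: {status} in {elapsed:.1f}s{suffix}", file=stream, flush=True)


buffer = io.StringIO()
with stage_progress("TASK-001", "implement", agent_id="implementer",
                    artifact_path=".nightshift/runs/example", out=buffer):
    pass  # the long Ollama call would run here
print(buffer.getvalue(), end="")
```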

## Ollama output can make review stages fail if not structured

Current behavior:

- Review stages require `status: pass | fail | retry | escalate`.
- General-purpose model output may include prose before/after the structured fields.
- If no valid status is found, the review stage fails.

Desired behavior:

- Keep strict structured review parsing, but improve prompt templates and error messages.
- The artifact should clearly say the review output was unparseable and show the expected contract.

Acceptance criteria:

- Failed review parsing is easy to diagnose from `review.md` and `stage-results.md`.
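
A sketch of strict-but-tolerant review parsing: prose may surround the result, but exactly one whole line must match the contract, and the error names the contract so `review.md` is easy to diagnose. The regex and `ReviewParseError` are assumptions, not NightShift's current parser:

```python
from __future__ import annotations

import re

VALID_STATUSES = ("pass", "fail", "retry", "escalate")
EXPECTED_CONTRACT = "status: pass | fail | retry | escalate"
# One whole line such as "status: pass"; case-insensitive, tolerates CRLF.
STATUS_LINE = re.compile(r"^[ \t]*status[ \t]*:[ \t]*(\w+)[ \t]*\r?$", re.IGNORECASE | re.MULTILINE)


class ReviewParseError(ValueError):
    """Raised when review output lacks exactly one valid status line."""


def parse_review_status(output: str) -> str:
    matches = STATUS_LINE.findall(output)
    if len(matches) == 1 and matches[0].lower() in VALID_STATUSES:
        return matches[0].lower()
    if not matches:
        reason = "no 'status:' line found"
    elif len(matches) > 1:
        reason = f"{len(matches)} 'status:' lines found, expected exactly one"
    else:
        reason = f"unknown status {matches[0]!r}"
    raise ReviewParseError(
        f"Review output was unparseable: {reason}.\n"
        f"Expected a line matching: {EXPECTED_CONTRACT}"
    )


print(parse_review_status("The patch looks correct.\nstatus: pass\nNo further notes."))  # pass
```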

## `echo` fake agents do not behave consistently across shells

Current behavior:

- Starter templates use `command: echo`.
- Depending on shell/platform, `echo` may not preserve stdin or may only echo arguments.
- This can make fake agent artifacts less useful.

Desired behavior:

- Replace fake-agent defaults with small Python one-liners or documented fake-agent scripts.
- Keep examples cross-platform.

Acceptance criteria:

- Starter project produces predictable fake-agent output on Windows PowerShell/cmd and Unix shells.
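
A minimal sketch of a documented fake-agent script that avoids shell `echo` entirely. It assumes NightShift sends the prompt on stdin (the `echo` note implies this) and that the first argument names the agent; both are assumptions about the command-agent contract, and the script itself is hypothetical:

```python
"""Hypothetical cross-platform fake agent: echo the prompt it receives on stdin."""
from __future__ import annotations

import io
import sys
from typing import TextIO


def main(argv: list[str] | None = None, stdin: TextIO | None = None,
         stdout: TextIO | None = None) -> int:
    args = sys.argv[1:] if argv is None else argv
    agent_id = args[0] if args else "fake-agent"
    source = sys.stdin if stdin is None else stdin
    sink = sys.stdout if stdout is None else stdout
    prompt = source.read()  # Python reads stdin, so shell quirks do not matter
    sink.write(f"# Fake output from {agent_id}\n\n")
    sink.write(f"Prompt lines received: {len(prompt.splitlines())}\n\n")
    sink.write(prompt if prompt.endswith("\n") or not prompt else prompt + "\n")
    return 0


demo_out = io.StringIO()
main(["planner"], io.StringIO("Plan TASK-001\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Saved as a hypothetical `fake_agent.py` ending with `raise SystemExit(main())`, it reads stdin through Python rather than the shell, so PowerShell, cmd, and POSIX shells should produce the same output.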

## `unittest discover` behavior depends on test package layout

Current behavior:

- Python 3.14 returned `NO TESTS RAN` with exit code 5 for an example project until `tests/__init__.py` was added.
- Users may hit the same issue in fresh target repos.

Desired behavior:

- Document this in troubleshooting.
- Consider making quickstart templates include `tests/__init__.py`.

Acceptance criteria:

- The quickstart test command works in a fresh copied example.
- Troubleshooting mentions what to do if `NO TESTS RAN` appears.
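
Until templates ship `tests/__init__.py`, a troubleshooting check could flag the layout that triggered `NO TESTS RAN` here. This sketch (hypothetical `find_test_dirs_missing_init`) only detects missing package markers under `tests/`; it does not diagnose other discovery problems such as a non-matching `-p` pattern:

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_test_dirs_missing_init(project_root: Path) -> list[Path]:
    """List directories under tests/ that hold test_*.py files but no __init__.py."""
    tests_root = project_root / "tests"
    if not tests_root.is_dir():
        return []
    candidates = [tests_root, *sorted(p for p in tests_root.rglob("*") if p.is_dir())]
    return [
        directory
        for directory in candidates
        if any(directory.glob("test_*.py")) and not (directory / "__init__.py").exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests").mkdir()
    (root / "tests" / "test_app.py").write_text("def test_ok():\n    assert True\n")
    before = [p.name for p in find_test_dirs_missing_init(root)]
    (root / "tests" / "__init__.py").touch()
    after = [p.name for p in find_test_dirs_missing_init(root)]

print(before, after)  # ['tests'] []
```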

## Task completion can mark tasks complete even if no source changed

Current behavior:

- A pipeline can pass with fake agents and passing tests, then mark the task complete.
- This is expected for fake/demo mode but surprising when users expect code edits.

Desired behavior:

- Add a warning when a task completes and git/diff detects no source changes, where git is available.
- Documentation should explain fake-agent mode vs editing-agent mode.

Acceptance criteria:

- Users are less likely to mistake artifact generation for code modification.
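
The warning can be sketched as a pure function over the paths git reports as changed. It assumes the caller gathers them with something like `git status --porcelain` (which, unlike `git diff --name-only`, also lists untracked files), passes `None` when the project is not a git repo, and that NightShift artifacts live under `.nightshift/`; the function name and message are hypothetical:

```python
from __future__ import annotations


def no_change_warning(changed_paths: list[str] | None,
                      artifact_prefix: str = ".nightshift/") -> str | None:
    """Warn if a completed task changed nothing outside NightShift's own artifacts.

    changed_paths is None when git is unavailable; then no claim is made.
    """
    if changed_paths is None:
        return None
    source_changes = [path for path in changed_paths if not path.startswith(artifact_prefix)]
    if source_changes:
        return None
    return (
        "Warning: task marked complete, but git shows no source changes. "
        "The agents may only have produced artifacts (fake-agent mode?)."
    )


print(no_change_warning([".nightshift/runs/example/review.md"]))
```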

## Dashboard requires Flask but dependency is optional

Current behavior:

- `nightshift web` fails with a helpful message if Flask is missing.
- README mentions `pip install flask`, but install extras are not defined.

Desired behavior:

- Add an optional dependency group such as `nightshift[web]` later.
- Keep graceful error behavior.

Acceptance criteria:

- Users have one documented install command for dashboard support.

@@ -62,11 +62,19 @@ Ollama agent:

```yaml
planner:
  backend: ollama
  model: qwen2.5-coder:14b
  model: qwen3-coder:30b
  base_url: http://localhost:11434
  system_prompt: agents/planner.md
  temperature: 0.2
  num_ctx: 8192
  num_predict: 4096
  seed: 1
  stop:
    - STOP
```

Optional Ollama generation options currently supported by NightShift are `temperature`, `num_ctx`, `num_predict`, `seed`, and `stop`.

## `pipeline`

- `max_task_retries`: task retry limit.

@@ -76,6 +84,7 @@ planner:

Command stage options:

- `commands`: command strings.
- Command strings may use task placeholders: `{task_id}`, `{task_id_lower}`, `{task_id_slug}`, and `{task_id_compact}`.
- `shell`: defaults to true. Set false for argv-style execution.
- `timeout_seconds`: per-stage timeout override.
- `working_dir`: command working directory inside project root.
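
A sketch of placeholder rendering. Only `task_id_compact` is pinned down elsewhere in these docs (`tests/test_{task_id_compact}.py` renders to paths like `tests/test_task002.py`); the lower and slug derivations below are assumptions. It substitutes known names explicitly instead of using `str.format`, so unrelated braces in shell commands such as `${HOME}` survive:

```python
from __future__ import annotations

import re


def task_placeholders(task_id: str) -> dict[str, str]:
    lower = task_id.lower()
    return {
        "task_id": task_id,                                            # TASK-001
        "task_id_lower": lower,                                        # task-001 (assumed)
        "task_id_slug": re.sub(r"[^a-z0-9]+", "-", lower).strip("-"),  # task-001 (assumed)
        "task_id_compact": re.sub(r"[^a-z0-9]", "", lower),            # task001
    }


def render_command(command: str, task_id: str) -> str:
    """Replace only known placeholders; other braces are left untouched."""
    rendered = command
    for name, value in task_placeholders(task_id).items():
        rendered = rendered.replace("{" + name + "}", value)
    return rendered


print(render_command("python -m pytest -q tests/test_{task_id_compact}.py", "TASK-001"))
# python -m pytest -q tests/test_task001.py
```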

@@ -141,6 +150,12 @@ Create a local integration sandbox from the NightShift repository root:

python -m nightshift.cli integ-run --template tutorial-pastebin
```

Create, set up, validate, and run one task from the generated project directory:

```bash
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

Set up the generated Python project:

```bash

@@ -161,6 +176,12 @@ Preview commands without running them:

python -m nightshift.cli integ-setup --project integ_runs/<timestamp>/project --dry-run
```

Summarize the latest integration artifact run:

```bash
python -m nightshift.cli integ-report --latest
```

To clean up old sandboxes before creating a new one, keep only the newest three existing runs:

```bash

@@ -169,8 +190,4 @@ python -m nightshift.cli integ-run --template tutorial-pastebin --keep 3

## Pastebin Tutorial

`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, and implementation fallback order:

- `qwen2.5-coder:14b`
- `carstenuhlig/omnicoder-9b`
- `deepseek-coder-v2:16b`

`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, fixed task-specific tests, and a single default `qwen3-coder:30b` model path.

docs/future_ideas.md (Normal file, 17 lines added)

@@ -0,0 +1,17 @@

# Future Ideas
Not to implement until we get successful long-running runs.

## I am realizing "templates" are abstracted from the user
* I think templates will be a first-class citizen: a package for deployments and a harness for performance tests.
* These should live outside `nightshift/project_templates`, as users will likely create their own.
* One solution would be to reference two directories when looking up templates: built-in ones in `nightshift/project_templates`, plus a templates directory that users can define in their NightShift config.
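
The two-directory lookup could be sketched like this. Letting a user-defined template shadow a built-in one of the same name is an assumption, as are the function name and the directory layout in the example:

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_template(name: str, builtin_dir: Path, user_dir: Path | None = None) -> Path:
    """Look up a template, preferring the user's templates directory if configured."""
    search = [d for d in (user_dir, builtin_dir) if d is not None]
    for directory in search:
        candidate = directory / name
        if candidate.is_dir():
            return candidate
    searched = ", ".join(str(d) for d in search)
    raise FileNotFoundError(f"Template {name!r} not found in: {searched}")


with tempfile.TemporaryDirectory() as tmp:
    builtin = Path(tmp, "project_templates")
    user = Path(tmp, "user_templates")
    (builtin / "tutorial-pastebin").mkdir(parents=True)
    (user / "my-experiment").mkdir(parents=True)
    from_builtin = find_template("tutorial-pastebin", builtin, user)
    from_user = find_template("my-experiment", builtin, user)

print(from_builtin.parent.name, from_user.parent.name)  # project_templates user_templates
```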

## nightshift config
* Store user settings in `~/.nightshift/config.yaml`.
* Things like the templates folder (which can also live here).
* Maybe this is later.

## A way to easily make A/B tests to benchmark models?
* Right now I can do this manually. For example, I want to run tutorial-pastebin with `qwen3.6:27b` as the planner and `qwen2.5-coder:14b` as the coder, another run with `qwen3.6:27b` as both, and so on.
* Maybe there is a way to make that easier, possibly by creating a template that can be controlled by a larger multi-run file?
* This is probably for way later.
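
A multi-run file could list model choices per role and expand them into variants. This sketch (hypothetical `model_matrix`) only builds the combinations; writing each variant into a copied template's `agents` section is left out. Role names follow the pastebin template's `planner` and `implementer` agents:

```python
from __future__ import annotations

from itertools import product


def model_matrix(role_models: dict[str, list[str]]) -> list[dict[str, str]]:
    """Expand per-role model choices into one run variant per combination."""
    roles = sorted(role_models)
    choices = [role_models[role] for role in roles]
    return [dict(zip(roles, combo)) for combo in product(*choices)]


variants = model_matrix({
    "planner": ["qwen3.6:27b"],
    "implementer": ["qwen2.5-coder:14b", "qwen3.6:27b"],
})
for variant in variants:
    print(variant)
```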

docs/ideas.md (Normal file, 366 lines added)

@@ -0,0 +1,366 @@

# Ideas TODO

This file is now prioritized inline. Priority scale:

- P0: do next; directly improves current feedback loop
- P1: important after the current loop is usable
- P2: useful, but only after basics are stable
- P3: defer or maybe reject

## P0: Make Integration Tests Easy To Run

Status: implemented.

Implemented command:

```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command.

Original problem: running integration tests was too manual.

Current process:

- install the current version of NightShift
- run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
- copy the activation line from the output and run it
- `cd` into the generated directory
- run the task there, because running from the repo root does not find `nightshift.yaml`

Recommendation: implement a wrapper command, not just a loose script.

Target command:

```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

It should:

1. create the integration run
2. set up the venv
3. install NightShift from the current checkout
4. run `nightshift validate`
5. run the selected task from the generated project directory
6. print final status and artifact path
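
Steps 3 to 5 can be sketched as a list of `(cwd, argv)` pairs, which also makes `--dry-run` trivial. This assumes the sandbox and venv from steps 1 and 2 already exist and that `pip install -e` is how the current checkout gets installed; the function names are hypothetical. Running every project step with `cwd=project_dir` is what lets NightShift find `nightshift.yaml` without a manual `cd`:

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def plan_integ_test(repo_root: Path, project_dir: Path, venv_python: Path,
                    task: str | None) -> list[tuple[Path, list[str]]]:
    """Steps 3-5 as (cwd, argv) pairs; every step runs from project_dir."""
    python = str(venv_python)
    run_args = ["--task", task] if task else ["--all"]
    return [
        # Absolute path, because pip resolves it relative to cwd=project_dir.
        (project_dir, [python, "-m", "pip", "install", "-e", str(repo_root.resolve())]),
        (project_dir, [python, "-m", "nightshift.cli", "validate"]),
        (project_dir, [python, "-m", "nightshift.cli", "run", *run_args]),
    ]


def execute(plan: list[tuple[Path, list[str]]], dry_run: bool = True) -> None:
    for cwd, argv in plan:
        print(f"[cwd={cwd}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


plan = plan_integ_test(Path("repo"), Path("repo/integ_runs/example/project"),
                       Path("venv/bin/python"), "TASK-001")
execute(plan)  # dry run: prints the commands only
```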

Useful variants:

```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --all
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
```

The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly.

## P0/P1: Remove Multi-Candidate Workflow From Default Pastebin

Status: implemented for the default pastebin template and tutorial example.

Original idea:

- The multi-candidate workflow does not add as much as expected.
- Keep it as an example, maybe `example-multiagent`.

Recommendation: yes. Remove it from the default pastebin tutorial.

Reason:

- Pastebin is becoming the reliability harness.
- Multi-candidate fallback makes artifacts harder to reason about.
- It adds model variability while we are still debugging pipeline behavior.

Better split:

```text
tutorial-pastebin
tutorial-pastebin-multiagent
```

or:

```text
examples/templates/multiagent-fallback
```

Default pastebin should be boring:

```text
planner -> semantic_context -> context -> implement -> validate -> test -> review
```

Use one strong implementer first. Add fallback only in a separate experiment template.

## P1: Add A Qwen3 / 30B Pastebin Variant

Status: implemented as the default pastebin model path using `qwen3-coder:30b`.

Original idea:

- Use a non-coder model for planner roles.
- Try `qwen3.6:27b` for planning.
- Use `qwen3-coder:30b` for implementer and code-heavy roles.

Recommendation: viable, but make this a variant, not the default.

kass reply: No, let's make this the default. `qwen3-coder:30b` is fast for me now, for some reason.

Suggested template/config:

```text
tutorial-pastebin-qwen3
```

Possible role split:

- planner: `qwen3.6:27b`
- reviewer/debugger: `qwen3.6:27b`
- implementer: `qwen3-coder:30b` or exact local 30B coder model name

Important: confirm exact model names with:

```powershell
ollama list
```

I did: it's `qwen3-coder:30b`.

Use 30B where it pays:

- first implementation for hard tasks
- repair after concrete test failure
- schema/database changes
- multi-file changes

Do not blindly make every stage 30B if it is slow.

reply: It's not slow now! `qwen3-coder:30b`

## P2: Expose More Model Parameters

Status: implemented for the practical first set.

Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to the existing `temperature`.

Original question:

- What else besides temperature is available?
- Are any worth optimizing?

Likely useful for Ollama:

- `temperature`
- `num_ctx`
- `num_predict`
- `seed`
- `stop`
- maybe `top_p`, `top_k`, `repeat_penalty`

Recommendation: add only a small practical set first.

Useful config shape:

```yaml
temperature: 0.1
num_ctx: 8192
num_predict: 4096
seed: 1
```

Most useful:

- `num_ctx`: larger repo/task context
- `num_predict`: caps runaway output
- `seed`: reproducibility, if supported consistently
- `temperature`: already useful; keep low for code
- `stop`: could help enforce file-block or diff-only contracts

Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it.

reply: Yup, let's put this in `nightshift.yaml` as optional parameters. If they aren't in there, that's fine, but we should offer them.

## P1: Add Test Governance For Generated Tests

Original idea:

- Have a test governance layer for when agents write tests.
- A reviewer validates alignment with acceptance criteria.

Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.

The previous failures showed that test-writing agents will:

- edit app code
- import nonexistent modules
- require undeclared dependencies
- inspect implementation internals
- write tests for future behavior

Governance should be deterministic first, model-reviewed second.

Deterministic checks:

- test-writing stage may only touch `tests/`
- tests compile
- tests import only allowed public interfaces
- tests do not import undeclared dependencies
- tests do not define Flask routes or app implementation
- test names match current task id or current artifact
- no future-task keywords unless accepted by the current task's acceptance criteria

Then optional model reviewer checks acceptance-criteria alignment.

## P2: Add A Test Analyzer Agent For TDD

Original idea:

- Analyze tests.
- Translate them into direct instructions for the implementer.
- Maybe implement using agent YAML definitions without new NightShift features.

Recommendation: viable, but defer until generated tests are stable.

Possible pipeline:

```text
write_tests -> validate_tests -> analyze_tests -> implement
```

Analyzer output should be concrete:

```text
Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.

Do not modify:
- tests/test_task001.py
```

This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.

## P2/P3: Add A Test Planner

Original idea:

- A test planner understands acceptance criteria and code.
- It provides input to the next stage about constraints and code, especially for non-TDD.

Recommendation: maybe, but defer.

This overlaps with:

- planner
- test analyzer
- test governance

Too many planning-ish stages can make the pipeline bloated and contradictory.

If implemented later, keep it focused:

```text
test_planner -> write_tests -> test_governance -> implement
```

For now, fold this idea into the future test governance/analyzer work.

## P1: Add Fixed Tests For All Pastebin Tasks

Status: mostly implemented in the template.

Current fixed tests:

```text
tests/test_task001.py
tests/test_task002.py
tests/test_task003.py
tests/test_task004.py
tests/test_task005.py
```

Important design:

```yaml
python -m pytest -q tests/test_{task_id_compact}.py
```

This lets all future task tests exist without breaking earlier tasks.

Next step: validate these through integration runs, one task at a time.

## P1: Add `nightshift integ-report`

Status: implemented as a first-pass artifact summarizer.

New idea.

Summarize the latest integration run across tasks:

```text
TASK-001 complete in 1 retry
TASK-002 failed at validate_patch
Root cause: protected tests modified
Artifacts: ...
```

Right now we inspect artifacts manually. NightShift should do more of that.

Possible command:

```powershell
python -m nightshift.cli integ-report --latest
```
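
A sketch of the formatting half of `integ-report`, reproducing the target output from a hypothetical `TaskResult` record. Reading real artifacts into these records is omitted because the `integ_runs/` layout is not shown here, and this is not the code behind the implemented command:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task summary; the real artifact layout may differ.
    task_id: str
    status: str  # "complete" or "failed"
    retries: int = 0
    failed_stage: str | None = None
    root_cause: str | None = None
    artifacts: str | None = None


def format_report(results: list[TaskResult]) -> str:
    lines: list[str] = []
    for result in results:
        if result.status == "complete":
            noun = "retry" if result.retries == 1 else "retries"
            lines.append(f"{result.task_id} complete in {result.retries} {noun}")
        else:
            stage = result.failed_stage or "unknown stage"
            lines.append(f"{result.task_id} failed at {stage}")
            if result.root_cause:
                lines.append(f"Root cause: {result.root_cause}")
        if result.artifacts:
            lines.append(f"Artifacts: {result.artifacts}")
    return "\n".join(lines)


summary = format_report([
    TaskResult("TASK-001", "complete", retries=1),
    TaskResult("TASK-002", "failed", failed_stage="validate_patch",
               root_cause="protected tests modified"),
])
print(summary)
```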

## P1: Add Task-Test Preflight To `validate`

Status: implemented.

`nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing.

Before this change, it was only partially implemented at run time.

Current behavior:

- task command placeholders can render paths like `tests/test_task002.py`
- `run_task` preflight fails before invoking agents if the task-specific test file is missing

Better behavior:

```powershell
nightshift validate
```

should warn or fail:

```text
TASK-003 expects tests/test_task003.py and it exists.
TASK-004 expects tests/test_task004.py and it exists.
```

This catches missing fixed tests earlier.

## P2: Add Run Comparison

New idea.

Useful once comparing 14B vs 30B:

```powershell
nightshift compare-runs --latest 5
```

Show:

- model
- task
- retries
- failure stage
- final reason
- runtime
- token estimate

This should come after `integ-test` and `integ-report`.

@@ -1,4 +1,4 @@

# Tutorial 03: Pastebin With Model Fallback And Telemetry
# Tutorial 03: Pastebin With Fixed Tests And Telemetry

This tutorial uses the `tutorial-pastebin` template: a small Flask snippet-hosting service designed for deterministic NightShift orchestration tests.

@@ -19,6 +19,12 @@ For an isolated local integration run, use the integration sandbox command from

python -m nightshift.cli integ-run --template tutorial-pastebin
```

To create, set up, validate, and run one task in a single command:

```bash
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

To create the sandbox and set up the Python project immediately:

```bash

@@ -57,7 +63,7 @@ pyproject.toml

README.md
```

The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed `TASK-001` tests. The default tutorial pipeline asks the implementation agent to make those deterministic tests pass before review.
The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed tests for each tutorial task. The default tutorial pipeline asks the implementation agent to make only the current task's deterministic tests pass before review.

## Prerequisites

@@ -73,26 +79,22 @@ Install target dependencies:

python -m pip install -e . pytest flask
```

Install and start Ollama, then pull the fallback models you want available:
Install and start Ollama, then pull the default pastebin model:

```bash
ollama pull qwen2.5-coder:14b
ollama pull carstenuhlig/omnicoder-9b
ollama pull deepseek-coder-v2:16b
ollama pull qwen3-coder:30b
ollama list
```

NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`.

## Model Fallback
## Model

The implementation stage uses this fallback order:
The default pastebin pipeline uses one strong local coder model:

1. `qwen2.5-coder:14b`
2. `carstenuhlig/omnicoder-9b`
3. `deepseek-coder-v2:16b`
- `qwen3-coder:30b`

NightShift records which agent/model handled each stage in `telemetry-summary.md`.
NightShift records which agent/model handled each stage in `telemetry-summary.md`. Multi-candidate fallback belongs in a separate experiment template, not the default pastebin reliability harness.

## TDD Pipeline

@@ -20,51 +20,49 @@ safety:

- curl | bash

experiment:
  label: pastebin-model-fallback
  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
  label: pastebin-qwen3-coder
  prompt_variant: fixed-tests-qwen3-coder-30b-v1

agents:
  planner:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.2
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/planner.md

  implementer_qwen:
  implementer:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/implementer.md

  test_writer:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/test-writer.md

  implementer_omnicoder:
    backend: ollama
    model: carstenuhlig/omnicoder-9b
    temperature: 0.1
    system_prompt: .nightshift/agents/implementer.md

  implementer_deepseek:
    backend: ollama
    model: deepseek-coder-v2:16b
    temperature: 0.1
    system_prompt: .nightshift/agents/implementer.md

  debugger:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    role: debugger
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/debugger.md

  reviewer:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/reviewer.md

pipeline:

@@ -87,10 +85,7 @@ pipeline:

- id: implement
  type: file_writer
  agent_pool:
    - implementer_qwen
    - implementer_omnicoder
    - implementer_deepseek
  agent: implementer
  output: proposed.patch

- id: normalize

@@ -228,8 +228,9 @@ class AgentExecutor:

    "prompt": prompt,
    "stream": False,
}
if agent.temperature is not None:
    body["options"] = {"temperature": agent.temperature}
options = _ollama_options(agent)
if options:
    body["options"] = options
headers = {"Content-Type": "application/json"}
started = time.monotonic()
self.logger.event(

@@ -395,6 +396,21 @@ def build_prompt_bundle(

)


def _ollama_options(agent: AgentConfig) -> dict[str, object]:
    options: dict[str, object] = {}
    if agent.temperature is not None:
        options["temperature"] = agent.temperature
    if agent.num_ctx is not None:
        options["num_ctx"] = agent.num_ctx
    if agent.num_predict is not None:
        options["num_predict"] = agent.num_predict
    if agent.seed is not None:
        options["seed"] = agent.seed
    if agent.stop:
        options["stop"] = list(agent.stop)
    return options


def _coerce_output(value: str | bytes | None) -> str:
    if value is None:
        return ""

@ -7,13 +7,16 @@ from pathlib import Path
|
|||
import sys
|
||||
|
||||
from .config import validate_config
|
||||
from .errors import NightShiftError
|
||||
from .errors import ConfigError, NightShiftError
|
||||
from .init import available_templates, init_project
|
||||
from .integ import create_integration_run
|
||||
from .integ_report import build_integration_report, format_integration_report
|
||||
from .integ_setup import format_setup_result, setup_python_project
|
||||
from .integ_test import format_integration_test_result, run_integration_test
|
||||
from .pipeline import PipelineRunner
|
||||
from .runlog import RunLogger
|
||||
from .status import build_status, format_status
|
||||
from .task_tests import check_task_test_files, format_task_test_checks, missing_task_test_paths
|
||||
from .terminal import HOTDOG_ANIMATIONS, TerminalAnimation, format_banner, style_text
|
||||
from .tasks import (
|
||||
ensure_dependencies_satisfied,
|
||||
|
|
@@ -105,6 +108,33 @@ def build_parser() -> argparse.ArgumentParser:
        help="Print --setup commands without running them.",
    )

    integ_test_parser = subparsers.add_parser(
        "integ-test",
        help="Create, set up, validate, and run an integration template task.",
    )
    integ_test_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is created.")
    integ_test_parser.add_argument(
        "--template",
        default="tutorial-pastebin",
        choices=available_templates(),
        help="Template to initialize inside the sandbox.",
    )
    integ_test_parser.add_argument("--task", help="Specific task id to run.")
    integ_test_parser.add_argument("--all", action="store_true", help="Run all runnable incomplete tasks.")
    integ_test_parser.add_argument("--keep", type=int, help="Keep only the newest N old integration runs before creating a new one.")
    integ_test_parser.add_argument(
        "--setup-extra",
        action="append",
        default=["pytest"],
        help="Extra package to install during setup. May be repeated. Defaults to pytest.",
    )
    integ_test_parser.add_argument("--setup-skip-validate", action="store_true", help="Skip validation during setup.")
    integ_test_parser.add_argument("--dry-run", action="store_true", help="Print commands without running setup or tasks.")

    integ_report_parser = subparsers.add_parser("integ-report", help="Summarize the latest integration run.")
    integ_report_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is located.")
    integ_report_parser.add_argument("--latest", action="store_true", help="Report the latest integration run.")

    setup_parser = subparsers.add_parser(
        "integ-setup",
        help="Set up a Python integration project venv and dependencies.",

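The `--setup-extra` flag combines `action="append"` with `default=["pytest"]`. In argparse, values given on the command line are appended after a non-empty default, so pytest is always installed and any extras are added alongside it rather than replacing it. Separately, `run_integration_test` rejects `--task` together with `--all` at run time; argparse's `add_mutually_exclusive_group(required=True)` could enforce the same rule at parse time. The sketch uses a hypothetical parser (`build_sketch_parser`) that mirrors only a subset of the `integ-test` flags.

```python
import argparse


def build_sketch_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring a subset of the integ-test flags.
    parser = argparse.ArgumentParser(prog="integ-test-sketch")
    selection = parser.add_mutually_exclusive_group(required=True)
    selection.add_argument("--task", help="Specific task id to run.")
    selection.add_argument("--all", action="store_true", help="Run all runnable incomplete tasks.")
    parser.add_argument("--setup-extra", action="append", default=["pytest"])
    return parser


parser = build_sketch_parser()

# With action="append", user values are appended after the default list.
args = parser.parse_args(["--task", "TASK-001", "--setup-extra", "flask"])
print(args.setup_extra)  # ['pytest', 'flask']

# Passing both selectors (or neither) makes argparse print usage and exit with status 2.
try:
    parser.parse_args(["--task", "TASK-001", "--all"])
except SystemExit as exc:
    print("rejected with exit code", exc.code)
```

Enforcing the rule in the parser gives a usage message before any sandbox is created; keeping the runtime check as well still protects direct callers of `run_integration_test`.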
@@ -160,12 +190,18 @@ def main(argv: list[str] | None = None) -> int:
            config = validate_config(args.config)
            tasks = parse_task_file(config.project.root, config.project.task_file)
            validate_task_dependencies(tasks)
            task_test_checks = check_task_test_files(config, tasks)
            missing_task_tests = missing_task_test_paths(task_test_checks)
            if missing_task_tests:
                details = format_task_test_checks(task_test_checks)
                raise ConfigError(f"Config error: missing configured task test files.\n{details}")
            incomplete = sum(1 for task in tasks if not task.completed)
            print(f"Config valid: {config.path}")
            print(f"Project: {config.project.name}")
            print(f"Stages: {len(config.pipeline.stages)}")
            print(f"Tasks: {len(tasks)}")
            print(f"Incomplete tasks: {incomplete}")
            print(format_task_test_checks(task_test_checks))
            return 0

        if args.command == "run":

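`nightshift validate` now checks that every task-specific test path referenced by a command stage exists, and raises `ConfigError` with the full listing when any are missing. A minimal sketch of that check, assuming the paths have already been rendered and extracted per task. `TaskTestCheckSketch`, `check_paths`, and `validate_exit_code` are hypothetical names, and the exit code of `1` assumes the error is handled like the `except NightShiftError` branch in `main`.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class TaskTestCheckSketch:
    task_id: str
    path: str
    exists: bool


def check_paths(root: Path, paths_by_task: dict[str, list[str]]) -> tuple[TaskTestCheckSketch, ...]:
    # Assumes test paths were already extracted from each task's rendered commands.
    checks: list[TaskTestCheckSketch] = []
    for task_id, paths in paths_by_task.items():
        for path in dict.fromkeys(paths):  # de-duplicate while keeping first-seen order
            checks.append(TaskTestCheckSketch(task_id, path, (root / path).exists()))
    return tuple(checks)


def validate_exit_code(checks: tuple[TaskTestCheckSketch, ...]) -> int:
    # Any missing file fails validation before a single model call is made.
    return 1 if any(not check.exists for check in checks) else 0


with tempfile.TemporaryDirectory() as directory:
    root = Path(directory)
    (root / "tests").mkdir()
    (root / "tests" / "test_task001.py").write_text("def test_ok():\n    pass\n", encoding="utf-8")
    checks = check_paths(root, {
        "TASK-001": ["tests/test_task001.py"],
        "TASK-002": ["tests/test_task002.py"],
    })
    for check in checks:
        print(f"- {check.task_id}: {check.path} ({'ok' if check.exists else 'missing'})")
    print("exit code:", validate_exit_code(checks))
```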
@@ -256,6 +292,25 @@ def main(argv: list[str] | None = None) -> int:
            print(format_setup_result(result))
            return 0

        if args.command == "integ-test":
            result = run_integration_test(
                args.root,
                template=args.template,
                task=args.task,
                all_tasks=args.all,
                keep=args.keep,
                setup_extras=tuple(args.setup_extra or ()),
                skip_setup_validate=args.setup_skip_validate,
                dry_run=args.dry_run,
            )
            print(format_integration_test_result(result))
            return result.exit_code

        if args.command == "integ-report":
            report = build_integration_report(args.root, latest=True)
            print(format_integration_report(report))
            return 0

    except NightShiftError as exc:
        print(str(exc), file=sys.stderr)
        return 1

@@ -5,6 +5,7 @@ from __future__ import annotations
from dataclasses import dataclass
import os
from pathlib import Path
import re
import shlex
import subprocess
import sys

@@ -68,11 +69,16 @@ class CommandExecutor:
                command_index=index,
                command=command,
            )
            rendered_command = render_command_template(command, task_id)
            rendered_allowed_commands = tuple(
                render_command_template(allowed, task_id) for allowed in self.safety.allowed_commands
            )
            run = self.run_command(
                command,
                rendered_command,
                shell=stage.shell,
                timeout_seconds=stage.timeout_seconds,
                working_dir=stage.working_dir,
                allowed_commands=rendered_allowed_commands,
            )
            runs.append(run)
            self.logger.event(

@@ -120,11 +126,12 @@ class CommandExecutor:
        shell: bool = True,
        timeout_seconds: int | None = None,
        working_dir: Path | None = None,
        allowed_commands: tuple[str, ...] | None = None,
    ) -> CommandRun:
        try:
            normalized = ensure_command_allowed(
                command,
                self.safety.allowed_commands,
                allowed_commands if allowed_commands is not None else self.safety.allowed_commands,
                self.safety.forbidden_commands,
            )
        except SafetyError as exc:

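`run_stage` now renders `{task_id_compact}` and the other placeholders in both the command and every allowlist entry before calling `ensure_command_allowed`. The matching rule inside `ensure_command_allowed` is not shown in this chunk, so the sketch assumes a hypothetical token-prefix rule (`is_allowed`) purely to illustrate why an unrendered allowlist entry can never match a rendered command.

```python
import shlex


def is_allowed(command: str, allowed: tuple[str, ...]) -> bool:
    # Hypothetical rule: the command's tokens must begin with an allowlist entry's tokens.
    tokens = shlex.split(command)
    for entry in allowed:
        entry_tokens = shlex.split(entry)
        if entry_tokens and tokens[: len(entry_tokens)] == entry_tokens:
            return True
    return False


template = "python -m pytest -q tests/test_{task_id_compact}.py"
# What the command stage would actually run for TASK-002.
command = template.format(task_id_compact="task002")

raw_allowlist = (template,)
rendered_allowlist = tuple(entry.format(task_id_compact="task002") for entry in raw_allowlist)

print(is_allowed(command, raw_allowlist))  # False: the entry still contains "{task_id_compact}"
print(is_allowed(command, rendered_allowlist))  # True
```

Rendering the allowlist with the same task id as the command keeps templated allowlist entries usable without widening them to a wildcard.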
@@ -210,6 +217,27 @@ def format_command_runs(stage_id: str, runs: list[CommandRun]) -> str:
    return "\n".join(lines)


def render_command_template(command: str, task_id: str) -> str:
    task_id_lower = task_id.lower()
    task_id_slug = task_id_lower.replace("-", "_")
    task_id_compact = task_id_lower.replace("-", "")
    return command.format(
        task_id=task_id,
        task_id_lower=task_id_lower,
        task_id_slug=task_id_slug,
        task_id_compact=task_id_compact,
    )


def extract_test_file_paths(command: str) -> tuple[str, ...]:
    paths: list[str] = []
    for match in re.finditer(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)", command):
        path = match.group(1).replace("\\", "/")
        if path not in paths:
            paths.append(path)
    return tuple(paths)


def _coerce_output(value: str | bytes | None) -> str:
    if value is None:
        return ""

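The two helpers above expand four task-id placeholders and then pull `tests/...py` paths out of a rendered command. A standalone sketch reusing the same regular expression shows the expected results. It also highlights two consequences of the design: Windows separators are normalized while `pkg/tests/...` is skipped by the lookbehind, and, because rendering uses `str.format`, literal braces in a command must be doubled (`{{` and `}}`) or formatting fails. The names `render` and `extract` are illustrative only.

```python
import re

# Same pattern as extract_test_file_paths in the diff.
TEST_PATH = re.compile(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)")


def render(command: str, task_id: str) -> str:
    # Same four placeholders as render_command_template.
    lower = task_id.lower()
    return command.format(
        task_id=task_id,
        task_id_lower=lower,
        task_id_slug=lower.replace("-", "_"),
        task_id_compact=lower.replace("-", ""),
    )


def extract(command: str) -> tuple[str, ...]:
    # Normalize Windows separators and keep first-seen order without duplicates.
    paths = [match.group(1).replace("\\", "/") for match in TEST_PATH.finditer(command)]
    return tuple(dict.fromkeys(paths))


print(render("pytest tests/test_{task_id_compact}.py -k {task_id_slug}", "TASK-010"))
print(extract(r"pytest tests\test_a.py tests/test_a.py pkg/tests/test_b.py"))
# Literal braces must be doubled, or str.format treats them as placeholders.
print(render('python -c "print({{1: 2}})"', "TASK-010"))
```

A command containing an unknown placeholder such as `{unknown}` raises `KeyError` during rendering, so a typo in a configured command surfaces immediately rather than being passed to the shell.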
@@ -46,6 +46,10 @@ class AgentConfig:
    temperature: float | None = None
    base_url: str | None = None
    api_key_env: str | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


@dataclass(frozen=True)

@@ -207,10 +211,18 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
            agent_raw.get("temperature"),
            f"agents.{agent_id}.temperature",
        )
        num_ctx = _optional_int_or_none(agent_raw.get("num_ctx"), f"agents.{agent_id}.num_ctx")
        num_predict = _optional_int_or_none(agent_raw.get("num_predict"), f"agents.{agent_id}.num_predict")
        seed = _optional_int_or_none(agent_raw.get("seed"), f"agents.{agent_id}.seed")
        stop = _string_tuple(agent_raw.get("stop", []), f"agents.{agent_id}.stop")
        if temperature is not None and temperature < 0:
            raise ConfigError(
                f"Config error: agents.{agent_id}.temperature must be zero or greater."
            )
        if num_ctx is not None and num_ctx <= 0:
            raise ConfigError(f"Config error: agents.{agent_id}.num_ctx must be greater than zero.")
        if num_predict is not None and num_predict <= 0:
            raise ConfigError(f"Config error: agents.{agent_id}.num_predict must be greater than zero.")
        if backend not in {"command", "ollama", "openai_compatible"}:
            raise ConfigError(
                f"Config error: agent '{agent_id}' uses unsupported backend '{backend}'. "

@@ -243,6 +255,10 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
            temperature=temperature,
            base_url=base_url,
            api_key_env=api_key_env,
            num_ctx=num_ctx,
            num_predict=num_predict,
            seed=seed,
            stop=stop,
        )

    experiment_raw = raw.get("experiment", {})

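The parser validates the new fields through `_optional_int_or_none` and `_string_tuple`, whose bodies are outside this chunk. The sketch below is a hypothetical equivalent (`positive_int_or_none`, `string_tuple`, and `ConfigErrorSketch` are not NightShift names). It mirrors the diff's rule that `num_ctx` and `num_predict` must be greater than zero, while `seed` is left unconstrained, and it adds one guard worth considering: `bool` is a subclass of `int` in Python, so a YAML value of `true` would pass a plain `isinstance(value, int)` check.

```python
from __future__ import annotations

from typing import Any


class ConfigErrorSketch(ValueError):
    """Hypothetical stand-in for NightShift's ConfigError."""


def positive_int_or_none(value: Any, field: str) -> int | None:
    if value is None:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ConfigErrorSketch(f"Config error: {field} must be an integer.")
    if value <= 0:
        raise ConfigErrorSketch(f"Config error: {field} must be greater than zero.")
    return value


def string_tuple(value: Any, field: str) -> tuple[str, ...]:
    # Accept a YAML list of strings, such as stop sequences.
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ConfigErrorSketch(f"Config error: {field} must be a list of strings.")
    return tuple(value)


raw_agent = {"num_ctx": 8192, "num_predict": 4096, "stop": ["</answer>"]}
print(positive_int_or_none(raw_agent.get("num_ctx"), "agents.planner.num_ctx"))
print(string_tuple(raw_agent.get("stop", []), "agents.planner.stop"))
```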
71
nightshift/integ_report.py
Normal file
@@ -0,0 +1,71 @@
"""Summarize integration run artifacts."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
import re

from .errors import NightShiftError


@dataclass(frozen=True)
class IntegrationReport:
    integration_run: Path
    nightshift_run: Path | None
    lines: tuple[str, ...]


def build_integration_report(root: str | Path = ".", *, latest: bool = True) -> IntegrationReport:
    base = Path(root).resolve() / "integ_runs"
    if not base.exists():
        raise NightShiftError(f"Integration report error: no integ_runs directory found: {base}")
    runs = sorted((path for path in base.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
    if not runs:
        raise NightShiftError(f"Integration report error: no integration runs found under: {base}")
    integration_run = runs[0] if latest else runs[0]
    artifacts_root = integration_run / "project" / ".nightshift" / "runs"
    if not artifacts_root.exists():
        return IntegrationReport(
            integration_run,
            None,
            ("No NightShift run artifacts found. Setup may have failed before task execution.",),
        )
    nightshift_runs = sorted((path for path in artifacts_root.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
    if not nightshift_runs:
        return IntegrationReport(integration_run, None, ("No NightShift run directories found.",))
    nightshift_run = nightshift_runs[0]
    summaries = sorted(nightshift_run.glob("tasks/*/run-summary.md"))
    if not summaries and (nightshift_run / "run-summary.md").exists():
        summaries = [nightshift_run / "run-summary.md"]
    lines = [_summarize_run_summary(path, integration_run) for path in summaries]
    return IntegrationReport(integration_run, nightshift_run, tuple(lines or ("No task summaries found.",)))


def format_integration_report(report: IntegrationReport) -> str:
    lines = [f"Integration run: {report.integration_run}"]
    if report.nightshift_run is not None:
        lines.append(f"NightShift run: {report.nightshift_run}")
    lines.append("")
    lines.extend(f"- {line}" for line in report.lines)
    return "\n".join(lines)


def _summarize_run_summary(path: Path, integration_run: Path) -> str:
    text = path.read_text(encoding="utf-8", errors="replace")
    task = _field(text, "Task") or path.parent.name
    status = _field(text, "Status") or "unknown"
    retries = _field(text, "Retry count") or "unknown"
    reason = _field(text, "Reason") or "no reason recorded"
    try:
        relative = path.relative_to(integration_run)
    except ValueError:
        relative = path
    return f"{task} {status} after {retries} retries. Reason: {reason}. Artifacts: {relative.parent}"


def _field(text: str, name: str) -> str | None:
    match = re.search(rf"^- {re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE)
    if not match:
        return None
    return match.group(1).strip()

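`integ-report` picks the newest directory by sorting names in reverse and then scrapes `- Name: value` bullets out of each `run-summary.md`. The sketch applies the same `_field` pattern to a made-up summary; the field values and the timestamp-style directory names are hypothetical. It also strips a trailing period from the reason, because the format string in the diff appends `. Artifacts:` after the reason and would print `..` for reasons that already end with a period.

```python
from __future__ import annotations

import re

SUMMARY = """# Run Summary

- Task: TASK-003
- Status: failed
- Retry count: 2
- Reason: Stage 'test' failed.
"""


def field(text: str, name: str) -> str | None:
    # Same pattern as _field: a "- Name: value" bullet at the start of a line.
    match = re.search(rf"^- {re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def summarize(text: str, fallback_task: str) -> str:
    task = field(text, "Task") or fallback_task
    status = field(text, "Status") or "unknown"
    retries = field(text, "Retry count") or "unknown"
    reason = field(text, "Reason") or "no reason recorded"
    # Avoid a doubled period when the recorded reason already ends with one.
    return f"{task} {status} after {retries} retries. Reason: {reason.rstrip('.')}."


def latest(names: list[str]) -> str:
    # Reverse lexicographic order equals newest-first only for zero-padded timestamps.
    return sorted(names, reverse=True)[0]


print(summarize(SUMMARY, "TASK-000"))
print(latest(["20260613-221500", "20260614-090000", "20260612-235959"]))
```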
71
nightshift/integ_test.py
Normal file
@@ -0,0 +1,71 @@
"""End-to-end integration test wrapper."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
import subprocess

from .errors import NightShiftError
from .integ import IntegrationRun, create_integration_run
from .integ_setup import IntegrationSetupResult, setup_python_project


@dataclass(frozen=True)
class IntegrationTestResult:
    run: IntegrationRun
    setup: IntegrationSetupResult
    command: tuple[str, ...]
    exit_code: int
    dry_run: bool


def run_integration_test(
    root: str | Path = ".",
    *,
    template: str = "tutorial-pastebin",
    task: str | None = None,
    all_tasks: bool = False,
    keep: int | None = None,
    setup_extras: tuple[str, ...] = ("pytest",),
    skip_setup_validate: bool = False,
    dry_run: bool = False,
) -> IntegrationTestResult:
    if task and all_tasks:
        raise NightShiftError("Integration test error: use either --task or --all, not both.")
    if not task and not all_tasks:
        raise NightShiftError("Integration test error: provide --task or --all.")

    run = create_integration_run(Path(root), template=template, keep=keep)
    project = run.directory / "project"
    setup = setup_python_project(
        project,
        extras=setup_extras,
        validate=not skip_setup_validate,
        dry_run=dry_run,
    )
    command = [str(setup.python), "-m", "nightshift.cli", "run", "--no-animation"]
    if all_tasks:
        command.append("--all")
    else:
        command.extend(["--task", task or ""])

    exit_code = 0
    if not dry_run:
        completed = subprocess.run(command, cwd=project, text=True, encoding="utf-8", errors="replace")
        exit_code = completed.returncode
    return IntegrationTestResult(run, setup, tuple(command), exit_code, dry_run)


def format_integration_test_result(result: IntegrationTestResult) -> str:
    lines = [
        f"Integration run: {result.run.directory}",
        f"Project: {result.run.directory / 'project'}",
        f"Venv: {result.run.venv_dir}",
        f"Run command: {' '.join(result.command)}",
        f"Exit code: {result.exit_code}",
        f"Artifacts: {result.run.directory / 'project' / '.nightshift'}",
    ]
    if result.dry_run:
        lines.insert(3, "Dry run: true")
    return "\n".join(lines)

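`format_integration_test_result` prints the run command with `' '.join(result.command)`. The subprocess call itself is safe because it passes an argument list without a shell, but the printed line cannot be pasted back into a terminal if the venv path contains spaces. The standard library's `shlex.join` (Python 3.8+) quotes each argument for POSIX shells; the interpreter path below is hypothetical.

```python
import shlex

command = ["/tmp/integ runs/venv/bin/python", "-m", "nightshift.cli", "run", "--no-animation", "--task", "TASK-001"]

# " ".join loses the argument boundary inside a path that contains a space.
print(" ".join(command))
# shlex.join quotes only the arguments that need it.
print(shlex.join(command))
# Round-tripping through shlex.split recovers the original argv.
print(shlex.split(shlex.join(command)) == command)
```

`shlex.join` follows POSIX quoting rules, so it does not produce a valid command line for Windows `cmd.exe`.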
@@ -9,7 +9,7 @@ import subprocess

from .agents import AgentExecutor
from .artifacts import ArtifactStore
from .commands import CommandExecutor
from .commands import CommandExecutor, extract_test_file_paths, render_command_template
from .config import COMMAND_STAGE_TYPES, NightShiftConfig, StageConfig
from .context import ContextManager
from .dependencies import diagnose_python_dependencies, format_dependency_diagnostic

@@ -145,6 +145,12 @@ class PipelineRunner:
        index = 0
        final_status = "complete"
        final_reason = "Pipeline completed."
        preflight_result = self._preflight_task(task, stages)
        if preflight_result:
            stage_results.append(preflight_result)
            final_status = "failed"
            final_reason = preflight_result.reason
            index = len(stages)

        while index < len(stages):
            stage = stages[index]

@@ -248,6 +254,13 @@ class PipelineRunner:
                    "retry-memory.md",
                    summarize_retry_memory(tuple(retry_memory)),
                )
                if _repeated_protected_path_violation(tuple(retry_memory)):
                    final_status = "failed"
                    final_reason = (
                        "Escalation policy stopped retries: implementation repeatedly "
                        "attempted to modify paths outside the stage allowlist."
                    )
                    break
                decision = evaluate_retry_churn(
                    tuple(retry_memory),
                    retry_budget=self.config.pipeline.max_task_retries + 1,

@@ -334,6 +347,45 @@ class PipelineRunner:
            reason=final_reason,
        )

    def _preflight_task(self, task: Task, stages: list[StageConfig]) -> StageResult | None:
        missing_paths: list[str] = []
        for stage in stages:
            if stage.type not in COMMAND_STAGE_TYPES:
                continue
            for command in stage.commands:
                rendered = render_command_template(command, task.id)
                for path_text in extract_test_file_paths(rendered):
                    if not (self.config.project.root / path_text).exists():
                        missing_paths.append(path_text)
        if not missing_paths:
            return None
        unique_paths = tuple(dict.fromkeys(missing_paths))
        details = "\n".join(f"- `{path}`" for path in unique_paths)
        output_path = self.artifacts.write_stage_output(
            task.id,
            "preflight.md",
            "\n".join(
                [
                    "# Task Preflight",
                    "",
                    "Status: fail",
                    "Reason: configured task test file is missing.",
                    "",
                    "## Missing Files",
                    "",
                    details,
                    "",
                ]
            ),
        )
        return StageResult(
            "preflight",
            "fail",
            "Task preflight failed: configured task test file is missing: "
            + ", ".join(unique_paths),
            output_path=str(output_path.relative_to(self.config.project.root)),
        )

    def run_tasks(self, tasks: list[Task] | tuple[Task, ...]) -> MultiTaskResult:
        self.artifacts.initialize_run()
        self.logger.bind(self.artifacts)

@@ -1428,6 +1480,18 @@ def _extract_exit_code(text: str) -> int | None:
    return None


def _repeated_protected_path_violation(entries: tuple[RetryMemoryEntry, ...]) -> bool:
    recent = entries[-2:]
    if len(recent) < 2:
        return False
    return all(_is_protected_path_violation(entry.cause) for entry in recent)


def _is_protected_path_violation(text: str) -> bool:
    lowered = text.lower()
    return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/")


def format_aggregate_run_summary(results: list[PipelineResult], status: str, reason: str) -> str:
    lines = [
        "# Run Summary",

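The escalation rule stops retrying once the two most recent retry-memory entries both report a write under `tests/` that the stage allowlist rejected. The sketch exercises that rule with a stand-in entry type; `RetryEntrySketch` and the cause strings are hypothetical, since the real `RetryMemoryEntry` and the exact error wording are not shown in this chunk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RetryEntrySketch:
    """Hypothetical stand-in for RetryMemoryEntry; only `cause` is used."""

    cause: str


def is_protected_path_violation(text: str) -> bool:
    lowered = text.lower()
    return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/")


def repeated_violation(entries: tuple[RetryEntrySketch, ...]) -> bool:
    # Escalate only when the two most recent attempts both tried to touch tests/.
    recent = entries[-2:]
    return len(recent) == 2 and all(is_protected_path_violation(entry.cause) for entry in recent)


blocked = RetryEntrySketch("Write to tests\\test_task001.py is not allowed for this stage.")
other = RetryEntrySketch("pytest failed: 1 assertion error")

print(repeated_violation((blocked,)))  # False: only one attempt so far
print(repeated_violation((blocked, blocked)))  # True: stop retrying
print(repeated_violation((blocked, other, blocked)))  # False: not consecutive
```

Requiring two consecutive violations tolerates a single mistaken write while still ending runs where the model keeps editing the fixed tests instead of the implementation.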
@@ -1,9 +1,11 @@

You are the debugger agent for the NightShift pastebin tutorial.

Diagnose failed attempts without editing files.
Distinguish inaccurate generated tests from implementation bugs.
If tests are inaccurate for the current task, recommend retrying `write_tests`.
Distinguish fixed-test/template problems from implementation bugs.
This tutorial uses fixed task tests and task-specific pytest commands. Do not recommend `write_tests` unless the configured pipeline actually has a `write_tests` stage.
If a current task appears to lack tests, report a template or test-selection problem.
If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified.
Implementation agents must not edit files under `tests/`.
Return:
- concise diagnosis
- recommended next action

@@ -7,8 +7,10 @@ Do not add behavior for future tasks unless needed to satisfy the current tests.
Use Flask and `sqlite3` from the Python standard library. Do not use SQLAlchemy, Flask-SQLAlchemy, or undeclared dependencies.
Keep the public package name `pastebin_app`.
Keep the public app entry point `create_app(database_path: str | None = None)`.
Respect `database_path`; do not hard-code `snippets.db` when a database path is supplied.
Tests should interact through HTTP routes and `create_app`, not through ORM/session globals.
Do not use `app.before_first_request`; recent Flask versions removed it. Initialize required database tables inside `create_app` or inside the route helper before use.
When adding columns to an existing sqlite table, handle existing databases idempotently with `ALTER TABLE` checks or a simple migration helper. `CREATE TABLE IF NOT EXISTS` does not add columns to an existing table.

Output only complete file content blocks.
Use one fenced block per file:

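The implementer prompt asks for idempotent column additions because `CREATE TABLE IF NOT EXISTS` leaves an existing table untouched. One way to do that with the standard-library `sqlite3` module is to read the current columns with `PRAGMA table_info` and add only the missing ones. The sketch also includes an expiry filter in the spirit of the fixed listing tests (expired rows hidden, newest first). The column names come from those tests; `migrate`, `list_active`, storing `tags` as TEXT (for example JSON), and comparing ISO-8601 strings as text are assumptions, not the tutorial's reference implementation.

```python
import sqlite3
from datetime import datetime

# Columns added by later tasks; order is preserved when they are appended.
NEW_COLUMNS = {"language": "TEXT", "tags": "TEXT", "expires_at": "TEXT"}


def migrate(conn: sqlite3.Connection) -> None:
    # CREATE TABLE IF NOT EXISTS is a no-op on an old table, so add columns explicitly.
    conn.execute("CREATE TABLE IF NOT EXISTS snippets (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, body TEXT)")
    existing = {row[1] for row in conn.execute("PRAGMA table_info(snippets)")}
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE snippets ADD COLUMN {name} {sql_type}")
    conn.commit()


def list_active(conn: sqlite3.Connection, now: datetime) -> list[str]:
    # ISO-8601 strings in the same format sort and compare correctly as text.
    rows = conn.execute(
        "SELECT title FROM snippets WHERE expires_at IS NULL OR expires_at > ? ORDER BY id DESC",
        (now.isoformat(timespec="seconds"),),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Simulate a database created by an earlier task, before the metadata columns existed.
conn.execute("CREATE TABLE snippets (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, body TEXT)")
migrate(conn)
migrate(conn)  # Running twice is safe: existing columns are skipped.
conn.executemany(
    "INSERT INTO snippets (title, body, expires_at) VALUES (?, ?, ?)",
    [("Expired", "old", "2000-01-01T00:00:00"), ("Active", "new", "2999-01-01T00:00:00"), ("Forever", "x", None)],
)
print(list_active(conn, datetime(2026, 6, 14, 10, 0, 0)))
```

Interpolating column names with an f-string is acceptable here only because they come from a fixed in-code mapping; SQLite cannot bind identifiers as `?` parameters.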
@@ -14,6 +14,12 @@ Or create an isolated integration sandbox from the NightShift repository root:
python -m nightshift.cli integ-run --template tutorial-pastebin
```

To create, set up, validate, and run one task in a single command:

```bash
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```

To create the sandbox and set it up in one step:

```bash

@@ -48,12 +54,8 @@ nightshift what-happened

When running from an integration sandbox, the same commands are run inside `integ_runs/<timestamp>/project`.

The pipeline uses model fallback ordering for implementation attempts:

1. `qwen2.5-coder:14b`
2. `carstenuhlig/omnicoder-9b`
3. `deepseek-coder-v2:16b`
The default pastebin pipeline uses `qwen3-coder:30b` for planning, implementation, debugging, test review, and final review. It intentionally does not use multi-candidate fallback; pastebin is the deterministic reliability harness.

Telemetry artifacts record which agent/model handled each stage and estimate token usage.

This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them.
This template uses fixed task-specific pytest files. The pipeline starts with a skeletal package, implements only the current task, runs `tests/test_{task_id_compact}.py`, and then reviews the result.

@@ -20,51 +20,49 @@ safety:
    - curl | bash

experiment:
  label: pastebin-model-fallback
  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
  label: pastebin-qwen3-coder
  prompt_variant: fixed-tests-qwen3-coder-30b-v1

agents:
  planner:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.2
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/planner.md

  implementer_qwen:
  implementer:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/implementer.md

  test_writer:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/test-writer.md

  implementer_omnicoder:
    backend: ollama
    model: carstenuhlig/omnicoder-9b
    temperature: 0.1
    system_prompt: .nightshift/agents/implementer.md

  implementer_deepseek:
    backend: ollama
    model: deepseek-coder-v2:16b
    temperature: 0.1
    system_prompt: .nightshift/agents/implementer.md

  debugger:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    role: debugger
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/debugger.md

  reviewer:
    backend: ollama
    model: qwen2.5-coder:14b
    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/reviewer.md

pipeline:

@@ -87,10 +85,7 @@ pipeline:

    - id: implement
      type: file_writer
      agent_pool:
        - implementer_qwen
        - implementer_omnicoder
        - implementer_deepseek
      agent: implementer
      output: proposed.patch

    - id: normalize

@@ -16,6 +16,7 @@ def test_create_snippet_returns_created_snippet_id(tmp_path):
    assert response.status_code == 201
    data = response.get_json()
    assert isinstance(data["id"], int)
    assert (tmp_path / "snippets.db").exists()


def test_view_snippet_returns_persisted_fields(tmp_path):

@@ -38,6 +39,7 @@ def test_view_snippet_returns_persisted_fields(tmp_path):
        "title": "View me",
        "body": "stored body",
    }
    assert (tmp_path / "snippets.db").exists()


def test_view_missing_snippet_returns_404(tmp_path):

@@ -0,0 +1,50 @@
from pastebin_app.app import create_app


def test_create_snippet_accepts_optional_metadata(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()

    response = client.post(
        "/snippets",
        json={
            "title": "Tagged",
            "body": "metadata body",
            "language": "python",
            "tags": ["alpha", "beta"],
            "expires_at": "2030-01-01T00:00:00",
        },
    )

    assert response.status_code == 201
    assert isinstance(response.get_json()["id"], int)
    assert (tmp_path / "snippets.db").exists()


def test_view_snippet_returns_optional_metadata(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()

    created = client.post(
        "/snippets",
        json={
            "title": "Tagged",
            "body": "metadata body",
            "language": "python",
            "tags": ["alpha", "beta"],
            "expires_at": "2030-01-01T00:00:00",
        },
    ).get_json()

    response = client.get(f"/snippets/{created['id']}")

    assert response.status_code == 200
    assert response.get_json() == {
        "id": created["id"],
        "title": "Tagged",
        "body": "metadata body",
        "language": "python",
        "tags": ["alpha", "beta"],
        "expires_at": "2030-01-01T00:00:00",
    }
    assert (tmp_path / "snippets.db").exists()

@@ -0,0 +1,47 @@
from pastebin_app.app import create_app


def _create(client, title, body, **metadata):
    response = client.post("/snippets", json={"title": title, "body": body, **metadata})
    assert response.status_code == 201
    return response.get_json()["id"]


def test_list_snippets_newest_first(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()

    first_id = _create(client, "First", "older")
    second_id = _create(client, "Second", "newer")

    response = client.get("/snippets")

    assert response.status_code == 200
    ids = [snippet["id"] for snippet in response.get_json()]
    assert ids[:2] == [second_id, first_id]


def test_search_filters_by_title_or_body(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    _create(client, "Python note", "ordinary body")
    _create(client, "Other", "contains needle")

    response = client.get("/snippets?q=python")
    assert [snippet["title"] for snippet in response.get_json()] == ["Python note"]

    response = client.get("/snippets?q=needle")
    assert [snippet["title"] for snippet in response.get_json()] == ["Other"]


def test_language_and_tag_filters(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    _create(client, "Python", "body", language="python", tags=["code", "demo"])
    _create(client, "Text", "body", language="text", tags=["notes"])

    response = client.get("/snippets?language=python")
    assert [snippet["title"] for snippet in response.get_json()] == ["Python"]

    response = client.get("/snippets?tag=notes")
    assert [snippet["title"] for snippet in response.get_json()] == ["Text"]

@@ -0,0 +1,43 @@
from pastebin_app.app import create_app


def test_expired_snippets_are_excluded_from_listing(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    client.post(
        "/snippets",
        json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
    )
    active = client.post(
        "/snippets",
        json={"title": "Active", "body": "new", "expires_at": "2999-01-01T00:00:00"},
    ).get_json()

    response = client.get("/snippets")

    assert response.status_code == 200
    assert [snippet["id"] for snippet in response.get_json()] == [active["id"]]


def test_direct_lookup_of_expired_snippet_returns_410(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    expired = client.post(
        "/snippets",
        json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
    ).get_json()

    response = client.get(f"/snippets/{expired['id']}")

    assert response.status_code == 410


def test_non_expiring_snippet_remains_visible(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    created = client.post("/snippets", json={"title": "Forever", "body": "body"}).get_json()

    response = client.get(f"/snippets/{created['id']}")

    assert response.status_code == 200
    assert response.get_json()["title"] == "Forever"

@@ -0,0 +1,46 @@
from pastebin_app.app import create_app


def test_root_shows_snippet_list_html(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    client.post("/snippets", json={"title": "Visible", "body": "body"})

    response = client.get("/")

    assert response.status_code == 200
    assert "Visible" in response.get_data(as_text=True)


def test_new_snippet_form_loads(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()

    response = client.get("/new")

    assert response.status_code == 200
    html = response.get_data(as_text=True)
    assert 'name="title"' in html
    assert 'name="body"' in html
    assert 'name="language"' in html
    assert 'name="tags"' in html
    assert 'name="expires_at"' in html


def test_form_post_redirects_to_snippet_view(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()

    response = client.post(
        "/new",
        data={
            "title": "Form title",
            "body": "Form body",
            "language": "text",
            "tags": "forms,html",
            "expires_at": "",
        },
    )

    assert response.status_code == 302
    assert response.headers["Location"].endswith("/snippets/1")

48
nightshift/task_tests.py
Normal file
@@ -0,0 +1,48 @@
"""Task-specific test file validation."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

from .commands import extract_test_file_paths, render_command_template
from .config import COMMAND_STAGE_TYPES, NightShiftConfig
from .tasks import Task


@dataclass(frozen=True)
class TaskTestCheck:
    task_id: str
    path: str
    exists: bool


def check_task_test_files(config: NightShiftConfig, tasks: tuple[Task, ...] | list[Task]) -> tuple[TaskTestCheck, ...]:
    checks: list[TaskTestCheck] = []
    for task in tasks:
        seen: set[str] = set()
        for stage in config.pipeline.stages:
            if stage.type not in COMMAND_STAGE_TYPES:
                continue
            for command in stage.commands:
                rendered = render_command_template(command, task.id)
                for path_text in extract_test_file_paths(rendered):
                    if path_text in seen:
                        continue
                    seen.add(path_text)
                    checks.append(TaskTestCheck(task.id, path_text, (config.project.root / path_text).exists()))
    return tuple(checks)


def format_task_test_checks(checks: tuple[TaskTestCheck, ...]) -> str:
    if not checks:
        return "Task test files: no task-specific test paths detected."
    lines = ["Task test files:"]
    for check in checks:
        status = "ok" if check.exists else "missing"
        lines.append(f"- {check.task_id}: {check.path} ({status})")
    return "\n".join(lines)


def missing_task_test_paths(checks: tuple[TaskTestCheck, ...]) -> tuple[Path, ...]:
    return tuple(Path(check.path) for check in checks if not check.exists)

@@ -6,6 +6,7 @@ from nightshift.artifacts import ArtifactStore
from nightshift.commands import CommandExecutor
from nightshift.commands import CommandRun, format_command_runs
from nightshift.commands import _command_env
from nightshift.commands import render_command_template
from nightshift.config import SafetyConfig, StageConfig
from nightshift.errors import CommandError
import sys

@ -16,6 +17,13 @@ FAILING_COMMAND = 'python -c "import sys; print(\'bad\'); sys.exit(7)"'
|
|||
|
||||
|
||||
class CommandExecutorTests(unittest.TestCase):
|
||||
def test_render_command_template_includes_task_id_variants(self) -> None:
|
||||
command = "python -m pytest -q tests/test_{task_id_compact}.py # {task_id_slug} {task_id}"
|
||||
|
||||
rendered = render_command_template(command, "TASK-001")
|
||||
|
||||
self.assertEqual(rendered, "python -m pytest -q tests/test_task001.py # task_001 TASK-001")
|
||||
|
||||
def test_passing_command_stage_returns_pass_and_writes_output(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as directory:
|
||||
root = Path(directory)
|
||||
|
|
@ -46,6 +54,33 @@ class CommandExecutorTests(unittest.TestCase):
|
|||
self.assertIn("Exit code: 0", output)
|
||||
self.assertIn("ok", output)
|
||||
|
||||
def test_command_stage_renders_task_id_before_allowlist_check(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as directory:
|
||||
root = Path(directory)
|
||||
artifacts = ArtifactStore(root, ".nightshift", run_id="test-run")
|
||||
executor = CommandExecutor(
|
||||
root,
|
||||
SafetyConfig(
|
||||
require_clean_worktree=False,
|
||||
scoped_paths=(".",),
|
||||
allowed_commands=('python -c "print(\'{task_id_compact}\')"',),
|
||||
forbidden_commands=("rm -rf",),
|
||||
),
|
||||
artifacts,
|
||||
)
|
||||
stage = StageConfig(
|
||||
id="test",
|
||||
type="command",
|
||||
commands=('python -c "print(\'{task_id_compact}\')"',),
|
||||
output="test-output.txt",
|
||||
)
|
||||
|
||||
result = executor.run_stage(stage, "TASK-002")
|
||||
|
||||
self.assertEqual(result.status, "pass")
|
||||
output = (root / result.output_path).read_text(encoding="utf-8")
|
||||
self.assertIn("task002", output)
|
||||
|
||||
def test_failing_command_stage_returns_fail_and_writes_output(self) -> None:
|
||||
with tempfile.TemporaryDirectory() as directory:
|
||||
root = Path(directory)
|
||||
|
|
|
|||
|
|
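`test_command_stage_renders_task_id_before_allowlist_check` implies an ordering. Placeholders are expanded before the command is compared against `safety.allowed_commands`, even though the allowlist entry itself still contains `{task_id_compact}`. A minimal sketch of that ordering follows. It assumes that allowlist entries are rendered for the same task and matched exactly, and that forbidden entries are plain substrings. The real matching rules in `nightshift.commands` are not shown in this diff and may differ. `is_command_allowed` is a hypothetical name.

```python
import re


def render_command_template(command: str, task_id: str) -> str:
    """Hypothetical re-implementation reproducing the outputs asserted in the tests."""
    compact = re.sub(r"[^a-z0-9]", "", task_id.lower())  # TASK-001 -> task001
    slug = re.sub(r"[^a-z0-9]+", "_", task_id.lower())  # TASK-001 -> task_001
    # str.replace instead of str.format, so unrelated braces in shell commands are left alone.
    for key, value in (("{task_id_compact}", compact), ("{task_id_slug}", slug), ("{task_id}", task_id)):
        command = command.replace(key, value)
    return command


def is_command_allowed(command: str, task_id: str, allowed: tuple[str, ...], forbidden: tuple[str, ...]) -> bool:
    """Render first, then check; allowlist entries are assumed to use the same placeholders."""
    rendered = render_command_template(command, task_id)
    # Assumption: forbidden entries are plain substrings of the rendered command.
    if any(pattern in rendered for pattern in forbidden):
        return False
    # Assumption: exact match against allowlist entries rendered for the same task.
    return rendered in {render_command_template(entry, task_id) for entry in allowed}


if __name__ == "__main__":
    template = "python -m pytest -q tests/test_{task_id_compact}.py # {task_id_slug} {task_id}"
    print(render_command_template(template, "TASK-001"))
    # python -m pytest -q tests/test_task001.py # task_001 TASK-001
```

Rendering both sides with the same task id lets one templated allowlist entry cover every task. A literal comparison would reject `print('task002')` against `print('{task_id_compact}')`.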
@@ -282,6 +282,27 @@ class ConfigTests(unittest.TestCase):
 
             self.assertEqual(config.agents["planner"].temperature, 0.2)
 
+    def test_agent_ollama_options_load(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            init_project(root)
+            config_path = root / "nightshift.yaml"
+            config_path.write_text(
+                config_path.read_text(encoding="utf-8").replace(
+                    "    system_prompt: agents/planner.md",
+                    "    system_prompt: agents/planner.md\n    num_ctx: 8192\n    num_predict: 4096\n    seed: 1\n    stop:\n      - STOP",
+                    1,
+                ),
+                encoding="utf-8",
+            )
+
+            config = load_config(config_path)
+
+            self.assertEqual(config.agents["planner"].num_ctx, 8192)
+            self.assertEqual(config.agents["planner"].num_predict, 4096)
+            self.assertEqual(config.agents["planner"].seed, 1)
+            self.assertEqual(config.agents["planner"].stop, ("STOP",))
+
     def test_agent_temperature_must_be_number(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)

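The new agent fields correspond to Ollama's per-request model options. `num_ctx` sets the context window, `num_predict` caps the number of generated tokens, `seed` fixes sampling, and `stop` lists stop sequences. The sketch below validates these fields and builds the request `options` payload. It is consistent with what `test_agent_ollama_options_load` asserts: integers stay integers and `stop` becomes a tuple. `OllamaOptions`, `parse_options`, and `to_request_options` are hypothetical names. nightshift's own agent config type is not shown in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class OllamaOptions:
    temperature: float | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


def _optional_int(raw: dict[str, Any], key: str) -> int | None:
    value = raw.get(key)
    if value is None:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{key} must be an integer, got {value!r}")
    return value


def parse_options(raw: dict[str, Any]) -> OllamaOptions:
    temperature = raw.get("temperature")
    if temperature is not None and (isinstance(temperature, bool) or not isinstance(temperature, (int, float))):
        raise ValueError(f"temperature must be a number, got {temperature!r}")
    stop = raw.get("stop") or ()
    if not isinstance(stop, (list, tuple)) or not all(isinstance(item, str) for item in stop):
        raise ValueError("stop must be a list of strings")
    return OllamaOptions(
        temperature=None if temperature is None else float(temperature),
        num_ctx=_optional_int(raw, "num_ctx"),
        num_predict=_optional_int(raw, "num_predict"),
        seed=_optional_int(raw, "seed"),
        stop=tuple(stop),
    )


def to_request_options(options: OllamaOptions) -> dict[str, Any]:
    """Build the Ollama `options` payload, omitting unset fields."""
    payload: dict[str, Any] = {
        "temperature": options.temperature,
        "num_ctx": options.num_ctx,
        "num_predict": options.num_predict,
        "seed": options.seed,
        "stop": list(options.stop) or None,
    }
    return {key: value for key, value in payload.items() if value is not None}
```

Leaving unset fields out of the payload lets the server or the model's own defaults apply, instead of sending explicit nulls.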
@@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase):
         self.assertIn("tutorial-imageboard", available_templates())
         self.assertIn("tutorial-pastebin", available_templates())
 
-    def test_init_pastebin_template_creates_skeleton_and_model_fallback_config(self) -> None:
+    def test_init_pastebin_template_creates_skeleton_and_qwen3_config(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)
 
@@ -78,11 +78,15 @@ class InitProjectTests(unittest.TestCase):
             self.assertIn("type: semantic_context", config)
             self.assertNotIn("id: write_tests", config)
             self.assertNotIn("id: review_tests", config)
-            self.assertIn("python -m pytest -q tests", config)
+            self.assertIn("python -m pytest -q tests/test_{task_id_compact}.py", config)
             self.assertIn("max_task_retries: 6", config)
-            self.assertIn("implementer_qwen", config)
-            self.assertIn("carstenuhlig/omnicoder-9b", config)
-            self.assertIn("deepseek-coder-v2:16b", config)
+            self.assertIn("implementer:", config)
+            self.assertIn("qwen3-coder:30b", config)
+            self.assertIn("num_ctx: 8192", config)
+            self.assertIn("num_predict: 4096", config)
+            self.assertNotIn("agent_pool:", config)
+            self.assertNotIn("carstenuhlig/omnicoder-9b", config)
+            self.assertNotIn("deepseek-coder-v2:16b", config)
 
     def test_pastebin_example_tutorial_docs_exist(self) -> None:
         root = Path(__file__).resolve().parents[1]

51
tests/test_integ_test.py
Normal file
@@ -0,0 +1,51 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.integ_report import build_integration_report, format_integration_report
+from nightshift.integ_test import format_integration_test_result, run_integration_test
+
+
+class IntegrationTestCommandTests(unittest.TestCase):
+    def test_run_integration_test_dry_run_builds_task_command(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            result = run_integration_test(
+                directory,
+                template="tutorial-pastebin",
+                task="TASK-001",
+                dry_run=True,
+            )
+
+            rendered = format_integration_test_result(result)
+            self.assertIn("Dry run: true", rendered)
+            self.assertIn("TASK-001", " ".join(result.command))
+            self.assertTrue((result.run.directory / "project" / "nightshift.yaml").exists())
+
+    def test_build_integration_report_summarizes_latest_task_summary(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            summary = root / "integ_runs" / "20260521T000000.000000Z" / "project" / ".nightshift" / "runs" / "run1" / "tasks" / "TASK-001" / "run-summary.md"
+            summary.parent.mkdir(parents=True)
+            summary.write_text(
+                "\n".join(
+                    [
+                        "# Run Summary",
+                        "",
+                        "- Task: TASK-001",
+                        "- Status: complete",
+                        "- Retry count: 1",
+                        "- Reason: Done.",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+
+            report = build_integration_report(root)
+            rendered = format_integration_report(report)
+
+            self.assertIn("TASK-001 complete after 1 retries", rendered)
+            self.assertIn("Reason: Done.", rendered)
+
+
+if __name__ == "__main__":
+    unittest.main()

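`test_build_integration_report_summarizes_latest_task_summary` fixes the layout that `integ-report` reads. There are timestamped folders under `integ_runs/`, each holding a generated project whose `.nightshift/runs/*/tasks/*/run-summary.md` files use `- Key: value` bullets. The sketch below parses that layout and picks the newest run by folder name. It relies on fixed-width UTC timestamps such as `20260521T000000.000000Z` sorting chronologically as strings. All function names are hypothetical. `nightshift.integ_report` itself is not part of this excerpt, and the output line only mirrors the strings the test asserts.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

FIELD = re.compile(r"^- (?P<key>[^:]+): (?P<value>.*)$")
SUMMARY_GLOB = "project/.nightshift/runs/*/tasks/*/run-summary.md"


def parse_run_summary(text: str) -> dict[str, str]:
    """Collect `- Key: value` bullets from a run-summary.md file."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group("key").strip()] = match.group("value").strip()
    return fields


def latest_run_dir(root: Path) -> Path | None:
    """Pick the newest integ_runs/<timestamp> folder by name."""
    runs = [path for path in (root / "integ_runs").glob("*") if path.is_dir()]
    # Fixed-width UTC timestamps sort chronologically as plain strings.
    return max(runs, key=lambda path: path.name, default=None)


def summarize_latest(root: Path) -> list[str]:
    run = latest_run_dir(root)
    if run is None:
        return ["No integration runs found."]
    lines: list[str] = []
    for summary in sorted(run.glob(SUMMARY_GLOB)):
        fields = parse_run_summary(summary.read_text(encoding="utf-8"))
        task, status = fields.get("Task", "?"), fields.get("Status", "?")
        lines.append(f"{task} {status} after {fields.get('Retry count', '?')} retries")
        if "Reason" in fields:
            lines.append(f"  Reason: {fields['Reason']}")
    return lines


def write_summary(root: Path, stamp: str, task: str, status: str, retries: int, reason: str) -> None:
    """Recreate the directory layout used by the integ-report test (demo helper)."""
    path = root / "integ_runs" / stamp / "project" / ".nightshift" / "runs" / "run1" / "tasks" / task / "run-summary.md"
    path.parent.mkdir(parents=True)
    text = f"# Run Summary\n\n- Task: {task}\n- Status: {status}\n- Retry count: {retries}\n- Reason: {reason}\n"
    path.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_summary(root, "20260520T000000.000000Z", "TASK-001", "failed", 6, "Older run.")
        write_summary(root, "20260521T000000.000000Z", "TASK-001", "complete", 1, "Done.")
        print("\n".join(summarize_latest(root)))
```

As a script, it prints `TASK-001 complete after 1 retries` followed by `  Reason: Done.` and ignores the older failed run.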
@@ -105,6 +105,29 @@ class PipelineRunnerTests(unittest.TestCase):
             )
             self.assertIn("Modified Files", (root / ".nightshift" / "runs" / "test-run" / "run-summary.md").read_text(encoding="utf-8"))
 
+    def test_task_preflight_fails_when_task_specific_test_file_is_missing(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            _write_common_files(root)
+            stages = (
+                StageConfig(
+                    id="test",
+                    type="command",
+                    commands=("python -m pytest -q tests/test_{task_id_compact}.py",),
+                    output="test-output.txt",
+                ),
+            )
+            config = make_config(root, stages, max_retries=0)
+            runner = PipelineRunner(config, ArtifactStore(root, ".nightshift", run_id="test-run"))
+            task = parse_tasks(TASK_MD)[0]
+
+            result = runner.run_task(task)
+
+            self.assertEqual(result.status, "failed")
+            self.assertIn("configured task test file is missing", result.reason)
+            task_dir = root / ".nightshift" / "runs" / "test-run" / "tasks" / task.id
+            self.assertIn("tests/test_task001.py", (task_dir / "preflight.md").read_text(encoding="utf-8"))
+
     def test_review_can_retry_implementation_until_limit(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)

77
tests/test_task_tests.py
Normal file
@@ -0,0 +1,77 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.config import validate_config
+from nightshift.task_tests import check_task_test_files, missing_task_test_paths
+from nightshift.tasks import parse_task_file
+
+
+class TaskTestValidationTests(unittest.TestCase):
+    def test_check_task_test_files_renders_task_placeholder(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            (root / "agents").mkdir()
+            (root / "agents" / "planner.md").write_text("Prompt", encoding="utf-8")
+            (root / "tests").mkdir()
+            (root / "tests" / "test_task001.py").write_text("def test_ok():\n    assert True\n", encoding="utf-8")
+            (root / "nightshift.yaml").write_text(
+                "\n".join(
+                    [
+                        "project:",
+                        "  name: task-test-validation",
+                        "  root: .",
+                        "  task_file: tasks.md",
+                        "  artifact_dir: .nightshift",
+                        "",
+                        "safety:",
+                        "  require_clean_worktree: false",
+                        "  scoped_paths:",
+                        "    - .",
+                        "  allowed_commands:",
+                        "    - python -m pytest -q tests/test_{task_id_compact}.py",
+                        "  forbidden_commands:",
+                        "    - rm -rf",
+                        "",
+                        "agents:",
+                        "  planner:",
+                        "    backend: command",
+                        "    command: python -c \"print('ok')\"",
+                        "    system_prompt: agents/planner.md",
+                        "",
+                        "pipeline:",
+                        "  stages:",
+                        "    - id: test",
+                        "      type: command",
+                        "      commands:",
+                        "        - python -m pytest -q tests/test_{task_id_compact}.py",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+            (root / "tasks.md").write_text(
+                """# Tasks
+
+- [ ] TASK-001: One
+
+Acceptance Criteria:
+- passes
+
+- [ ] TASK-002: Two
+
+Acceptance Criteria:
+- reports missing test
+""",
+                encoding="utf-8",
+            )
+
+            config = validate_config(root / "nightshift.yaml")
+            tasks = parse_task_file(config.project.root, config.project.task_file)
+            checks = check_task_test_files(config, tasks)
+
+            self.assertEqual([check.path for check in checks], ["tests/test_task001.py", "tests/test_task002.py"])
+            self.assertEqual(tuple(path.as_posix() for path in missing_task_test_paths(checks)), ("tests/test_task002.py",))
+
+
+if __name__ == "__main__":
+    unittest.main()