Add tutorial integration workflow helpers

- Add `integ-test` to create, set up, validate, and run integration template tasks
- Add `integ-report` to summarize latest integration run artifacts
- Switch default pastebin template from model fallback to single `qwen3-coder:30b`
- Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
- Add `nightshift validate` preflight for task-specific test files
- Update pastebin docs, config reference, and ideas tracking
- Add tests for integration helpers, task-test validation, config parsing, and template expectations
K. Hodges 2026-05-21 03:46:27 -07:00
parent e3679296fd
commit f7fed4535b
29 changed files with 1251 additions and 280 deletions


@ -1,195 +0,0 @@
# Bugfix TODO
## Some issues with `run --all`
Intermittent failure: `reason=Stage 'review' requested unknown next stage 'None'`. It does not happen every time, so there may be a pattern here. Maybe it's related to the last task's success? Or the last run?
## Going from individual tasks to --all fails
If you run `nightshift run --task TASK-001`, let it complete, and then run `nightshift run --all`, it fails with `blocked by missing dependencies: TASK-001`. I think this is because the tasks get reset at the top of the run, but something marks TASK-001 as complete in a way that requires a manual reset.
`run --all` should start at the first task that is not done (it seems to).
## Some kind of tool install feature
Runs continually fail on `flask_sqlalchemy` until I install it.
## Tutorial needs to include `.` directory for imageboard
## Git status artifacts are noisy for non-git repositories
Observed artifact:
```text
# Git Status before
Available: false
Exit code: 128
fatal: not a git repository (or any of the parent directories): .git
```
Current behavior:
- NightShift continues when `require_clean_worktree: false`.
- `git-status-before.txt`, `git-status-after.txt`, and `diff.patch` may contain git errors.
- This is technically safe, but confusing for users running quickstart/demo projects outside git.
Desired behavior:
- Detect non-git repositories explicitly.
- Write a clearer artifact message such as:
```text
Git repository: false
Clean-worktree enforcement: skipped because require_clean_worktree is false
Diff artifact: unavailable because project is not a git repository
```
- Avoid treating non-git as a scary-looking failure when clean worktree is not required.
Acceptance criteria:
- Non-git projects produce readable git artifacts without fatal-looking output.
- `require_clean_worktree: true` still fails safely in non-git projects.
- Reports mention that git metadata/diff is unavailable because the project is not a git repo.
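A minimal sketch of the detection step described above, assuming `git rev-parse --is-inside-work-tree` is an acceptable probe. The function names `is_git_repository` and `describe_git_state` are hypothetical and not part of NightShift; the wording for the `require_clean_worktree: true` case is also an assumption.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def is_git_repository(project_root: Path) -> bool:
    """Return True only when project_root is inside a git work tree."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--is-inside-work-tree"],
            cwd=project_root,
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed, or project_root does not exist.
        return False
    return result.returncode == 0 and result.stdout.strip() == "true"


def describe_git_state(is_repo: bool, require_clean_worktree: bool) -> str:
    """Render a readable artifact body instead of raw git stderr."""
    if is_repo:
        return "Git repository: true"
    if require_clean_worktree:
        # Hypothetical wording: the notes only require that this case still fails safely.
        enforcement = "failed because project is not a git repository"
    else:
        enforcement = "skipped because require_clean_worktree is false"
    return "\n".join(
        [
            "Git repository: false",
            f"Clean-worktree enforcement: {enforcement}",
            "Diff artifact: unavailable because project is not a git repository",
        ]
    )
```

Keeping the probe separate from the message formatting means the artifact text can be unit-tested without a git checkout.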
## Git safe.directory / ownership conflicts on Windows
Observed context:
- Git can report dubious ownership or safe-directory errors when a repo was created or managed by a different Windows user identity.
- This may happen when using GitHub Desktop, WSL, admin shells, or multiple Windows accounts.
Current behavior:
- NightShift records the raw git error in artifacts.
- If `require_clean_worktree: true`, NightShift blocks execution.
- If `require_clean_worktree: false`, NightShift continues but git status/diff artifacts can look like hard failures.
Desired behavior:
- Detect common `dubious ownership` / `safe.directory` messages.
- Write a clearer explanation in artifacts and reports.
- Suggest the exact remediation outside NightShift, for example:
```powershell
git config --global --add safe.directory <project-root>
```
Acceptance criteria:
- Safe-directory failures are classified separately from ordinary git failures.
- Users get actionable guidance.
- NightShift does not attempt to change global git config automatically.
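The classification above could be sketched roughly as follows. This assumes that matching the two phrases named in the notes against git's stderr is sufficient; `classify_git_failure` and `remediation_hint` are hypothetical names. The hint is only returned as text, so nothing here touches global git config.

```python
from __future__ import annotations


def classify_git_failure(stderr: str) -> str:
    """Classify raw git stderr into a coarse failure kind."""
    lowered = stderr.lower()
    if "dubious ownership" in lowered or "safe.directory" in lowered:
        return "safe_directory"
    if "not a git repository" in lowered:
        return "not_a_repository"
    return "other"


def remediation_hint(kind: str, project_root: str) -> str | None:
    """Return a command the user can run themselves, or None if there is no known fix."""
    if kind == "safe_directory":
        # Only suggested in artifacts and reports; never executed automatically.
        return f'git config --global --add safe.directory "{project_root}"'
    return None
```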
## Clarify docs around git requirements
Add to `QUICKSTART.md` and troubleshooting:
- Git is optional when `require_clean_worktree: false`.
- Git is required for clean-worktree enforcement and useful diffs.
- Non-git projects can still run pipelines.
- Git ownership/safe-directory errors affect git artifacts, not core task execution, unless clean-worktree enforcement is enabled.
## Console appears idle during long agent calls
Current behavior:
- Long Ollama calls can make `nightshift run` look frozen.
- Progress is only visible by inspecting `.nightshift/` artifacts or `ollama ps`.
Desired behavior:
- Print stage start/finish messages to the console.
- Include agent id, stage id, task id, and artifact path when available.
- Do not stream model output yet; just show lifecycle progress.
Acceptance criteria:
- User can tell which stage is running.
- Long-running model calls no longer look like a hung process.
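One possible shape for the lifecycle messages described above, as a sketch only. `format_stage_event` and `announce` are hypothetical helpers, and the message layout is an assumption rather than NightShift's actual console format.

```python
from __future__ import annotations


def format_stage_event(
    event: str,
    *,
    task_id: str,
    stage_id: str,
    agent_id: str | None = None,
    artifact_path: str | None = None,
) -> str:
    """Build one console line for a stage start or finish event."""
    parts = [f"[{task_id}]", f"stage={stage_id}", event]
    if agent_id:
        parts.append(f"agent={agent_id}")
    if artifact_path:
        parts.append(f"artifacts={artifact_path}")
    return " ".join(parts)


def announce(event: str, **fields: str | None) -> None:
    # flush=True so the line appears before a long model call blocks the process.
    print(format_stage_event(event, **fields), flush=True)
```

Calling `announce("started", task_id="TASK-001", stage_id="implement", agent_id="implementer")` before the model call and `announce("finished", ...)` after it would be enough to show which stage is running without streaming model output.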
## Ollama output can make review stages fail if not structured
Current behavior:
- Review stages require `status: pass | fail | retry | escalate`.
- General-purpose model output may include prose before/after the structured fields.
- If no valid status is found, the review stage fails.
Desired behavior:
- Keep strict structured review parsing, but improve prompt templates and error messages.
- Artifact should clearly say the review output was unparseable and show the expected contract.
Acceptance criteria:
- Failed review parsing is easy to diagnose from `review.md` and `stage-results.md`.
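A sketch of strict status parsing that tolerates prose around the structured field, under the assumption that the status appears on its own `status: <value>` line. `parse_review_status` and the error wording are hypothetical, not NightShift's parser.

```python
import re

VALID_STATUSES = ("pass", "fail", "retry", "escalate")

# A status line must contain only "status: <word>"; surrounding prose lines are ignored.
_STATUS_LINE = re.compile(r"^\s*status\s*:\s*(\w+)\s*$", re.IGNORECASE | re.MULTILINE)


def parse_review_status(output: str) -> str:
    """Return the first valid review status, or raise with the expected contract."""
    for match in _STATUS_LINE.finditer(output):
        status = match.group(1).lower()
        if status in VALID_STATUSES:
            return status
    raise ValueError(
        "Review output was unparseable: expected a line like "
        "'status: pass' (one of pass | fail | retry | escalate)."
    )
```

Raising with the contract spelled out gives the artifact writer a message it can copy into `review.md` as-is.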
## `echo` fake agents do not behave consistently across shells
Current behavior:
- Starter templates use `command: echo`.
- Depending on shell/platform, `echo` may not preserve stdin or may only echo arguments.
- This can make fake agent artifacts less useful.
Desired behavior:
- Replace fake-agent defaults with small Python one-liners or documented fake-agent scripts.
- Keep examples cross-platform.
Acceptance criteria:
- Starter project produces predictable fake-agent output on Windows PowerShell/cmd and Unix shells.
## `unittest discover` behavior depends on test package layout
Current behavior:
- Python 3.14 returned `NO TESTS RAN` with exit code 5 for an example project until `tests/__init__.py` was added.
- Users may hit the same issue in fresh target repos.
Desired behavior:
- Document this in troubleshooting.
- Consider making quickstart templates include `tests/__init__.py`.
Acceptance criteria:
- Quickstart test command works in a fresh copied example.
- Troubleshooting mentions what to do if `NO TESTS RAN` appears.
## Task completion can mark tasks complete even if no source changed
Current behavior:
- A pipeline can pass with fake agents and passing tests, then mark the task complete.
- This is expected for fake/demo mode but surprising when users expect code edits.
Desired behavior:
- Add a warning when a task completes and git/diff detects no source changes, where git is available.
- Documentation should explain fake-agent mode vs editing-agent mode.
Acceptance criteria:
- Users are less likely to mistake artifact generation for code modification.
## Dashboard requires Flask but dependency is optional
Current behavior:
- `nightshift web` fails with a helpful message if Flask is missing.
- README mentions `pip install flask`, but install extras are not defined.
Desired behavior:
- Add an optional dependency group such as `nightshift[web]` later.
- Keep graceful error behavior.
Acceptance criteria:
- Users have one documented install command for dashboard support.


@ -62,11 +62,19 @@ Ollama agent:
 ```yaml
 planner:
   backend: ollama
-  model: qwen2.5-coder:14b
+  model: qwen3-coder:30b
   base_url: http://localhost:11434
   system_prompt: agents/planner.md
+  temperature: 0.2
+  num_ctx: 8192
+  num_predict: 4096
+  seed: 1
+  stop:
+    - STOP
 ```

+Optional Ollama generation options currently supported by NightShift are `temperature`, `num_ctx`, `num_predict`, `seed`, and `stop`.
+
 ## `pipeline`

 - `max_task_retries`: task retry limit.
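To illustrate how the optional fields in the hunk above might reach Ollama, here is a sketch that builds a non-streaming `/api/generate` request body. It mirrors the `_ollama_options` helper added later in this commit, but `AgentOptions` is a hypothetical stand-in for NightShift's `AgentConfig` and `build_generate_body` is not a NightShift function.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AgentOptions:
    """Hypothetical stand-in for AgentConfig: only the optional generation fields."""

    temperature: float | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


def build_generate_body(model: str, prompt: str, agent: AgentOptions) -> dict[str, object]:
    """Build the JSON body, including only the options that were configured."""
    options: dict[str, object] = {}
    for name in ("temperature", "num_ctx", "num_predict", "seed"):
        value = getattr(agent, name)
        if value is not None:
            options[name] = value
    if agent.stop:
        options["stop"] = list(agent.stop)

    body: dict[str, object] = {"model": model, "prompt": prompt, "stream": False}
    if options:
        # Omit the key entirely when nothing is set, so Ollama uses the model defaults.
        body["options"] = options
    return body
```

Leaving unset fields out of the request, rather than sending nulls, keeps older configs without these keys behaving exactly as before.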
@ -76,6 +84,7 @@ planner:
 Command stage options:

 - `commands`: command strings.
+- Command strings may use task placeholders: `{task_id}`, `{task_id_lower}`, `{task_id_slug}`, and `{task_id_compact}`.
 - `shell`: defaults to true. Set false for argv-style execution.
 - `timeout_seconds`: per-stage timeout override.
 - `working_dir`: command working directory inside project root.
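As an illustration of the four task placeholders listed above, the following sketch follows the `render_command_template` helper added to the command executor in this commit. Treat it as a standalone demonstration, not the shipped module.

```python
def render_command_template(command: str, task_id: str) -> str:
    """Substitute task placeholders into a configured command string."""
    task_id_lower = task_id.lower()
    return command.format(
        task_id=task_id,                                 # TASK-001
        task_id_lower=task_id_lower,                     # task-001
        task_id_slug=task_id_lower.replace("-", "_"),    # task_001
        task_id_compact=task_id_lower.replace("-", ""),  # task001
    )
```

For example, rendering `python -m pytest -q tests/test_{task_id_compact}.py` for `TASK-001` yields `python -m pytest -q tests/test_task001.py`. Because this relies on `str.format`, a command that needs literal braces would have to double them (`{{` and `}}`), which may be worth documenting alongside the placeholders.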
@ -141,6 +150,12 @@ Create a local integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```

+Create, set up, validate, and run one task from the generated project directory:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 Set up the generated Python project:

 ```bash
@ -161,6 +176,12 @@ Preview commands without running them:
 python -m nightshift.cli integ-setup --project integ_runs/<timestamp>/project --dry-run
 ```

+Summarize the latest integration artifact run:
+
+```bash
+python -m nightshift.cli integ-report --latest
+```
+
 To clean up old sandboxes before creating a new one, keep only the newest three existing runs:

 ```bash
@ -169,8 +190,4 @@ python -m nightshift.cli integ-run --template tutorial-pastebin --keep 3
 ## Pastebin Tutorial

-`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, and implementation fallback order:
-
-- `qwen2.5-coder:14b`
-- `carstenuhlig/omnicoder-9b`
-- `deepseek-coder-v2:16b`
+`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, fixed task-specific tests, and a single default `qwen3-coder:30b` model path.

docs/future_ideas.md (new file, 17 lines)

@ -0,0 +1,17 @@
# Future Ideas
Do not implement these until we get successful long-running runs.
## I am realizing "templates" are abstracted from the user
* I think templates will be a first-class citizen, a package for deployments, and a harness for performance tests.
* These should live outside `nightshift/project_templates`, as users will likely create their own.
* One solution would be to search two directories when looking up templates: built-in ones stay in `nightshift/project_templates`, and users can define a templates directory in their NightShift config.
## nightshift config
* Store user settings in `~/.nightshift/config.yaml`.
* Things like the templates folder (which can also live here).
* Maybe this is later.
## A way to easily make A/B tests to benchmark models?
* Right now I can do this manually. For example, I can run tutorial-pastebin with `qwen3.6:27b` as the planner and `qwen2.5-coder:14b` as the coder, then another run with `qwen3.6:27b` as both, and so on.
* Maybe there is a way to make that easier, possibly by creating a template that can be controlled by a larger multi-run file?
* This is probably for way later.

docs/ideas.md (new file, 366 lines)

@ -0,0 +1,366 @@
# Ideas TODO
This file is now prioritized inline. Priority scale:
- P0: do next; directly improves current feedback loop
- P1: important after the current loop is usable
- P2: useful, but only after basics are stable
- P3: defer or maybe reject
## P0: Make Integration Tests Easy To Run
Status: implemented.
Implemented command:
```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```
It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command.
Running integration tests is still too manual.
Current process:
- install the current version of NightShift
- run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
- copy the activation line from the output and run it
- `cd` into the generated directory
- run the task there, because running from the repo root does not find `nightshift.yaml`
Recommendation: implement a wrapper command, not just a loose script.
Target command:
```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
```
It should:
1. create the integration run
2. set up the venv
3. install NightShift from the current checkout
4. run `nightshift validate`
5. run the selected task from the generated project directory
6. print final status and artifact path
Useful variants:
```powershell
python -m nightshift.cli integ-test --template tutorial-pastebin --all
python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
```
The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly.
## P0/P1: Remove Multi-Candidate Workflow From Default Pastebin
Status: implemented for the default pastebin template and tutorial example.
Original idea:
- The multi-candidate workflow does not add as much as expected.
- Keep it as an example, maybe `example-multiagent`.
Recommendation: yes. Remove it from the default pastebin tutorial.
Reason:
- Pastebin is becoming the reliability harness.
- Multi-candidate fallback makes artifacts harder to reason about.
- It adds model variability while we are still debugging pipeline behavior.
Better split:
```text
tutorial-pastebin
tutorial-pastebin-multiagent
```
or:
```text
examples/templates/multiagent-fallback
```
Default pastebin should be boring:
```text
planner -> semantic_context -> context -> implement -> validate -> test -> review
```
Use one strong implementer first. Add fallback only in a separate experiment template.
## P1: Add A Qwen3 / 30B Pastebin Variant
Status: implemented as the default pastebin model path using `qwen3-coder:30b`.
Original idea:
- Use a non-coder model for planner roles.
- Try `qwen3.6:27b` for planning.
- Use `qwen3-coder:30b` for implementer and code-heavy roles.
Recommendation: viable, but make this a variant, not the default.
kass reply: No, let's make this the default. `qwen3-coder:30b` is fast now for me for some reason.
Suggested template/config:
```text
tutorial-pastebin-qwen3
```
Possible role split:
- planner: `qwen3.6:27b`
- reviewer/debugger: `qwen3.6:27b`
- implementer: `qwen3-coder:30b` or exact local 30B coder model name
Important: confirm exact model names with:
```powershell
ollama list
```
reply: I did; it's `qwen3-coder:30b`.
Use 30B where it pays:
- first implementation for hard tasks
- repair after concrete test failure
- schema/database changes
- multi-file changes
Do not blindly make every stage 30B if it is slow.
reply: `qwen3-coder:30b` is not slow now!
## P2: Expose More Model Parameters
Status: implemented for the practical first set.
Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to existing `temperature`.
Original question:
- What else besides temperature is available?
- Are any worth optimizing?
Likely useful for Ollama:
- `temperature`
- `num_ctx`
- `num_predict`
- `seed`
- `stop`
- maybe `top_p`, `top_k`, `repeat_penalty`
Recommendation: add only a small practical set first.
Useful config shape:
```yaml
temperature: 0.1
num_ctx: 8192
num_predict: 4096
seed: 1
```
Most useful:
- `num_ctx`: larger repo/task context
- `num_predict`: caps runaway output
- `seed`: reproducibility, if supported consistently
- `temperature`: already useful; keep low for code
- `stop`: could help enforce file-block or diff-only contracts
Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it.
reply: Yup, let's put this in `nightshift.yaml` as optional parameters. If they aren't in there, that's fine, but we should offer them.
## P1: Add Test Governance For Generated Tests
Original idea:
- Have a test governance layer for when agents write tests.
- A reviewer validates alignment with acceptance criteria.
Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.
The previous failures proved test-writing agents will:
- edit app code
- import nonexistent modules
- require undeclared dependencies
- inspect implementation internals
- write tests for future behavior
Governance should be deterministic first, model-reviewed second.
Deterministic checks:
- test-writing stage may only touch `tests/`
- tests compile
- tests import only allowed public interfaces
- tests do not import undeclared dependencies
- tests do not define Flask routes or app implementation
- test names match current task id or current artifact
- no future-task keywords unless accepted by current task AC
Then optional model reviewer checks acceptance-criteria alignment.
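Two of the deterministic checks above (touch only `tests/`, and import only allowed modules) could be sketched with the standard library as follows. The function names and the allow-list approach are assumptions for illustration; nothing here is an existing NightShift API.

```python
from __future__ import annotations

import ast
from pathlib import PurePosixPath


def paths_outside_tests(changed_paths: list[str]) -> list[str]:
    """Return changed paths that are not under the top-level tests/ directory."""
    outside = []
    for raw in changed_paths:
        parts = PurePosixPath(raw.replace("\\", "/")).parts
        if parts[:1] != ("tests",):
            outside.append(raw)
    return outside


def disallowed_imports(source: str, allowed: frozenset[str]) -> list[str]:
    """Return top-level module names imported by a test file but not allow-listed.

    ast.parse raises SyntaxError when the file does not compile, which covers
    the "tests compile" check as a side effect.
    """
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            if top and top not in allowed and top not in found:
                found.append(top)
    return found
```

Both checks are cheap and produce a concrete list of offenders, so they can run before any model-based reviewer is consulted.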
## P2: Add A Test Analyzer Agent For TDD
Original idea:
- Analyze tests.
- Translate them into direct instructions for the implementer.
- Maybe implement using agent YAML definitions without new NightShift features.
Recommendation: viable, but defer until generated tests are stable.
Possible pipeline:
```text
write_tests -> validate_tests -> analyze_tests -> implement
```
Analyzer output should be concrete:
```text
Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.
Do not modify:
- tests/test_task001.py
```
This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.
## P2/P3: Add A Test Planner
Original idea:
- A test planner understands acceptance criteria and code.
- Provides input to the next stage about constraints and code, especially for non-TDD.
Recommendation: maybe, but defer.
This overlaps with:
- planner
- test analyzer
- test governance
Too many planning-ish stages can make the pipeline bloated and contradictory.
If implemented later, keep it focused:
```text
test_planner -> write_tests -> test_governance -> implement
```
For now, fold this idea into the future test governance/analyzer work.
## P1: Add Fixed Tests For All Pastebin Tasks
Status: mostly implemented in the template.
Current fixed tests:
```text
tests/test_task001.py
tests/test_task002.py
tests/test_task003.py
tests/test_task004.py
tests/test_task005.py
```
Important design:
```yaml
python -m pytest -q tests/test_{task_id_compact}.py
```
This lets all future task tests exist without breaking earlier tasks.
Next step: validate these through integration runs, one task at a time.
## P1: Add `nightshift integ-report`
Status: implemented as a first-pass artifact summarizer.
New idea.
Summarize latest integration run across tasks:
```text
TASK-001 complete in 1 retry
TASK-002 failed at validate_patch
Root cause: protected tests modified
Artifacts: ...
```
Right now we inspect artifacts manually. NightShift should do more of that.
Possible command:
```powershell
python -m nightshift.cli integ-report --latest
```
## P1: Add Task-Test Preflight To `validate`
Status: implemented.
`nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing.
Partially implemented at run time.
Current behavior:
- task command placeholders can render paths like `tests/test_task002.py`
- `run_task` preflight fails before invoking agents if the task-specific test file is missing
Better behavior:
```powershell
nightshift validate
```
should report each task's expected test file, and warn or fail when one is missing:
```text
TASK-003 expects tests/test_task003.py and it exists.
TASK-004 expects tests/test_task004.py and it exists.
```
This catches missing fixed tests earlier.
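The preflight idea can be sketched as: render the test command for each task, extract any `tests/...py` paths, and report the ones that do not exist. The regular expression has the same shape as `extract_test_file_paths` in this commit; `expected_test_paths` and `missing_test_files` are hypothetical helpers, not the shipped `task_tests` module.

```python
from __future__ import annotations

import re
from pathlib import Path

# Same pattern shape as extract_test_file_paths in this commit.
_TEST_PATH = re.compile(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)")


def expected_test_paths(command: str, task_id: str) -> list[str]:
    """Render task placeholders, then collect the test file paths the command names."""
    lowered = task_id.lower()
    rendered = command.format(
        task_id=task_id,
        task_id_lower=lowered,
        task_id_slug=lowered.replace("-", "_"),
        task_id_compact=lowered.replace("-", ""),
    )
    paths: list[str] = []
    for match in _TEST_PATH.finditer(rendered):
        path = match.group(1).replace("\\", "/")
        if path not in paths:
            paths.append(path)
    return paths


def missing_test_files(project_root: Path, command: str, task_ids: list[str]) -> dict[str, list[str]]:
    """Map each task id to the expected test files that are absent on disk."""
    missing: dict[str, list[str]] = {}
    for task_id in task_ids:
        absent = [p for p in expected_test_paths(command, task_id) if not (project_root / p).is_file()]
        if absent:
            missing[task_id] = absent
    return missing
```

An empty result means every task's test file exists; a non-empty one can be turned into the config error that `validate` raises.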
## P2: Add Run Comparison
New idea.
Useful once comparing 14B vs 30B:
```powershell
nightshift compare-runs --latest 5
```
Show:
- model
- task
- retries
- failure stage
- final reason
- runtime
- token estimate
This should come after `integ-test` and `integ-report`.


@ -1,4 +1,4 @@
-# Tutorial 03: Pastebin With Model Fallback And Telemetry
+# Tutorial 03: Pastebin With Fixed Tests And Telemetry

 This tutorial uses the `tutorial-pastebin` template: a small Flask snippet-hosting service designed for deterministic NightShift orchestration tests.
@ -19,6 +19,12 @@ For an isolated local integration run, use the integration sandbox command from
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```

+To create, set up, validate, and run one task in a single command:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 To create the sandbox and set up the Python project immediately:

 ```bash
@ -57,7 +63,7 @@ pyproject.toml
 README.md
 ```

-The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed `TASK-001` tests. The default tutorial pipeline asks the implementation agent to make those deterministic tests pass before review.
+The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed tests for each tutorial task. The default tutorial pipeline asks the implementation agent to make only the current task's deterministic tests pass before review.

 ## Prerequisites
@ -73,26 +79,22 @@ Install target dependencies:
 python -m pip install -e . pytest flask
 ```

-Install and start Ollama, then pull the fallback models you want available:
+Install and start Ollama, then pull the default pastebin model:

 ```bash
-ollama pull qwen2.5-coder:14b
-ollama pull carstenuhlig/omnicoder-9b
-ollama pull deepseek-coder-v2:16b
+ollama pull qwen3-coder:30b
 ollama list
 ```

 NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`.

-## Model Fallback
+## Model

-The implementation stage uses this fallback order:
+The default pastebin pipeline uses one strong local coder model:

-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+- `qwen3-coder:30b`

-NightShift records which agent/model handled each stage in `telemetry-summary.md`.
+NightShift records which agent/model handled each stage in `telemetry-summary.md`. Multi-candidate fallback belongs in a separate experiment template, not the default pastebin reliability harness.

 ## TDD Pipeline


@ -20,51 +20,49 @@ safety:
     - curl | bash

 experiment:
-  label: pastebin-model-fallback
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  label: pastebin-qwen3-coder
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1

 agents:
   planner:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.2
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/planner.md

-  implementer_qwen:
+  implementer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/implementer.md

   test_writer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/test-writer.md

-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
   debugger:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     role: debugger
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/debugger.md

   reviewer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/reviewer.md

 pipeline:
@ -87,10 +85,7 @@ pipeline:
     - id: implement
       type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
       output: proposed.patch

     - id: normalize


@ -228,8 +228,9 @@ class AgentExecutor:
             "prompt": prompt,
             "stream": False,
         }
-        if agent.temperature is not None:
-            body["options"] = {"temperature": agent.temperature}
+        options = _ollama_options(agent)
+        if options:
+            body["options"] = options
         headers = {"Content-Type": "application/json"}
         started = time.monotonic()
         self.logger.event(
@ -395,6 +396,21 @@ def build_prompt_bundle(
     )


+def _ollama_options(agent: AgentConfig) -> dict[str, object]:
+    options: dict[str, object] = {}
+    if agent.temperature is not None:
+        options["temperature"] = agent.temperature
+    if agent.num_ctx is not None:
+        options["num_ctx"] = agent.num_ctx
+    if agent.num_predict is not None:
+        options["num_predict"] = agent.num_predict
+    if agent.seed is not None:
+        options["seed"] = agent.seed
+    if agent.stop:
+        options["stop"] = list(agent.stop)
+    return options
+
+
 def _coerce_output(value: str | bytes | None) -> str:
     if value is None:
         return ""


@ -7,13 +7,16 @@ from pathlib import Path
 import sys

 from .config import validate_config
-from .errors import NightShiftError
+from .errors import ConfigError, NightShiftError
 from .init import available_templates, init_project
 from .integ import create_integration_run
+from .integ_report import build_integration_report, format_integration_report
 from .integ_setup import format_setup_result, setup_python_project
+from .integ_test import format_integration_test_result, run_integration_test
 from .pipeline import PipelineRunner
 from .runlog import RunLogger
 from .status import build_status, format_status
+from .task_tests import check_task_test_files, format_task_test_checks, missing_task_test_paths
 from .terminal import HOTDOG_ANIMATIONS, TerminalAnimation, format_banner, style_text
 from .tasks import (
     ensure_dependencies_satisfied,
@ -105,6 +108,33 @@ def build_parser() -> argparse.ArgumentParser:
         help="Print --setup commands without running them.",
     )

+    integ_test_parser = subparsers.add_parser(
+        "integ-test",
+        help="Create, set up, validate, and run an integration template task.",
+    )
+    integ_test_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is created.")
+    integ_test_parser.add_argument(
+        "--template",
+        default="tutorial-pastebin",
+        choices=available_templates(),
+        help="Template to initialize inside the sandbox.",
+    )
+    integ_test_parser.add_argument("--task", help="Specific task id to run.")
+    integ_test_parser.add_argument("--all", action="store_true", help="Run all runnable incomplete tasks.")
+    integ_test_parser.add_argument("--keep", type=int, help="Keep only the newest N old integration runs before creating a new one.")
+    integ_test_parser.add_argument(
+        "--setup-extra",
+        action="append",
+        default=["pytest"],
+        help="Extra package to install during setup. May be repeated. Defaults to pytest.",
+    )
+    integ_test_parser.add_argument("--setup-skip-validate", action="store_true", help="Skip validation during setup.")
+    integ_test_parser.add_argument("--dry-run", action="store_true", help="Print commands without running setup or tasks.")
+
+    integ_report_parser = subparsers.add_parser("integ-report", help="Summarize the latest integration run.")
+    integ_report_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is located.")
+    integ_report_parser.add_argument("--latest", action="store_true", help="Report the latest integration run.")
+
     setup_parser = subparsers.add_parser(
         "integ-setup",
         help="Set up a Python integration project venv and dependencies.",
@ -160,12 +190,18 @@ def main(argv: list[str] | None = None) -> int:
             config = validate_config(args.config)
             tasks = parse_task_file(config.project.root, config.project.task_file)
             validate_task_dependencies(tasks)
+            task_test_checks = check_task_test_files(config, tasks)
+            missing_task_tests = missing_task_test_paths(task_test_checks)
+            if missing_task_tests:
+                details = format_task_test_checks(task_test_checks)
+                raise ConfigError(f"Config error: missing configured task test files.\n{details}")
             incomplete = sum(1 for task in tasks if not task.completed)
             print(f"Config valid: {config.path}")
             print(f"Project: {config.project.name}")
             print(f"Stages: {len(config.pipeline.stages)}")
             print(f"Tasks: {len(tasks)}")
             print(f"Incomplete tasks: {incomplete}")
+            print(format_task_test_checks(task_test_checks))
             return 0

         if args.command == "run":
@ -256,6 +292,25 @@ def main(argv: list[str] | None = None) -> int:
             print(format_setup_result(result))
             return 0

+        if args.command == "integ-test":
+            result = run_integration_test(
+                args.root,
+                template=args.template,
+                task=args.task,
+                all_tasks=args.all,
+                keep=args.keep,
+                setup_extras=tuple(args.setup_extra or ()),
+                skip_setup_validate=args.setup_skip_validate,
+                dry_run=args.dry_run,
+            )
+            print(format_integration_test_result(result))
+            return result.exit_code
+
+        if args.command == "integ-report":
+            report = build_integration_report(args.root, latest=True)
+            print(format_integration_report(report))
+            return 0
+
     except NightShiftError as exc:
         print(str(exc), file=sys.stderr)
         return 1


@ -5,6 +5,7 @@ from __future__ import annotations
 from dataclasses import dataclass
 import os
 from pathlib import Path
+import re
 import shlex
 import subprocess
 import sys
@ -68,11 +69,16 @@ class CommandExecutor:
                 command_index=index,
                 command=command,
             )
+            rendered_command = render_command_template(command, task_id)
+            rendered_allowed_commands = tuple(
+                render_command_template(allowed, task_id) for allowed in self.safety.allowed_commands
+            )
             run = self.run_command(
-                command,
+                rendered_command,
                 shell=stage.shell,
                 timeout_seconds=stage.timeout_seconds,
                 working_dir=stage.working_dir,
+                allowed_commands=rendered_allowed_commands,
             )
             runs.append(run)
             self.logger.event(
@ -120,11 +126,12 @@ class CommandExecutor:
         shell: bool = True,
         timeout_seconds: int | None = None,
         working_dir: Path | None = None,
+        allowed_commands: tuple[str, ...] | None = None,
     ) -> CommandRun:
         try:
             normalized = ensure_command_allowed(
                 command,
-                self.safety.allowed_commands,
+                allowed_commands if allowed_commands is not None else self.safety.allowed_commands,
                 self.safety.forbidden_commands,
             )
         except SafetyError as exc:
@ -210,6 +217,27 @@ def format_command_runs(stage_id: str, runs: list[CommandRun]) -> str:
     return "\n".join(lines)
+def render_command_template(command: str, task_id: str) -> str:
+    task_id_lower = task_id.lower()
+    task_id_slug = task_id_lower.replace("-", "_")
+    task_id_compact = task_id_lower.replace("-", "")
+    return command.format(
+        task_id=task_id,
+        task_id_lower=task_id_lower,
+        task_id_slug=task_id_slug,
+        task_id_compact=task_id_compact,
+    )
+def extract_test_file_paths(command: str) -> tuple[str, ...]:
+    paths: list[str] = []
+    for match in re.finditer(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)", command):
+        path = match.group(1).replace("\\", "/")
+        if path not in paths:
+            paths.append(path)
+    return tuple(paths)
 def _coerce_output(value: str | bytes | None) -> str:
     if value is None:
         return ""
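As a rough illustration of the two helpers added in this hunk, the sketch below re-implements them in a standalone script and runs them on sample commands. The regex and placeholder names are copied from the diff; the sample commands are invented for illustration and are not taken from the template config.

```python
import re

# Pattern copied from the hunk above: a path that starts at "tests/" (or "tests\")
# and ends in ".py", not preceded by a word character, dot, slash, backslash or hyphen.
TEST_PATH = re.compile(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)")


def render_command_template(command: str, task_id: str) -> str:
    lower = task_id.lower()
    return command.format(
        task_id=task_id,
        task_id_lower=lower,
        task_id_slug=lower.replace("-", "_"),
        task_id_compact=lower.replace("-", ""),
    )


def extract_test_file_paths(command: str) -> tuple:
    paths = []
    for match in TEST_PATH.finditer(command):
        # Normalize Windows separators so duplicates collapse to one entry.
        path = match.group(1).replace("\\", "/")
        if path not in paths:
            paths.append(path)
    return tuple(paths)


rendered = render_command_template(
    "python -m pytest -q tests/test_{task_id_compact}.py", "TASK-001"
)
print(rendered)
print(extract_test_file_paths(rendered))
# A nested "src/tests/..." path is skipped because of the lookbehind.
print(extract_test_file_paths("pytest src/tests/test_other.py"))
```

Because rendering relies on `str.format`, a command that contains literal braces would need them doubled (`{{` and `}}`); otherwise `format` raises `KeyError` or `IndexError`.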


@ -46,6 +46,10 @@ class AgentConfig:
     temperature: float | None = None
     base_url: str | None = None
     api_key_env: str | None = None
+    num_ctx: int | None = None
+    num_predict: int | None = None
+    seed: int | None = None
+    stop: tuple[str, ...] = ()
 @dataclass(frozen=True)
@ -207,10 +211,18 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
             agent_raw.get("temperature"),
             f"agents.{agent_id}.temperature",
         )
+        num_ctx = _optional_int_or_none(agent_raw.get("num_ctx"), f"agents.{agent_id}.num_ctx")
+        num_predict = _optional_int_or_none(agent_raw.get("num_predict"), f"agents.{agent_id}.num_predict")
+        seed = _optional_int_or_none(agent_raw.get("seed"), f"agents.{agent_id}.seed")
+        stop = _string_tuple(agent_raw.get("stop", []), f"agents.{agent_id}.stop")
         if temperature is not None and temperature < 0:
             raise ConfigError(
                 f"Config error: agents.{agent_id}.temperature must be zero or greater."
             )
+        if num_ctx is not None and num_ctx <= 0:
+            raise ConfigError(f"Config error: agents.{agent_id}.num_ctx must be greater than zero.")
+        if num_predict is not None and num_predict <= 0:
+            raise ConfigError(f"Config error: agents.{agent_id}.num_predict must be greater than zero.")
         if backend not in {"command", "ollama", "openai_compatible"}:
             raise ConfigError(
                 f"Config error: agent '{agent_id}' uses unsupported backend '{backend}'. "
@ -243,6 +255,10 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
             temperature=temperature,
             base_url=base_url,
             api_key_env=api_key_env,
+            num_ctx=num_ctx,
+            num_predict=num_predict,
+            seed=seed,
+            stop=stop,
         )
     experiment_raw = raw.get("experiment", {})
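The body of `_optional_int_or_none` is not shown in this diff, so the following is only a sketch of how an optional, strictly positive integer field such as `num_ctx` or `num_predict` might be validated. The helper name `optional_positive_int` and the local `ConfigError` class are assumptions made so the example runs on its own.

```python
from __future__ import annotations


class ConfigError(ValueError):
    """Stand-in for the project's ConfigError, defined here so the sketch runs alone."""


def optional_positive_int(value: object, field: str) -> int | None:
    """Return None for a missing value, otherwise a validated positive int."""
    if value is None:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ConfigError(f"Config error: {field} must be an integer.")
    if value <= 0:
        raise ConfigError(f"Config error: {field} must be greater than zero.")
    return value


agent_raw = {"num_ctx": 8192}  # num_predict deliberately omitted
num_ctx = optional_positive_int(agent_raw.get("num_ctx"), "agents.planner.num_ctx")
num_predict = optional_positive_int(agent_raw.get("num_predict"), "agents.planner.num_predict")
print(num_ctx, num_predict)
```

Note that the hunk above applies the greater-than-zero check only to `num_ctx` and `num_predict`; `seed` is parsed but not range-checked there.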


@ -0,0 +1,71 @@
"""Summarize integration run artifacts."""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import re
from .errors import NightShiftError
@dataclass(frozen=True)
class IntegrationReport:
integration_run: Path
nightshift_run: Path | None
lines: tuple[str, ...]
def build_integration_report(root: str | Path = ".", *, latest: bool = True) -> IntegrationReport:
base = Path(root).resolve() / "integ_runs"
if not base.exists():
raise NightShiftError(f"Integration report error: no integ_runs directory found: {base}")
runs = sorted((path for path in base.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
if not runs:
raise NightShiftError(f"Integration report error: no integration runs found under: {base}")
integration_run = runs[0] if latest else runs[0]
artifacts_root = integration_run / "project" / ".nightshift" / "runs"
if not artifacts_root.exists():
return IntegrationReport(
integration_run,
None,
("No NightShift run artifacts found. Setup may have failed before task execution.",),
)
nightshift_runs = sorted((path for path in artifacts_root.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
if not nightshift_runs:
return IntegrationReport(integration_run, None, ("No NightShift run directories found.",))
nightshift_run = nightshift_runs[0]
summaries = sorted(nightshift_run.glob("tasks/*/run-summary.md"))
if not summaries and (nightshift_run / "run-summary.md").exists():
summaries = [nightshift_run / "run-summary.md"]
lines = [_summarize_run_summary(path, integration_run) for path in summaries]
return IntegrationReport(integration_run, nightshift_run, tuple(lines or ("No task summaries found.",)))
def format_integration_report(report: IntegrationReport) -> str:
lines = [f"Integration run: {report.integration_run}"]
if report.nightshift_run is not None:
lines.append(f"NightShift run: {report.nightshift_run}")
lines.append("")
lines.extend(f"- {line}" for line in report.lines)
return "\n".join(lines)
def _summarize_run_summary(path: Path, integration_run: Path) -> str:
text = path.read_text(encoding="utf-8", errors="replace")
task = _field(text, "Task") or path.parent.name
status = _field(text, "Status") or "unknown"
retries = _field(text, "Retry count") or "unknown"
reason = _field(text, "Reason") or "no reason recorded"
try:
relative = path.relative_to(integration_run)
except ValueError:
relative = path
return f"{task} {status} after {retries} retries. Reason: {reason}. Artifacts: {relative.parent}"
def _field(text: str, name: str) -> str | None:
match = re.search(rf"^- {re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE)
if not match:
return None
return match.group(1).strip()
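To make the field extraction in `_field` and `_summarize_run_summary` concrete, here is a small standalone sketch. The sample summary text is hypothetical: its `- Name: value` layout is inferred from the regex, not copied from a real `run-summary.md` artifact.

```python
import re

# Hypothetical run-summary.md content, shaped to match the "- Name: value" regex.
SAMPLE = """# Run Summary

- Task: TASK-001
- Status: complete
- Retry count: 1
"""


def field(text: str, name: str):
    """Return the value of a '- Name: value' bullet, or None when absent."""
    match = re.search(rf"^- {re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


task = field(SAMPLE, "Task")
status = field(SAMPLE, "Status") or "unknown"
retries = field(SAMPLE, "Retry count") or "unknown"
reason = field(SAMPLE, "Reason") or "no reason recorded"  # missing field falls back
print(f"{task} {status} after {retries} retries. Reason: {reason}.")
```

The `or` fallbacks mirror the report code: a summary that lacks a field still yields a readable line instead of raising.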

nightshift/integ_test.py Normal file

@ -0,0 +1,71 @@
"""End-to-end integration test wrapper."""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import subprocess
from .errors import NightShiftError
from .integ import IntegrationRun, create_integration_run
from .integ_setup import IntegrationSetupResult, setup_python_project
@dataclass(frozen=True)
class IntegrationTestResult:
run: IntegrationRun
setup: IntegrationSetupResult
command: tuple[str, ...]
exit_code: int
dry_run: bool
def run_integration_test(
root: str | Path = ".",
*,
template: str = "tutorial-pastebin",
task: str | None = None,
all_tasks: bool = False,
keep: int | None = None,
setup_extras: tuple[str, ...] = ("pytest",),
skip_setup_validate: bool = False,
dry_run: bool = False,
) -> IntegrationTestResult:
if task and all_tasks:
raise NightShiftError("Integration test error: use either --task or --all, not both.")
if not task and not all_tasks:
raise NightShiftError("Integration test error: provide --task or --all.")
run = create_integration_run(Path(root), template=template, keep=keep)
project = run.directory / "project"
setup = setup_python_project(
project,
extras=setup_extras,
validate=not skip_setup_validate,
dry_run=dry_run,
)
command = [str(setup.python), "-m", "nightshift.cli", "run", "--no-animation"]
if all_tasks:
command.append("--all")
else:
command.extend(["--task", task or ""])
exit_code = 0
if not dry_run:
completed = subprocess.run(command, cwd=project, text=True, encoding="utf-8", errors="replace")
exit_code = completed.returncode
return IntegrationTestResult(run, setup, tuple(command), exit_code, dry_run)
def format_integration_test_result(result: IntegrationTestResult) -> str:
lines = [
f"Integration run: {result.run.directory}",
f"Project: {result.run.directory / 'project'}",
f"Venv: {result.run.venv_dir}",
f"Run command: {' '.join(result.command)}",
f"Exit code: {result.exit_code}",
f"Artifacts: {result.run.directory / 'project' / '.nightshift'}",
]
if result.dry_run:
lines.insert(3, "Dry run: true")
return "\n".join(lines)
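The argument handling in `run_integration_test` can be sketched in isolation as follows. The function name `build_run_command` and the use of `ValueError` in place of `NightShiftError` are assumptions for this example; the command line itself follows the module above.

```python
import sys


def build_run_command(python: str, task, all_tasks: bool) -> list:
    """Build the `nightshift run` command, requiring exactly one of task / all_tasks."""
    if task and all_tasks:
        raise ValueError("use either --task or --all, not both")
    if not task and not all_tasks:
        raise ValueError("provide --task or --all")
    command = [python, "-m", "nightshift.cli", "run", "--no-animation"]
    command.extend(["--all"] if all_tasks else ["--task", task])
    return command


print(build_run_command(sys.executable, "TASK-001", False))
print(build_run_command(sys.executable, None, True))
```

Validating the two flags before any sandbox is created means a bad invocation fails without leaving a half-built `integ_runs/` directory behind, which matches the ordering in the module above.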


@ -9,7 +9,7 @@ import subprocess
 from .agents import AgentExecutor
 from .artifacts import ArtifactStore
-from .commands import CommandExecutor
+from .commands import CommandExecutor, extract_test_file_paths, render_command_template
 from .config import COMMAND_STAGE_TYPES, NightShiftConfig, StageConfig
 from .context import ContextManager
 from .dependencies import diagnose_python_dependencies, format_dependency_diagnostic
@ -145,6 +145,12 @@ class PipelineRunner:
         index = 0
         final_status = "complete"
         final_reason = "Pipeline completed."
+        preflight_result = self._preflight_task(task, stages)
+        if preflight_result:
+            stage_results.append(preflight_result)
+            final_status = "failed"
+            final_reason = preflight_result.reason
+            index = len(stages)
         while index < len(stages):
             stage = stages[index]
@ -248,6 +254,13 @@ class PipelineRunner:
                         "retry-memory.md",
                         summarize_retry_memory(tuple(retry_memory)),
                     )
+                    if _repeated_protected_path_violation(tuple(retry_memory)):
+                        final_status = "failed"
+                        final_reason = (
+                            "Escalation policy stopped retries: implementation repeatedly "
+                            "attempted to modify paths outside the stage allowlist."
+                        )
+                        break
                     decision = evaluate_retry_churn(
                         tuple(retry_memory),
                         retry_budget=self.config.pipeline.max_task_retries + 1,
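The escalation check called in this hunk stops retries once the two most recent retry entries both look like attempts to write under `tests/`. A minimal standalone sketch of that rule follows; `RetryEntry` is a hypothetical stand-in for `RetryMemoryEntry`, and the sample cause strings are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryEntry:
    """Hypothetical stand-in for RetryMemoryEntry; only `cause` is used here."""
    cause: str


def is_protected_path_violation(text: str) -> bool:
    lowered = text.lower()
    # Normalize Windows separators before looking for the protected directory.
    return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/")


def repeated_protected_path_violation(entries: tuple) -> bool:
    recent = entries[-2:]
    return len(recent) == 2 and all(is_protected_path_violation(e.cause) for e in recent)


violation = RetryEntry("Path tests/test_task001.py is not allowed for this stage.")
other = RetryEntry("pytest exited with code 1")
print(repeated_protected_path_violation((violation,)))
print(repeated_protected_path_violation((other, violation, violation)))
print(repeated_protected_path_violation((violation, other)))
```

Only the last two entries are inspected, so a single violation followed by an ordinary test failure does not trigger the stop.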
@ -334,6 +347,45 @@ class PipelineRunner:
             reason=final_reason,
         )
+    def _preflight_task(self, task: Task, stages: list[StageConfig]) -> StageResult | None:
+        missing_paths: list[str] = []
+        for stage in stages:
+            if stage.type not in COMMAND_STAGE_TYPES:
+                continue
+            for command in stage.commands:
+                rendered = render_command_template(command, task.id)
+                for path_text in extract_test_file_paths(rendered):
+                    if not (self.config.project.root / path_text).exists():
+                        missing_paths.append(path_text)
+        if not missing_paths:
+            return None
+        unique_paths = tuple(dict.fromkeys(missing_paths))
+        details = "\n".join(f"- `{path}`" for path in unique_paths)
+        output_path = self.artifacts.write_stage_output(
+            task.id,
+            "preflight.md",
+            "\n".join(
+                [
+                    "# Task Preflight",
+                    "",
+                    "Status: fail",
+                    "Reason: configured task test file is missing.",
+                    "",
+                    "## Missing Files",
+                    "",
+                    details,
+                    "",
+                ]
+            ),
+        )
+        return StageResult(
+            "preflight",
+            "fail",
+            "Task preflight failed: configured task test file is missing: "
+            + ", ".join(unique_paths),
+            output_path=str(output_path.relative_to(self.config.project.root)),
+        )
     def run_tasks(self, tasks: list[Task] | tuple[Task, ...]) -> MultiTaskResult:
         self.artifacts.initialize_run()
         self.logger.bind(self.artifacts)
@ -1428,6 +1480,18 @@ def _extract_exit_code(text: str) -> int | None:
     return None
+def _repeated_protected_path_violation(entries: tuple[RetryMemoryEntry, ...]) -> bool:
+    recent = entries[-2:]
+    if len(recent) < 2:
+        return False
+    return all(_is_protected_path_violation(entry.cause) for entry in recent)
+def _is_protected_path_violation(text: str) -> bool:
+    lowered = text.lower()
+    return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/")
 def format_aggregate_run_summary(results: list[PipelineResult], status: str, reason: str) -> str:
     lines = [
         "# Run Summary",


@ -1,9 +1,11 @@
 You are the debugger agent for the NightShift pastebin tutorial.
 Diagnose failed attempts without editing files.
-Distinguish inaccurate generated tests from implementation bugs.
-If tests are inaccurate for the current task, recommend retrying `write_tests`.
+Distinguish fixed-test/template problems from implementation bugs.
+This tutorial uses fixed task tests and task-specific pytest commands. Do not recommend `write_tests` unless the configured pipeline actually has a `write_tests` stage.
+If a current task appears to lack tests, report a template or test-selection problem.
 If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified.
+Implementation agents must not edit files under `tests/`.
 Return:
 - concise diagnosis
 - recommended next action


@ -7,8 +7,10 @@ Do not add behavior for future tasks unless needed to satisfy the current tests.
 Use Flask and `sqlite3` from the Python standard library. Do not use SQLAlchemy, Flask-SQLAlchemy, or undeclared dependencies.
 Keep the public package name `pastebin_app`.
 Keep the public app entry point `create_app(database_path: str | None = None)`.
+Respect `database_path`; do not hard-code `snippets.db` when a database path is supplied.
 Tests should interact through HTTP routes and `create_app`, not through ORM/session globals.
 Do not use `app.before_first_request`; recent Flask versions removed it. Initialize required database tables inside `create_app` or inside the route helper before use.
+When adding columns to an existing sqlite table, handle existing databases idempotently with `ALTER TABLE` checks or a simple migration helper. `CREATE TABLE IF NOT EXISTS` does not add columns to an existing table.
 Output only complete file content blocks.
 Use one fenced block per file:
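Separately from the output-format rules, the prompt's guidance on idempotent column additions can be sketched as below. This is one possible migration helper, not the tutorial's reference solution; the `ensure_column` name and the `snippets` table layout are assumptions for illustration.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, ddl_type: str) -> None:
    """Add `column` to `table` only if it is not already present."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        # Identifiers cannot be bound as parameters, so pass trusted constants only.
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")


conn = sqlite3.connect(":memory:")
# CREATE TABLE IF NOT EXISTS leaves an existing table untouched, hence the helper.
conn.execute("CREATE TABLE IF NOT EXISTS snippets (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
ensure_column(conn, "snippets", "language", "TEXT")
ensure_column(conn, "snippets", "language", "TEXT")  # second call is a no-op
columns = [row[1] for row in conn.execute("PRAGMA table_info(snippets)")]
print(columns)
```

Calling the helper from `create_app` on every start keeps older database files usable without a separate migration step.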


@ -14,6 +14,12 @@ Or create an isolated integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```
+To create, set up, validate, and run one task in a single command:
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
 To create the sandbox and set it up in one step:
 ```bash
@ -48,12 +54,8 @@ nightshift what-happened
 When running from an integration sandbox, the same commands are run inside `integ_runs/<timestamp>/project`.
-The pipeline uses model fallback ordering for implementation attempts:
-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+The default pastebin pipeline uses `qwen3-coder:30b` for planning, implementation, debugging, test review, and final review. It intentionally does not use multi-candidate fallback; pastebin is the deterministic reliability harness.
 Telemetry artifacts record which agent/model handled each stage and estimate token usage.
-This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them.
+This template uses fixed task-specific pytest files. The pipeline starts with a skeletal package, implements only the current task, runs `tests/test_{task_id_compact}.py`, and then reviews the result.


@ -20,51 +20,49 @@ safety:
     - curl | bash
 experiment:
-  label: pastebin-model-fallback
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  label: pastebin-qwen3-coder
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1
 agents:
   planner:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.2
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/planner.md
-  implementer_qwen:
+  implementer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/implementer.md
   test_writer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/test-writer.md
-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
   debugger:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     role: debugger
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/debugger.md
   reviewer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/reviewer.md
 pipeline:
@ -87,10 +85,7 @@ pipeline:
     - id: implement
       type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
       output: proposed.patch
     - id: normalize
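The agent settings in this config (`temperature`, `num_ctx`, `num_predict`, plus the optional `seed` and `stop`) correspond to entries in the `options` object of an Ollama generate request. How NightShift actually forwards them is not shown in this diff, so the sketch below is an assumption: `AgentOptions` and `ollama_payload` are hypothetical names, and only unset-field omission is illustrated.

```python
from dataclasses import dataclass
import json
from typing import Optional


@dataclass(frozen=True)
class AgentOptions:
    """Hypothetical subset of AgentConfig carrying only the sampling fields."""
    model: str
    temperature: Optional[float] = None
    num_ctx: Optional[int] = None
    num_predict: Optional[int] = None
    seed: Optional[int] = None
    stop: tuple = ()


def ollama_payload(agent: AgentOptions, prompt: str) -> dict:
    options = {
        "temperature": agent.temperature,
        "num_ctx": agent.num_ctx,
        "num_predict": agent.num_predict,
        "seed": agent.seed,
    }
    # Drop unset fields so the server's own defaults apply.
    options = {key: value for key, value in options.items() if value is not None}
    if agent.stop:
        options["stop"] = list(agent.stop)
    return {"model": agent.model, "prompt": prompt, "stream": False, "options": options}


planner = AgentOptions("qwen3-coder:30b", temperature=0.2, num_ctx=8192, num_predict=4096)
print(json.dumps(ollama_payload(planner, "Plan TASK-001"), indent=2))
```

Leaving unset fields out of `options`, rather than sending nulls, keeps older configs that never mention these keys behaving as before.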


@ -16,6 +16,7 @@ def test_create_snippet_returns_created_snippet_id(tmp_path):
     assert response.status_code == 201
     data = response.get_json()
     assert isinstance(data["id"], int)
+    assert (tmp_path / "snippets.db").exists()
 def test_view_snippet_returns_persisted_fields(tmp_path):
@ -38,6 +39,7 @@ def test_view_snippet_returns_persisted_fields(tmp_path):
         "title": "View me",
         "body": "stored body",
     }
+    assert (tmp_path / "snippets.db").exists()
 def test_view_missing_snippet_returns_404(tmp_path):


@ -0,0 +1,50 @@
from pastebin_app.app import create_app
def test_create_snippet_accepts_optional_metadata(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
response = client.post(
"/snippets",
json={
"title": "Tagged",
"body": "metadata body",
"language": "python",
"tags": ["alpha", "beta"],
"expires_at": "2030-01-01T00:00:00",
},
)
assert response.status_code == 201
assert isinstance(response.get_json()["id"], int)
assert (tmp_path / "snippets.db").exists()
def test_view_snippet_returns_optional_metadata(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
created = client.post(
"/snippets",
json={
"title": "Tagged",
"body": "metadata body",
"language": "python",
"tags": ["alpha", "beta"],
"expires_at": "2030-01-01T00:00:00",
},
).get_json()
response = client.get(f"/snippets/{created['id']}")
assert response.status_code == 200
assert response.get_json() == {
"id": created["id"],
"title": "Tagged",
"body": "metadata body",
"language": "python",
"tags": ["alpha", "beta"],
"expires_at": "2030-01-01T00:00:00",
}
assert (tmp_path / "snippets.db").exists()


@ -0,0 +1,47 @@
from pastebin_app.app import create_app
def _create(client, title, body, **metadata):
response = client.post("/snippets", json={"title": title, "body": body, **metadata})
assert response.status_code == 201
return response.get_json()["id"]
def test_list_snippets_newest_first(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
first_id = _create(client, "First", "older")
second_id = _create(client, "Second", "newer")
response = client.get("/snippets")
assert response.status_code == 200
ids = [snippet["id"] for snippet in response.get_json()]
assert ids[:2] == [second_id, first_id]
def test_search_filters_by_title_or_body(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
_create(client, "Python note", "ordinary body")
_create(client, "Other", "contains needle")
response = client.get("/snippets?q=python")
assert [snippet["title"] for snippet in response.get_json()] == ["Python note"]
response = client.get("/snippets?q=needle")
assert [snippet["title"] for snippet in response.get_json()] == ["Other"]
def test_language_and_tag_filters(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
_create(client, "Python", "body", language="python", tags=["code", "demo"])
_create(client, "Text", "body", language="text", tags=["notes"])
response = client.get("/snippets?language=python")
assert [snippet["title"] for snippet in response.get_json()] == ["Python"]
response = client.get("/snippets?tag=notes")
assert [snippet["title"] for snippet in response.get_json()] == ["Text"]


@ -0,0 +1,43 @@
from pastebin_app.app import create_app
def test_expired_snippets_are_excluded_from_listing(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
client.post(
"/snippets",
json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
)
active = client.post(
"/snippets",
json={"title": "Active", "body": "new", "expires_at": "2999-01-01T00:00:00"},
).get_json()
response = client.get("/snippets")
assert response.status_code == 200
assert [snippet["id"] for snippet in response.get_json()] == [active["id"]]
def test_direct_lookup_of_expired_snippet_returns_410(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
expired = client.post(
"/snippets",
json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
).get_json()
response = client.get(f"/snippets/{expired['id']}")
assert response.status_code == 410
def test_non_expiring_snippet_remains_visible(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
created = client.post("/snippets", json={"title": "Forever", "body": "body"}).get_json()
response = client.get(f"/snippets/{created['id']}")
assert response.status_code == 200
assert response.get_json()["title"] == "Forever"


@ -0,0 +1,46 @@
from pastebin_app.app import create_app
def test_root_shows_snippet_list_html(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
client.post("/snippets", json={"title": "Visible", "body": "body"})
response = client.get("/")
assert response.status_code == 200
assert "Visible" in response.get_data(as_text=True)
def test_new_snippet_form_loads(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
response = client.get("/new")
assert response.status_code == 200
html = response.get_data(as_text=True)
assert 'name="title"' in html
assert 'name="body"' in html
assert 'name="language"' in html
assert 'name="tags"' in html
assert 'name="expires_at"' in html
def test_form_post_redirects_to_snippet_view(tmp_path):
app = create_app(database_path=str(tmp_path / "snippets.db"))
client = app.test_client()
response = client.post(
"/new",
data={
"title": "Form title",
"body": "Form body",
"language": "text",
"tags": "forms,html",
"expires_at": "",
},
)
assert response.status_code == 302
assert response.headers["Location"].endswith("/snippets/1")

nightshift/task_tests.py Normal file

@ -0,0 +1,48 @@
"""Task-specific test file validation."""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from .commands import extract_test_file_paths, render_command_template
from .config import COMMAND_STAGE_TYPES, NightShiftConfig
from .tasks import Task
@dataclass(frozen=True)
class TaskTestCheck:
task_id: str
path: str
exists: bool
def check_task_test_files(config: NightShiftConfig, tasks: tuple[Task, ...] | list[Task]) -> tuple[TaskTestCheck, ...]:
checks: list[TaskTestCheck] = []
for task in tasks:
seen: set[str] = set()
for stage in config.pipeline.stages:
if stage.type not in COMMAND_STAGE_TYPES:
continue
for command in stage.commands:
rendered = render_command_template(command, task.id)
for path_text in extract_test_file_paths(rendered):
if path_text in seen:
continue
seen.add(path_text)
checks.append(TaskTestCheck(task.id, path_text, (config.project.root / path_text).exists()))
return tuple(checks)
def format_task_test_checks(checks: tuple[TaskTestCheck, ...]) -> str:
if not checks:
return "Task test files: no task-specific test paths detected."
lines = ["Task test files:"]
for check in checks:
status = "ok" if check.exists else "missing"
lines.append(f"- {check.task_id}: {check.path} ({status})")
return "\n".join(lines)
def missing_task_test_paths(checks: tuple[TaskTestCheck, ...]) -> tuple[Path, ...]:
return tuple(Path(check.path) for check in checks if not check.exists)
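The existence check and report format in this module can be tried out against a temporary directory. The sketch below is simplified: it takes already-rendered paths instead of a `NightShiftConfig`, and the `check_paths` name and sample task ids are assumptions for illustration.

```python
from pathlib import Path
import tempfile


def check_paths(root: Path, expected: dict) -> list:
    """Return one report line per (task id, relative test path) pair."""
    lines = ["Task test files:"]
    for task_id, rel_path in expected.items():
        status = "ok" if (root / rel_path).exists() else "missing"
        lines.append(f"- {task_id}: {rel_path} ({status})")
    return lines


with tempfile.TemporaryDirectory() as directory:
    root = Path(directory)
    (root / "tests").mkdir()
    # Only TASK-001 gets a test file, so TASK-002 should be reported as missing.
    (root / "tests" / "test_task001.py").write_text(
        "def test_ok():\n    assert True\n", encoding="utf-8"
    )
    report = check_paths(
        root,
        {"TASK-001": "tests/test_task001.py", "TASK-002": "tests/test_task002.py"},
    )
print("\n".join(report))
```

In the commit, `nightshift validate` raises a `ConfigError` with this kind of listing when any path is missing, so a misnamed test file is caught before a model run is spent on it.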


@ -6,6 +6,7 @@ from nightshift.artifacts import ArtifactStore
 from nightshift.commands import CommandExecutor
 from nightshift.commands import CommandRun, format_command_runs
 from nightshift.commands import _command_env
+from nightshift.commands import render_command_template
 from nightshift.config import SafetyConfig, StageConfig
 from nightshift.errors import CommandError
 import sys
@ -16,6 +17,13 @@ FAILING_COMMAND = 'python -c "import sys; print(\'bad\'); sys.exit(7)"'
 class CommandExecutorTests(unittest.TestCase):
+    def test_render_command_template_includes_task_id_variants(self) -> None:
+        command = "python -m pytest -q tests/test_{task_id_compact}.py # {task_id_slug} {task_id}"
+        rendered = render_command_template(command, "TASK-001")
+        self.assertEqual(rendered, "python -m pytest -q tests/test_task001.py # task_001 TASK-001")
     def test_passing_command_stage_returns_pass_and_writes_output(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)
@ -46,6 +54,33 @@ class CommandExecutorTests(unittest.TestCase):
             self.assertIn("Exit code: 0", output)
             self.assertIn("ok", output)
+    def test_command_stage_renders_task_id_before_allowlist_check(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            artifacts = ArtifactStore(root, ".nightshift", run_id="test-run")
+            executor = CommandExecutor(
+                root,
+                SafetyConfig(
+                    require_clean_worktree=False,
+                    scoped_paths=(".",),
+                    allowed_commands=('python -c "print(\'{task_id_compact}\')"',),
+                    forbidden_commands=("rm -rf",),
+                ),
+                artifacts,
+            )
+            stage = StageConfig(
+                id="test",
+                type="command",
+                commands=('python -c "print(\'{task_id_compact}\')"',),
+                output="test-output.txt",
+            )
+            result = executor.run_stage(stage, "TASK-002")
+            self.assertEqual(result.status, "pass")
+            output = (root / result.output_path).read_text(encoding="utf-8")
+            self.assertIn("task002", output)
     def test_failing_command_stage_returns_fail_and_writes_output(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)


@ -282,6 +282,27 @@ class ConfigTests(unittest.TestCase):
             self.assertEqual(config.agents["planner"].temperature, 0.2)
+    def test_agent_ollama_options_load(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            init_project(root)
+            config_path = root / "nightshift.yaml"
+            config_path.write_text(
+                config_path.read_text(encoding="utf-8").replace(
+                    " system_prompt: agents/planner.md",
+                    " system_prompt: agents/planner.md\n num_ctx: 8192\n num_predict: 4096\n seed: 1\n stop:\n - STOP",
+                    1,
+                ),
+                encoding="utf-8",
+            )
+            config = load_config(config_path)
+            self.assertEqual(config.agents["planner"].num_ctx, 8192)
+            self.assertEqual(config.agents["planner"].num_predict, 4096)
+            self.assertEqual(config.agents["planner"].seed, 1)
+            self.assertEqual(config.agents["planner"].stop, ("STOP",))
     def test_agent_temperature_must_be_number(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)


@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase):
         self.assertIn("tutorial-imageboard", available_templates())
         self.assertIn("tutorial-pastebin", available_templates())
-    def test_init_pastebin_template_creates_skeleton_and_model_fallback_config(self) -> None:
+    def test_init_pastebin_template_creates_skeleton_and_qwen3_config(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)
@ -78,11 +78,15 @@ class InitProjectTests(unittest.TestCase):
             self.assertIn("type: semantic_context", config)
             self.assertNotIn("id: write_tests", config)
             self.assertNotIn("id: review_tests", config)
-            self.assertIn("python -m pytest -q tests", config)
+            self.assertIn("python -m pytest -q tests/test_{task_id_compact}.py", config)
             self.assertIn("max_task_retries: 6", config)
-            self.assertIn("implementer_qwen", config)
-            self.assertIn("carstenuhlig/omnicoder-9b", config)
-            self.assertIn("deepseek-coder-v2:16b", config)
+            self.assertIn("implementer:", config)
+            self.assertIn("qwen3-coder:30b", config)
+            self.assertIn("num_ctx: 8192", config)
+            self.assertIn("num_predict: 4096", config)
+            self.assertNotIn("agent_pool:", config)
+            self.assertNotIn("carstenuhlig/omnicoder-9b", config)
+            self.assertNotIn("deepseek-coder-v2:16b", config)
     def test_pastebin_example_tutorial_docs_exist(self) -> None:
         root = Path(__file__).resolve().parents[1]

51
tests/test_integ_test.py Normal file

@@ -0,0 +1,51 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.integ_report import build_integration_report, format_integration_report
+from nightshift.integ_test import format_integration_test_result, run_integration_test
+
+
+class IntegrationTestCommandTests(unittest.TestCase):
+    def test_run_integration_test_dry_run_builds_task_command(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            result = run_integration_test(
+                directory,
+                template="tutorial-pastebin",
+                task="TASK-001",
+                dry_run=True,
+            )
+            rendered = format_integration_test_result(result)
+            self.assertIn("Dry run: true", rendered)
+            self.assertIn("TASK-001", " ".join(result.command))
+            self.assertTrue((result.run.directory / "project" / "nightshift.yaml").exists())
+
+    def test_build_integration_report_summarizes_latest_task_summary(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            summary = root / "integ_runs" / "20260521T000000.000000Z" / "project" / ".nightshift" / "runs" / "run1" / "tasks" / "TASK-001" / "run-summary.md"
+            summary.parent.mkdir(parents=True)
+            summary.write_text(
+                "\n".join(
+                    [
+                        "# Run Summary",
+                        "",
+                        "- Task: TASK-001",
+                        "- Status: complete",
+                        "- Retry count: 1",
+                        "- Reason: Done.",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+            report = build_integration_report(root)
+            rendered = format_integration_report(report)
+            self.assertIn("TASK-001 complete after 1 retries", rendered)
+            self.assertIn("Reason: Done.", rendered)
+
+
+if __name__ == "__main__":
+    unittest.main()
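
The report test writes a `run-summary.md` made of `- Key: value` bullets and expects a one-line digest such as `TASK-001 complete after 1 retries` plus the reason. One way such a summary could be reduced to that digest is sketched below. The helpers `parse_summary` and `format_summary` are hypothetical names for illustration, not the functions in `nightshift.integ_report`, and the real report may include more fields.

```python
import re
from typing import Dict

# Matches bullet lines of the form "- Key: value".
FIELD = re.compile(r"^- (?P<key>[^:]+): (?P<value>.*)$")


def parse_summary(text: str) -> Dict[str, str]:
    """Collect "- Key: value" bullets into a dict keyed by lowercase name."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group("key").strip().lower()] = match.group("value").strip()
    return fields


def format_summary(fields: Dict[str, str]) -> str:
    """Render the digest line, followed by the reason when one is present."""
    lines = [
        f"{fields.get('task', '?')} {fields.get('status', 'unknown')} "
        f"after {fields.get('retry count', '0')} retries"
    ]
    if "reason" in fields:
        lines.append(f"Reason: {fields['reason']}")
    return "\n".join(lines)


SUMMARY = "\n".join(
    [
        "# Run Summary",
        "",
        "- Task: TASK-001",
        "- Status: complete",
        "- Retry count: 1",
        "- Reason: Done.",
    ]
)

rendered = format_summary(parse_summary(SUMMARY))
print(rendered)
```

Parsing by key rather than by line position means extra bullets or a reordered summary do not change the digest.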


@@ -105,6 +105,29 @@ class PipelineRunnerTests(unittest.TestCase):
             )
             self.assertIn("Modified Files", (root / ".nightshift" / "runs" / "test-run" / "run-summary.md").read_text(encoding="utf-8"))

+    def test_task_preflight_fails_when_task_specific_test_file_is_missing(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            _write_common_files(root)
+            stages = (
+                StageConfig(
+                    id="test",
+                    type="command",
+                    commands=("python -m pytest -q tests/test_{task_id_compact}.py",),
+                    output="test-output.txt",
+                ),
+            )
+            config = make_config(root, stages, max_retries=0)
+            runner = PipelineRunner(config, ArtifactStore(root, ".nightshift", run_id="test-run"))
+            task = parse_tasks(TASK_MD)[0]
+            result = runner.run_task(task)
+            self.assertEqual(result.status, "failed")
+            self.assertIn("configured task test file is missing", result.reason)
+            task_dir = root / ".nightshift" / "runs" / "test-run" / "tasks" / task.id
+            self.assertIn("tests/test_task001.py", (task_dir / "preflight.md").read_text(encoding="utf-8"))
+
     def test_review_can_retry_implementation_until_limit(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)

77
tests/test_task_tests.py Normal file

@@ -0,0 +1,77 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.config import validate_config
+from nightshift.task_tests import check_task_test_files, missing_task_test_paths
+from nightshift.tasks import parse_task_file
+
+
+class TaskTestValidationTests(unittest.TestCase):
+    def test_check_task_test_files_renders_task_placeholder(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            (root / "agents").mkdir()
+            (root / "agents" / "planner.md").write_text("Prompt", encoding="utf-8")
+            (root / "tests").mkdir()
+            (root / "tests" / "test_task001.py").write_text("def test_ok():\n assert True\n", encoding="utf-8")
+            (root / "nightshift.yaml").write_text(
+                "\n".join(
+                    [
+                        "project:",
+                        " name: task-test-validation",
+                        " root: .",
+                        " task_file: tasks.md",
+                        " artifact_dir: .nightshift",
+                        "",
+                        "safety:",
+                        " require_clean_worktree: false",
+                        " scoped_paths:",
+                        " - .",
+                        " allowed_commands:",
+                        " - python -m pytest -q tests/test_{task_id_compact}.py",
+                        " forbidden_commands:",
+                        " - rm -rf",
+                        "",
+                        "agents:",
+                        " planner:",
+                        " backend: command",
+                        " command: python -c \"print('ok')\"",
+                        " system_prompt: agents/planner.md",
+                        "",
+                        "pipeline:",
+                        " stages:",
+                        " - id: test",
+                        " type: command",
+                        " commands:",
+                        " - python -m pytest -q tests/test_{task_id_compact}.py",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+            (root / "tasks.md").write_text(
+                """# Tasks
+- [ ] TASK-001: One
+Acceptance Criteria:
+- passes
+- [ ] TASK-002: Two
+Acceptance Criteria:
+- reports missing test
+""",
+                encoding="utf-8",
+            )
+            config = validate_config(root / "nightshift.yaml")
+            tasks = parse_task_file(config.project.root, config.project.task_file)
+            checks = check_task_test_files(config, tasks)
+            self.assertEqual([check.path for check in checks], ["tests/test_task001.py", "tests/test_task002.py"])
+            self.assertEqual(tuple(path.as_posix() for path in missing_task_test_paths(checks)), ("tests/test_task002.py",))
+
+
+if __name__ == "__main__":
+    unittest.main()
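
The tests above imply that `{task_id_compact}` turns `TASK-001` into `task001`, and that validation reports which rendered test files are absent. A minimal sketch of that idea follows, under stated assumptions: "compact" is taken to mean lowercasing and dropping non-alphanumeric characters, test paths are taken to be the `.py` tokens of the rendered command, and `compact_task_id`, `expected_test_paths`, and `missing_paths` are hypothetical helpers, not the API of `nightshift.task_tests`.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple

PLACEHOLDER = "{task_id_compact}"


def compact_task_id(task_id: str) -> str:
    """Assumed rule: lowercase and drop anything that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", task_id.lower())


def expected_test_paths(commands: Iterable[str], task_ids: Iterable[str]) -> List[str]:
    """Render the placeholder per task and collect the .py tokens, in order."""
    templates = [command for command in commands if PLACEHOLDER in command]
    paths: List[str] = []
    for task_id in task_ids:
        for template in templates:
            rendered = template.replace(PLACEHOLDER, compact_task_id(task_id))
            for token in rendered.split():
                if token.endswith(".py") and token not in paths:
                    paths.append(token)
    return paths


def missing_paths(root: Path, paths: Iterable[str]) -> Tuple[str, ...]:
    """Return the expected paths that do not exist under root."""
    return tuple(path for path in paths if not (root / path).is_file())


COMMANDS = ["python -m pytest -q tests/test_{task_id_compact}.py"]

with tempfile.TemporaryDirectory() as directory:
    root = Path(directory)
    (root / "tests").mkdir()
    (root / "tests" / "test_task001.py").write_text(
        "def test_ok():\n    assert True\n", encoding="utf-8"
    )
    paths = expected_test_paths(COMMANDS, ["TASK-001", "TASK-002"])
    missing = missing_paths(root, paths)

print(missing)  # ('tests/test_task002.py',)
```

Checking file existence before any stage runs lets a missing test file surface as a clear preflight failure rather than as a pytest collection error deep in a run.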