Add tutorial integration workflow helpers

- Add `integ-test` to create, set up, validate, and run integration template tasks
- Add `integ-report` to summarize latest integration run artifacts
- Switch default pastebin template from model fallback to single `qwen3-coder:30b`
- Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
- Add `nightshift validate` preflight for task-specific test files
- Update pastebin docs, config reference, and ideas tracking
- Add tests for integration helpers, task-test validation, config parsing, and template expectations
2026-06-14 18:18:36 +00:00 · 2026-05-21 03:46:27 -07:00 · 2026-05-21 03:46:27 -07:00 · f7fed4535b
commit f7fed4535b
parent e3679296fd
29 changed files with 1251 additions and 280 deletions
--- a/docs/bugfix_todo.md
+++ b/docs/bugfix_todo.md
@@ -1,195 +0,0 @@
 # Bugfix TODO
 ## Some issues going with run --all
 reason=Stage 'review' requested unknown next stage 'None'. Not every time. I think there's a pattern that is out of place here. Maybe it's related to the last task success? Or the last run?
 ## Going from individual tasks to --all fails
 If you do nightshift run --task TASK-001 and then that completes and then you go to nightshift run --all it fails on blocked by missing dependencies: TASK-001 . I think this is because the tasks get reset at the top of the run, but there is something marking completion of TASK-001 requiring manual reset.
 run --all should start at the first not done task (seems like it does)
 ## Some kind of tool install feature
 Continually fails on flask_sqlalchemy until I install that.
 ## Tutorial need to include . directory for imageboard
 ## Git status artifacts are noisy for non-git repositories
 Observed artifact:
 ```text
 # Git Status before
 Available: false
 Exit code: 128
 fatal: not a git repository (or any of the parent directories): .git
 ```
 Current behavior:
 - NightShift continues when `require_clean_worktree: false`.
 - `git-status-before.txt`, `git-status-after.txt`, and `diff.patch` may contain git errors.
 - This is technically safe, but confusing for users running quickstart/demo projects outside git.
 Desired behavior:
 - Detect non-git repositories explicitly.
 - Write a clearer artifact message such as:
 ```text
 Git repository: false
 Clean-worktree enforcement: skipped because require_clean_worktree is false
 Diff artifact: unavailable because project is not a git repository
 ```
 - Avoid treating non-git as a scary-looking failure when clean worktree is not required.
 Acceptance criteria:
 - Non-git projects produce readable git artifacts without fatal-looking output.
 - `require_clean_worktree: true` still fails safely in non-git projects.
 - Reports mention that git metadata/diff is unavailable because the project is not a git repo.
 ## Git safe.directory / ownership conflicts on Windows
 Observed context:
 - Git can report dubious ownership or safe-directory errors when a repo was created or managed by a different Windows user identity.
 - This may happen when using GitHub Desktop, WSL, admin shells, or multiple Windows accounts.
 Current behavior:
 - NightShift records the raw git error in artifacts.
 - If `require_clean_worktree: true`, NightShift blocks execution.
 - If `require_clean_worktree: false`, NightShift continues but git status/diff artifacts can look like hard failures.
 Desired behavior:
 - Detect common `dubious ownership` / `safe.directory` messages.
 - Write a clearer explanation in artifacts and reports.
 - Suggest the exact remediation outside NightShift, for example:
 ```powershell
 git config --global --add safe.directory <project-root>
 ```
 Acceptance criteria:
 - Safe-directory failures are classified separately from ordinary git failures.
 - Users get actionable guidance.
 - NightShift does not attempt to change global git config automatically.
 ## Clarify docs around git requirements
 Add to `QUICKSTART.md` and troubleshooting:
 - Git is optional when `require_clean_worktree: false`.
 - Git is required for clean-worktree enforcement and useful diffs.
 - Non-git projects can still run pipelines.
 - Git ownership/safe-directory errors affect git artifacts, not core task execution, unless clean-worktree enforcement is enabled.
 ## Console appears idle during long agent calls
 Current behavior:
 - Long Ollama calls can make `nightshift run` look frozen.
 - Progress is only visible by inspecting `.nightshift/` artifacts or `ollama ps`.
 Desired behavior:
 - Print stage start/finish messages to the console.
 - Include agent id, stage id, task id, and artifact path when available.
 - Do not stream model output yet; just show lifecycle progress.
 Acceptance criteria:
 - User can tell which stage is running.
 - Long-running model calls no longer look like a hung process.
 ## Ollama output can make review stages fail if not structured
 Current behavior:
 - Review stages require `status: pass | fail | retry | escalate`.
 - General-purpose model output may include prose before/after the structured fields.
 - If no valid status is found, the review stage fails.
 Desired behavior:
 - Keep strict structured review parsing, but improve prompt templates and error messages.
 - Artifact should clearly say the review output was unparseable and show the expected contract.
 Acceptance criteria:
 - Failed review parsing is easy to diagnose from `review.md` and `stage-results.md`.
 ## `echo` fake agents do not behave consistently across shells
 Current behavior:
 - Starter templates use `command: echo`.
 - Depending on shell/platform, `echo` may not preserve stdin or may only echo arguments.
 - This can make fake agent artifacts less useful.
 Desired behavior:
 - Replace fake-agent defaults with small Python one-liners or documented fake-agent scripts.
 - Keep examples cross-platform.
 Acceptance criteria:
 - Starter project produces predictable fake-agent output on Windows PowerShell/cmd and Unix shells.
 ## `unittest discover` behavior depends on test package layout
 Current behavior:
 - Python 3.14 returned `NO TESTS RAN` with exit code 5 for an example project until `tests/__init__.py` was added.
 - Users may hit the same issue in fresh target repos.
 Desired behavior:
 - Document this in troubleshooting.
 - Consider making quickstart templates include `tests/__init__.py`.
 Acceptance criteria:
 - Quickstart test command works in a fresh copied example.
 - Troubleshooting mentions what to do if `NO TESTS RAN` appears.
 ## Task completion can mark tasks complete even if no source changed
 Current behavior:
 - A pipeline can pass with fake agents and passing tests, then mark the task complete.
 - This is expected for fake/demo mode but surprising when users expect code edits.
 Desired behavior:
 - Add a warning when a task completes and git/diff detects no source changes, where git is available.
 - Documentation should explain fake-agent mode vs editing-agent mode.
 Acceptance criteria:
 - Users are less likely to mistake artifact generation for code modification.
 ## Dashboard requires Flask but dependency is optional
 Current behavior:
 - `nightshift web` fails with a helpful message if Flask is missing.
 - README mentions `pip install flask`, but install extras are not defined.
 Desired behavior:
 - Add an optional dependency group such as `nightshift[web]` later.
 - Keep graceful error behavior.
 Acceptance criteria:
 - Users have one documented install command for dashboard support.
--- a/docs/config-reference.md
+++ b/docs/config-reference.md
@@ -62,11 +62,19 @@ Ollama agent:
 ```yaml
 planner:
  backend: ollama
-  model: qwen2.5-coder:14b
+  model: qwen3-coder:30b
  base_url: http://localhost:11434
  system_prompt: agents/planner.md
  temperature: 0.2
+  num_ctx: 8192
+  num_predict: 4096
+  seed: 1
+  stop:
+    - STOP
 ```
+Optional Ollama generation options currently supported by NightShift are `temperature`, `num_ctx`, `num_predict`, `seed`, and `stop`.
 ## `pipeline`
 - `max_task_retries`: task retry limit.
@@ -76,6 +84,7 @@ planner:
 Command stage options:
 - `commands`: command strings.
 - Command strings may use task placeholders: `{task_id}`, `{task_id_lower}`, `{task_id_slug}`, and `{task_id_compact}`.
 - `shell`: defaults to true. Set false for argv-style execution.
 - `timeout_seconds`: per-stage timeout override.
 - `working_dir`: command working directory inside project root.
@@ -141,6 +150,12 @@ Create a local integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```
 Create, set up, validate, and run one task from the generated project directory:
 ```bash
 python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
 ```
 Set up the generated Python project:
 ```bash
@@ -161,6 +176,12 @@ Preview commands without running them:
 python -m nightshift.cli integ-setup --project integ_runs/<timestamp>/project --dry-run
 ```
 Summarize the latest integration artifact run:
 ```bash
 python -m nightshift.cli integ-report --latest
 ```
 To clean up old sandboxes before creating a new one, keep only the newest three existing runs:
 ```bash
@@ -169,8 +190,4 @@ python -m nightshift.cli integ-run --template tutorial-pastebin --keep 3
 ## Pastebin Tutorial
-`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, and implementation fallback order:
-- `qwen2.5-coder:14b`
-- `carstenuhlig/omnicoder-9b`
-- `deepseek-coder-v2:16b`
+`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, fixed task-specific tests, and a single default `qwen3-coder:30b` model path.
--- a/docs/future_ideas.md
+++ b/docs/future_ideas.md
@@ -0,0 +1,17 @@
 ### Future Ideas
 Not to implement until we get successful long running runs.
 ## I am realizing "templates" are abstracted from the user
 * I think templates will be a first class citizen, a package for deployments, and a harness for performance tests
 * These should live external to nightshift/project_templates as users will likely create their own 
 * one solution would be to reference two directories when looking up templates, builtin ones will be in nightshift/project_templates or users can define a templates directory in their nightshift config
 ## nightshift config
 * store user settings in ~/.nightshift/config.yaml
 * things like templates folder (can also live here)
 * maybe this is later
 ## A way to easily make A/B tests to benchmark models?
 * Right now I can do this manually, for example I want to run the tutorial-pastebin with qwen3.6:27b as the planner and qwen2.5-coder:14b as the coder, and another with qwen3.6:27b as both, etc.
 * Maybe there is a way to make it easier to do that, possibly by creating a template that can be controlled by a larger multi-run file?
 * This is probably for way later.
--- a/docs/ideas.md
+++ b/docs/ideas.md
@@ -0,0 +1,366 @@
 # Ideas TODO
 This file is now prioritized inline. Priority scale:
 - P0: do next; directly improves current feedback loop
 - P1: important after the current loop is usable
 - P2: useful, but only after basics are stable
 - P3: defer or maybe reject
 ## P0: Make Integration Tests Easy To Run
 Status: implemented.
 Implemented command:
 ```powershell
 python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
 ```
 It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command.
 Running integration tests is still too manual.
 Current process:
 - install the current version of NightShift
 - run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
 - copy the activation line from the output and run it
 - `cd` into the generated directory
 - run the task there, because running from the repo root does not find `nightshift.yaml`
 Recommendation: implement a wrapper command, not just a loose script.
 Target command:
 ```powershell
 python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
 ```
 It should:
 1. create the integration run
 2. set up the venv
 3. install NightShift from the current checkout
 4. run `nightshift validate`
 5. run the selected task from the generated project directory
 6. print final status and artifact path
 Useful variants:
 ```powershell
 python -m nightshift.cli integ-test --template tutorial-pastebin --all
 python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
 ```
 The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly.
 ## P0/P1: Remove Multi-Candidate Workflow From Default Pastebin
 Status: implemented for the default pastebin template and tutorial example.
 Original idea:
 - The multi-candidate workflow does not add as much as expected.
 - Keep it as an example, maybe `example-multiagent`.
 Recommendation: yes. Remove it from the default pastebin tutorial.
 Reason:
 - Pastebin is becoming the reliability harness.
 - Multi-candidate fallback makes artifacts harder to reason about.
 - It adds model variability while we are still debugging pipeline behavior.
 Better split:
 ```text
 tutorial-pastebin
 tutorial-pastebin-multiagent
 ```
 or:
 ```text
 examples/templates/multiagent-fallback
 ```
 Default pastebin should be boring:
 ```text
 planner -> semantic_context -> context -> implement -> validate -> test -> review
 ```
 Use one strong implementer first. Add fallback only in a separate experiment template.
 ## P1: Add A Qwen3 / 30B Pastebin Variant
 Status: implemented as the default pastebin model path using `qwen3-coder:30b`.
 Original idea:
 - Use a non-coder model for planner roles.
 - Try `qwen3.6:27b` for planning.
 - Use `qwen3-coder:30b` for implementer and code-heavy roles.
 Recommendation: viable, but make this a variant, not the default.
 kass reply- No lets make this the default. the qwen3-coder:30b is fast now for me for some reason.
 Suggested template/config:
 ```text
 tutorial-pastebin-qwen3
 ```
 Possible role split:
 - planner: `qwen3.6:27b`
 - reviewer/debugger: `qwen3.6:27b`
 - implementer: `qwen3-coder:30b` or exact local 30B coder model name
 Important: confirm exact model names with:
 ```powershell
 ollama list
 ```
 i did its `qwen3-coder:30b`
 Use 30B where it pays:
 - first implementation for hard tasks
 - repair after concrete test failure
 - schema/database changes
 - multi-file changes
 Do not blindly make every stage 30B if it is slow.
 reply: Its not slow now!`qwen3-coder:30b`
 ## P2: Expose More Model Parameters
 Status: implemented for the practical first set.
 Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to existing `temperature`.
 Original question:
 - What else besides temperature is available?
 - Are any worth optimizing?
 Likely useful for Ollama:
 - `temperature`
 - `num_ctx`
 - `num_predict`
 - `seed`
 - `stop`
 - maybe `top_p`, `top_k`, `repeat_penalty`
 Recommendation: add only a small practical set first.
 Useful config shape:
 ```yaml
 temperature: 0.1
 num_ctx: 8192
 num_predict: 4096
 seed: 1
 ```
 Most useful:
 - `num_ctx`: larger repo/task context
 - `num_predict`: caps runaway output
 - `seed`: reproducibility, if supported consistently
 - `temperature`: already useful; keep low for code
 - `stop`: could help enforce file-block or diff-only contracts
 Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it.
 reply: yup lets put this in the nightshift.yaml (optional parameters, if they arent in there that's fine, but we should offer them.)
 ## P1: Add Test Governance For Generated Tests
 Original idea:
 - Have a test governance layer for when agents write tests.
 - A reviewer validates alignment with acceptance criteria.
 Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.
 The previous failures proved test-writing agents will:
 - edit app code
 - import nonexistent modules
 - require undeclared dependencies
 - inspect implementation internals
 - write tests for future behavior
 Governance should be deterministic first, model-reviewed second.
 Deterministic checks:
 - test-writing stage may only touch `tests/`
 - tests compile
 - tests import only allowed public interfaces
 - tests do not import undeclared dependencies
 - tests do not define Flask routes or app implementation
 - test names match current task id or current artifact
 - no future-task keywords unless accepted by current task AC
 Then optional model reviewer checks acceptance-criteria alignment.
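
A rough sketch of the first deterministic check above (the test-writing stage may only touch `tests/`). The function name `paths_outside_tests` and the idea of feeding it the changed paths from a proposed patch are assumptions for illustration, not existing NightShift code.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def paths_outside_tests(changed_paths: list[str]) -> list[str]:
    """Return every changed path that is not safely inside tests/ (hypothetical helper)."""
    offenders: list[str] = []
    for raw in changed_paths:
        # Normalize Windows separators so both path styles are treated alike.
        parts = PurePosixPath(raw.replace("\\", "/")).parts
        # Reject paths outside tests/ and any path that climbs out with "..".
        if not parts or parts[0] != "tests" or ".." in parts:
            offenders.append(raw)
    return offenders


if __name__ == "__main__":
    print(paths_outside_tests(["tests/test_task001.py", "app/__init__.py"]))
```

A governance stage could fail the test-writing step whenever the returned list is non-empty, before any model reviewer runs.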
 ## P2: Add A Test Analyzer Agent For TDD
 Original idea:
 - Analyze tests.
 - Translate them into direct instructions for the implementer.
 - Maybe implement using agent YAML definitions without new NightShift features.
 Recommendation: viable, but defer until generated tests are stable.
 Possible pipeline:
 ```text
 write_tests -> validate_tests -> analyze_tests -> implement
 ```
 Analyzer output should be concrete:
 ```text
 Implementation requirements:
 - create_app(database_path) must return a Flask app.
 - POST /snippets must return 201 and JSON id.
 - GET /snippets/<id> must return persisted fields.
 Do not modify:
 - tests/test_task001.py
 ```
 This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.
 ## P2/P3: Add A Test Planner
 Original idea:
 - A test planner understands acceptance criteria and code.
 - Provides input to the next stage about constraints and code, especially for non-TDD.
 Recommendation: maybe, but defer.
 This overlaps with:
 - planner
 - test analyzer
 - test governance
 Too many planning-ish stages can make the pipeline bloated and contradictory.
 If implemented later, keep it focused:
 ```text
 test_planner -> write_tests -> test_governance -> implement
 ```
 For now, fold this idea into the future test governance/analyzer work.
 ## P1: Add Fixed Tests For All Pastebin Tasks
 Status: mostly implemented in the template.
 Current fixed tests:
 ```text
 tests/test_task001.py
 tests/test_task002.py
 tests/test_task003.py
 tests/test_task004.py
 tests/test_task005.py
 ```
 Important design:
 ```yaml
 python -m pytest -q tests/test_{task_id_compact}.py
 ```
 This lets all future task tests exist without breaking earlier tasks.
 Next step: validate these through integration runs, one task at a time.
 ## P1: Add `nightshift integ-report`
 Status: implemented as a first-pass artifact summarizer.
 New idea.
 Summarize latest integration run across tasks:
 ```text
 TASK-001 complete in 1 retry
 TASK-002 failed at validate_patch
 Root cause: protected tests modified
 Artifacts: ...
 ```
 Right now we inspect artifacts manually. NightShift should do more of that.
 Possible command:
 ```powershell
 python -m nightshift.cli integ-report --latest
 ```
 ## P1: Add Task-Test Preflight To `validate`
 Status: implemented.
 `nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing.
 Partially implemented at run time.
 Current behavior:
 - task command placeholders can render paths like `tests/test_task002.py`
 - `run_task` preflight fails before invoking agents if the task-specific test file is missing
 Better behavior:
 ```powershell
 nightshift validate
 ```
 should warn or fail:
 ```text
 TASK-003 expects tests/test_task003.py and it exists.
 TASK-004 expects tests/test_task004.py and it exists.
 ```
 This catches missing fixed tests earlier.
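
A minimal sketch of that preflight idea, assuming task commands use the `{task_id}` and `{task_id_compact}` placeholders. `missing_task_tests` and the simplified `TEST_PATH` pattern are hypothetical; the real check is imported from `nightshift/task_tests.py`, whose body is not shown in this commit view.

```python
from __future__ import annotations

import re
from pathlib import Path

# Simplified pattern for illustration; not the regex used in nightshift/commands.py.
TEST_PATH = re.compile(r"tests/[^\s]+\.py")


def missing_task_tests(root: Path, task_ids: list[str], commands: list[str]) -> dict[str, list[str]]:
    """Map each task id to the rendered test paths that do not exist under root."""
    missing: dict[str, list[str]] = {}
    for task_id in task_ids:
        compact = task_id.lower().replace("-", "")
        for command in commands:
            rendered = command.format(task_id=task_id, task_id_compact=compact)
            for rel in TEST_PATH.findall(rendered):
                if not (root / rel).is_file():
                    missing.setdefault(task_id, []).append(rel)
    return missing


if __name__ == "__main__":
    print(missing_task_tests(Path("."), ["TASK-001"], ["python -m pytest -q tests/test_{task_id_compact}.py"]))
```

An empty result means every configured task-specific test file is present; anything else can be reported before agents are invoked.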
 ## P2: Add Run Comparison
 New idea.
 Useful once comparing 14B vs 30B:
 ```powershell
 nightshift compare-runs --latest 5
 ```
 Show:
 - model
 - task
 - retries
 - failure stage
 - final reason
 - runtime
 - token estimate
 This should come after `integ-test` and `integ-report`.
--- a/examples/tutorial/03-pastebin/README.md
+++ b/examples/tutorial/03-pastebin/README.md
@@ -1,4 +1,4 @@
-# Tutorial 03: Pastebin With Model Fallback And Telemetry
+# Tutorial 03: Pastebin With Fixed Tests And Telemetry
 This tutorial uses the `tutorial-pastebin` template: a small Flask snippet-hosting service designed for deterministic NightShift orchestration tests.
@@ -19,6 +19,12 @@ For an isolated local integration run, use the integration sandbox command from
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```
 To create, set up, validate, and run one task in a single command:
 ```bash
 python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
 ```
 To create the sandbox and set up the Python project immediately:
 ```bash
@@ -57,7 +63,7 @@ pyproject.toml
 README.md
 ```
-The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed `TASK-001` tests. The default tutorial pipeline asks the implementation agent to make those deterministic tests pass before review.
+The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed tests for each tutorial task. The default tutorial pipeline asks the implementation agent to make only the current task's deterministic tests pass before review.
 ## Prerequisites
@@ -73,26 +79,22 @@ Install target dependencies:
 python -m pip install -e . pytest flask
 ```
-Install and start Ollama, then pull the fallback models you want available:
+Install and start Ollama, then pull the default pastebin model:
 ```bash
-ollama pull qwen2.5-coder:14b
+ollama pull qwen3-coder:30b
-ollama pull carstenuhlig/omnicoder-9b
-ollama pull deepseek-coder-v2:16b
 ollama list
 ```
 NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`.
-## Model Fallback
+## Model
-The implementation stage uses this fallback order:
+The default pastebin pipeline uses one strong local coder model:
-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+- `qwen3-coder:30b`
-NightShift records which agent/model handled each stage in `telemetry-summary.md`.
+NightShift records which agent/model handled each stage in `telemetry-summary.md`. Multi-candidate fallback belongs in a separate experiment template, not the default pastebin reliability harness.
 ## TDD Pipeline
--- a/examples/tutorial/03-pastebin/nightshift.yaml
+++ b/examples/tutorial/03-pastebin/nightshift.yaml
@@ -20,51 +20,49 @@ safety:
    - curl | bash
 experiment:
-  label: pastebin-model-fallback
+  label: pastebin-qwen3-coder
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1
 agents:
  planner:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.2
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/planner.md
-  implementer_qwen:
+  implementer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/implementer.md
  test_writer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/test-writer.md
-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
  debugger:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    role: debugger
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/debugger.md
  reviewer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/reviewer.md
 pipeline:
@@ -87,10 +85,7 @@ pipeline:
    - id: implement
      type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
      output: proposed.patch
    - id: normalize
--- a/nightshift/agents.py
+++ b/nightshift/agents.py
@@ -228,8 +228,9 @@ class AgentExecutor:
            "prompt": prompt,
            "stream": False,
        }
-        if agent.temperature is not None:
-            body["options"] = {"temperature": agent.temperature}
+        options = _ollama_options(agent)
+        if options:
+            body["options"] = options
        headers = {"Content-Type": "application/json"}
        started = time.monotonic()
        self.logger.event(
@@ -395,6 +396,21 @@ def build_prompt_bundle(
    )
 def _ollama_options(agent: AgentConfig) -> dict[str, object]:
    options: dict[str, object] = {}
    if agent.temperature is not None:
        options["temperature"] = agent.temperature
    if agent.num_ctx is not None:
        options["num_ctx"] = agent.num_ctx
    if agent.num_predict is not None:
        options["num_predict"] = agent.num_predict
    if agent.seed is not None:
        options["seed"] = agent.seed
    if agent.stop:
        options["stop"] = list(agent.stop)
    return options
 def _coerce_output(value: str | bytes | None) -> str:
    if value is None:
        return ""
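
As a minimal sketch of how the new options helper shapes the request body, the snippet below mirrors `_ollama_options` against a stand-in config class. `FakeAgent` and `build_body` are hypothetical names; `FakeAgent` carries only the five fields the helper reads, and the `model` key in the body is an assumption about the surrounding code, which this hunk does not show.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FakeAgent:
    """Hypothetical stand-in for AgentConfig with only the Ollama option fields."""

    temperature: float | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


def ollama_options(agent: FakeAgent) -> dict[str, object]:
    # Same rules as the helper in the diff: unset fields are omitted entirely.
    options: dict[str, object] = {}
    if agent.temperature is not None:
        options["temperature"] = agent.temperature
    if agent.num_ctx is not None:
        options["num_ctx"] = agent.num_ctx
    if agent.num_predict is not None:
        options["num_predict"] = agent.num_predict
    if agent.seed is not None:
        options["seed"] = agent.seed
    if agent.stop:
        options["stop"] = list(agent.stop)
    return options


def build_body(model: str, prompt: str, agent: FakeAgent) -> dict[str, object]:
    # "options" is attached only when at least one field is configured.
    body: dict[str, object] = {"model": model, "prompt": prompt, "stream": False}
    options = ollama_options(agent)
    if options:
        body["options"] = options
    return body


if __name__ == "__main__":
    print(build_body("qwen3-coder:30b", "hello", FakeAgent(temperature=0.1, num_ctx=8192)))
```

Because the numeric fields are compared with `is not None`, a configured `seed: 0` or `temperature: 0` is still sent, while an empty `stop` tuple is dropped.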
--- a/nightshift/cli.py
+++ b/nightshift/cli.py
@@ -7,13 +7,16 @@ from pathlib import Path
 import sys
 from .config import validate_config
-from .errors import NightShiftError
+from .errors import ConfigError, NightShiftError
 from .init import available_templates, init_project
 from .integ import create_integration_run
+from .integ_report import build_integration_report, format_integration_report
 from .integ_setup import format_setup_result, setup_python_project
+from .integ_test import format_integration_test_result, run_integration_test
 from .pipeline import PipelineRunner
 from .runlog import RunLogger
 from .status import build_status, format_status
+from .task_tests import check_task_test_files, format_task_test_checks, missing_task_test_paths
 from .terminal import HOTDOG_ANIMATIONS, TerminalAnimation, format_banner, style_text
 from .tasks import (
    ensure_dependencies_satisfied,
@@ -105,6 +108,33 @@ def build_parser() -> argparse.ArgumentParser:
        help="Print --setup commands without running them.",
    )
    integ_test_parser = subparsers.add_parser(
        "integ-test",
        help="Create, set up, validate, and run an integration template task.",
    )
    integ_test_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is created.")
    integ_test_parser.add_argument(
        "--template",
        default="tutorial-pastebin",
        choices=available_templates(),
        help="Template to initialize inside the sandbox.",
    )
    integ_test_parser.add_argument("--task", help="Specific task id to run.")
    integ_test_parser.add_argument("--all", action="store_true", help="Run all runnable incomplete tasks.")
    integ_test_parser.add_argument("--keep", type=int, help="Keep only the newest N old integration runs before creating a new one.")
    integ_test_parser.add_argument(
        "--setup-extra",
        action="append",
        default=["pytest"],
        help="Extra package to install during setup. May be repeated. Defaults to pytest.",
    )
    integ_test_parser.add_argument("--setup-skip-validate", action="store_true", help="Skip validation during setup.")
    integ_test_parser.add_argument("--dry-run", action="store_true", help="Print commands without running setup or tasks.")
    integ_report_parser = subparsers.add_parser("integ-report", help="Summarize the latest integration run.")
    integ_report_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is located.")
    integ_report_parser.add_argument("--latest", action="store_true", help="Report the latest integration run.")
    setup_parser = subparsers.add_parser(
        "integ-setup",
        help="Set up a Python integration project venv and dependencies.",
@@ -160,12 +190,18 @@ def main(argv: list[str] | None = None) -> int:
            config = validate_config(args.config)
            tasks = parse_task_file(config.project.root, config.project.task_file)
            validate_task_dependencies(tasks)
            task_test_checks = check_task_test_files(config, tasks)
            missing_task_tests = missing_task_test_paths(task_test_checks)
            if missing_task_tests:
                details = format_task_test_checks(task_test_checks)
                raise ConfigError(f"Config error: missing configured task test files.\n{details}")
            incomplete = sum(1 for task in tasks if not task.completed)
            print(f"Config valid: {config.path}")
            print(f"Project: {config.project.name}")
            print(f"Stages: {len(config.pipeline.stages)}")
            print(f"Tasks: {len(tasks)}")
            print(f"Incomplete tasks: {incomplete}")
            print(format_task_test_checks(task_test_checks))
            return 0
        if args.command == "run":
@@ -256,6 +292,25 @@ def main(argv: list[str] | None = None) -> int:
            print(format_setup_result(result))
            return 0
        if args.command == "integ-test":
            result = run_integration_test(
                args.root,
                template=args.template,
                task=args.task,
                all_tasks=args.all,
                keep=args.keep,
                setup_extras=tuple(args.setup_extra or ()),
                skip_setup_validate=args.setup_skip_validate,
                dry_run=args.dry_run,
            )
            print(format_integration_test_result(result))
            return result.exit_code
        if args.command == "integ-report":
            report = build_integration_report(args.root, latest=True)
            print(format_integration_report(report))
            return 0
    except NightShiftError as exc:
        print(str(exc), file=sys.stderr)
        return 1
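
One behavior of the `--setup-extra` option above is worth illustrating: with argparse's `append` action and a non-empty default, values given on the command line are appended after the default instead of replacing it. The sketch below reproduces only that one argument on a throwaway parser; the program name `integ-test-sketch` and the `parse_extras` helper are made up for the example.

```python
from __future__ import annotations

import argparse

# Throwaway parser with only the argument under discussion.
parser = argparse.ArgumentParser(prog="integ-test-sketch")
parser.add_argument(
    "--setup-extra",
    action="append",
    default=["pytest"],
    help="Extra package to install during setup. May be repeated.",
)


def parse_extras(argv: list[str]) -> tuple[str, ...]:
    # Mirrors the tuple(args.setup_extra or ()) conversion used in main().
    args = parser.parse_args(argv)
    return tuple(args.setup_extra or ())


if __name__ == "__main__":
    print(parse_extras([]))
    print(parse_extras(["--setup-extra", "flask"]))
```

With this shape, passing `--setup-extra flask` yields both `pytest` and `flask`, so `pytest` cannot be dropped from the command line alone.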
--- a/nightshift/commands.py
+++ b/nightshift/commands.py
@@ -5,6 +5,7 @@ from __future__ import annotations
 from dataclasses import dataclass
 import os
 from pathlib import Path
 import re
 import shlex
 import subprocess
 import sys
@@ -68,11 +69,16 @@ class CommandExecutor:
                command_index=index,
                command=command,
            )
+            rendered_command = render_command_template(command, task_id)
+            rendered_allowed_commands = tuple(
+                render_command_template(allowed, task_id) for allowed in self.safety.allowed_commands
+            )
             run = self.run_command(
-                command,
+                rendered_command,
                 shell=stage.shell,
                 timeout_seconds=stage.timeout_seconds,
                 working_dir=stage.working_dir,
+                allowed_commands=rendered_allowed_commands,
            )
            runs.append(run)
            self.logger.event(
@@ -120,11 +126,12 @@ class CommandExecutor:
        shell: bool = True,
        timeout_seconds: int | None = None,
        working_dir: Path | None = None,
+        allowed_commands: tuple[str, ...] | None = None,
    ) -> CommandRun:
        try:
            normalized = ensure_command_allowed(
                command,
-                self.safety.allowed_commands,
+                allowed_commands if allowed_commands is not None else self.safety.allowed_commands,
                self.safety.forbidden_commands,
            )
        except SafetyError as exc:
@@ -210,6 +217,27 @@ def format_command_runs(stage_id: str, runs: list[CommandRun]) -> str:
    return "\n".join(lines)
 def render_command_template(command: str, task_id: str) -> str:
    task_id_lower = task_id.lower()
    task_id_slug = task_id_lower.replace("-", "_")
    task_id_compact = task_id_lower.replace("-", "")
    return command.format(
        task_id=task_id,
        task_id_lower=task_id_lower,
        task_id_slug=task_id_slug,
        task_id_compact=task_id_compact,
    )
 def extract_test_file_paths(command: str) -> tuple[str, ...]:
    paths: list[str] = []
    for match in re.finditer(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)", command):
        path = match.group(1).replace("\\", "/")
        if path not in paths:
            paths.append(path)
    return tuple(paths)
 def _coerce_output(value: str | bytes | None) -> str:
    if value is None:
        return ""
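Review note: a standalone sketch of how the two helpers added to `commands.py` compose. Both functions are copied from this diff so the snippet runs without NightShift installed; the sample command string is illustrative and not taken from a real config.

```python
import re


def render_command_template(command: str, task_id: str) -> str:
    # Same substitution variables as the helper in this diff.
    task_id_lower = task_id.lower()
    return command.format(
        task_id=task_id,
        task_id_lower=task_id_lower,
        task_id_slug=task_id_lower.replace("-", "_"),
        task_id_compact=task_id_lower.replace("-", ""),
    )


def extract_test_file_paths(command: str) -> tuple[str, ...]:
    # Collect unique tests/*.py paths, normalizing backslashes to slashes.
    paths: list[str] = []
    for match in re.finditer(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)", command):
        path = match.group(1).replace("\\", "/")
        if path not in paths:
            paths.append(path)
    return tuple(paths)


# Illustrative command; real commands come from nightshift.yaml stages.
rendered = render_command_template(
    "python -m pytest -q tests/test_{task_id_compact}.py", "TASK-001"
)
print(rendered)
print(extract_test_file_paths(rendered))
```

Because rendering goes through `str.format`, a configured command that contains literal braces would presumably need them doubled (`{{` and `}}`). That follows from `str.format` itself rather than from anything stated in this commit.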
--- a/nightshift/config.py
+++ b/nightshift/config.py
@ -46,6 +46,10 @@ class AgentConfig:
    temperature: float | None = None
    base_url: str | None = None
    api_key_env: str | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()
@dataclass(frozen=True)
@ -207,10 +211,18 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
            agent_raw.get("temperature"),
            f"agents.{agent_id}.temperature",
        )
        num_ctx = _optional_int_or_none(agent_raw.get("num_ctx"), f"agents.{agent_id}.num_ctx")
        num_predict = _optional_int_or_none(agent_raw.get("num_predict"), f"agents.{agent_id}.num_predict")
        seed = _optional_int_or_none(agent_raw.get("seed"), f"agents.{agent_id}.seed")
        stop = _string_tuple(agent_raw.get("stop", []), f"agents.{agent_id}.stop")
        if temperature is not None and temperature < 0:
            raise ConfigError(
                f"Config error: agents.{agent_id}.temperature must be zero or greater."
            )
        if num_ctx is not None and num_ctx <= 0:
            raise ConfigError(f"Config error: agents.{agent_id}.num_ctx must be greater than zero.")
        if num_predict is not None and num_predict <= 0:
            raise ConfigError(f"Config error: agents.{agent_id}.num_predict must be greater than zero.")
        if backend not in {"command", "ollama", "openai_compatible"}:
            raise ConfigError(
                f"Config error: agent '{agent_id}' uses unsupported backend '{backend}'. "
@ -243,6 +255,10 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
            temperature=temperature,
            base_url=base_url,
            api_key_env=api_key_env,
            num_ctx=num_ctx,
            num_predict=num_predict,
            seed=seed,
            stop=stop,
        )
    experiment_raw = raw.get("experiment", {})
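Review note: this diff only shows the new fields being parsed and validated, not how they reach the backend. The sketch below is one plausible mapping onto the `options` object that Ollama's generate and chat endpoints accept. `AgentOptions` and `build_ollama_options` are hypothetical names introduced for illustration; they are not NightShift code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AgentOptions:
    # Hypothetical stand-in for the AgentConfig fields touched in this diff.
    temperature: float | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


def build_ollama_options(agent: AgentOptions) -> dict[str, Any]:
    """Return only the options that were explicitly configured."""
    options: dict[str, Any] = {}
    for name in ("temperature", "num_ctx", "num_predict", "seed"):
        value = getattr(agent, name)
        if value is not None:
            options[name] = value
    if agent.stop:
        # JSON has no tuple type, so send stop sequences as a list.
        options["stop"] = list(agent.stop)
    return options


print(build_ollama_options(AgentOptions(temperature=0.1, num_ctx=8192, num_predict=4096)))
```

Omitting unset fields, rather than sending `null`, lets the server fall back to its own defaults, which seems to match the "optional" wording in the commit message.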
--- a/nightshift/integ_report.py
+++ b/nightshift/integ_report.py
@ -0,0 +1,71 @@
 """Summarize integration run artifacts."""
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 import re
 from .errors import NightShiftError
@dataclass(frozen=True)
 class IntegrationReport:
    integration_run: Path
    nightshift_run: Path | None
    lines: tuple[str, ...]
 def build_integration_report(root: str | Path = ".", *, latest: bool = True) -> IntegrationReport:
    base = Path(root).resolve() / "integ_runs"
    if not base.exists():
        raise NightShiftError(f"Integration report error: no integ_runs directory found: {base}")
    runs = sorted((path for path in base.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
    if not runs:
        raise NightShiftError(f"Integration report error: no integration runs found under: {base}")
    integration_run = runs[0]  # both branches picked runs[0]; `latest` is currently the only supported mode
    artifacts_root = integration_run / "project" / ".nightshift" / "runs"
    if not artifacts_root.exists():
        return IntegrationReport(
            integration_run,
            None,
            ("No NightShift run artifacts found. Setup may have failed before task execution.",),
        )
    nightshift_runs = sorted((path for path in artifacts_root.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
    if not nightshift_runs:
        return IntegrationReport(integration_run, None, ("No NightShift run directories found.",))
    nightshift_run = nightshift_runs[0]
    summaries = sorted(nightshift_run.glob("tasks/*/run-summary.md"))
    if not summaries and (nightshift_run / "run-summary.md").exists():
        summaries = [nightshift_run / "run-summary.md"]
    lines = [_summarize_run_summary(path, integration_run) for path in summaries]
    return IntegrationReport(integration_run, nightshift_run, tuple(lines or ("No task summaries found.",)))
 def format_integration_report(report: IntegrationReport) -> str:
    lines = [f"Integration run: {report.integration_run}"]
    if report.nightshift_run is not None:
        lines.append(f"NightShift run: {report.nightshift_run}")
    lines.append("")
    lines.extend(f"- {line}" for line in report.lines)
    return "\n".join(lines)
 def _summarize_run_summary(path: Path, integration_run: Path) -> str:
    text = path.read_text(encoding="utf-8", errors="replace")
    task = _field(text, "Task") or path.parent.name
    status = _field(text, "Status") or "unknown"
    retries = _field(text, "Retry count") or "unknown"
    reason = _field(text, "Reason") or "no reason recorded"
    try:
        relative = path.relative_to(integration_run)
    except ValueError:
        relative = path
    return f"{task} {status} after {retries} retries. Reason: {reason}. Artifacts: {relative.parent}"
 def _field(text: str, name: str) -> str | None:
    match = re.search(rf"^- {re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE)
    if not match:
        return None
    return match.group(1).strip()
--- a/nightshift/integ_test.py
+++ b/nightshift/integ_test.py
@ -0,0 +1,71 @@
 """End-to-end integration test wrapper."""
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 import subprocess
 from .errors import NightShiftError
 from .integ import IntegrationRun, create_integration_run
 from .integ_setup import IntegrationSetupResult, setup_python_project
@dataclass(frozen=True)
 class IntegrationTestResult:
    run: IntegrationRun
    setup: IntegrationSetupResult
    command: tuple[str, ...]
    exit_code: int
    dry_run: bool
 def run_integration_test(
    root: str | Path = ".",
    *,
    template: str = "tutorial-pastebin",
    task: str | None = None,
    all_tasks: bool = False,
    keep: int | None = None,
    setup_extras: tuple[str, ...] = ("pytest",),
    skip_setup_validate: bool = False,
    dry_run: bool = False,
 ) -> IntegrationTestResult:
    if task and all_tasks:
        raise NightShiftError("Integration test error: use either --task or --all, not both.")
    if not task and not all_tasks:
        raise NightShiftError("Integration test error: provide --task or --all.")
    run = create_integration_run(Path(root), template=template, keep=keep)
    project = run.directory / "project"
    setup = setup_python_project(
        project,
        extras=setup_extras,
        validate=not skip_setup_validate,
        dry_run=dry_run,
    )
    command = [str(setup.python), "-m", "nightshift.cli", "run", "--no-animation"]
    if all_tasks:
        command.append("--all")
    else:
        command.extend(["--task", task or ""])
    exit_code = 0
    if not dry_run:
        completed = subprocess.run(command, cwd=project, text=True, encoding="utf-8", errors="replace")
        exit_code = completed.returncode
    return IntegrationTestResult(run, setup, tuple(command), exit_code, dry_run)
 def format_integration_test_result(result: IntegrationTestResult) -> str:
    lines = [
        f"Integration run: {result.run.directory}",
        f"Project: {result.run.directory / 'project'}",
        f"Venv: {result.run.venv_dir}",
        f"Run command: {' '.join(result.command)}",
        f"Exit code: {result.exit_code}",
        f"Artifacts: {result.run.directory / 'project' / '.nightshift'}",
    ]
    if result.dry_run:
        lines.insert(3, "Dry run: true")
    return "\n".join(lines)
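Review note: the argument validation and command assembly in `run_integration_test` can be exercised in isolation. The sketch below pulls that logic into a hypothetical `build_run_command` helper and raises `ValueError` in place of `NightShiftError`, so it runs without the package; neither substitution is part of this commit.

```python
from __future__ import annotations


def build_run_command(
    python: str, *, task: str | None = None, all_tasks: bool = False
) -> tuple[str, ...]:
    # Mirror the mutual-exclusion checks from run_integration_test.
    if task and all_tasks:
        raise ValueError("use either --task or --all, not both.")
    if not task and not all_tasks:
        raise ValueError("provide --task or --all.")
    command = [python, "-m", "nightshift.cli", "run", "--no-animation"]
    if all_tasks:
        command.append("--all")
    else:
        command.extend(["--task", task or ""])
    return tuple(command)


# "python" stands in for the venv interpreter path that setup returns.
print(" ".join(build_run_command("python", task="TASK-001")))
print(" ".join(build_run_command("python", all_tasks=True)))
```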
--- a/nightshift/pipeline.py
+++ b/nightshift/pipeline.py
@ -9,7 +9,7 @@ import subprocess
 from .agents import AgentExecutor
 from .artifacts import ArtifactStore
-from .commands import CommandExecutor
+from .commands import CommandExecutor, extract_test_file_paths, render_command_template
 from .config import COMMAND_STAGE_TYPES, NightShiftConfig, StageConfig
 from .context import ContextManager
 from .dependencies import diagnose_python_dependencies, format_dependency_diagnostic
@ -145,6 +145,12 @@ class PipelineRunner:
        index = 0
        final_status = "complete"
        final_reason = "Pipeline completed."
        preflight_result = self._preflight_task(task, stages)
        if preflight_result:
            stage_results.append(preflight_result)
            final_status = "failed"
            final_reason = preflight_result.reason
            index = len(stages)
        while index < len(stages):
            stage = stages[index]
@ -248,6 +254,13 @@ class PipelineRunner:
                    "retry-memory.md",
                    summarize_retry_memory(tuple(retry_memory)),
                )
                if _repeated_protected_path_violation(tuple(retry_memory)):
                    final_status = "failed"
                    final_reason = (
                        "Escalation policy stopped retries: implementation repeatedly "
                        "attempted to modify paths outside the stage allowlist."
                    )
                    break
                decision = evaluate_retry_churn(
                    tuple(retry_memory),
                    retry_budget=self.config.pipeline.max_task_retries + 1,
@ -334,6 +347,45 @@ class PipelineRunner:
            reason=final_reason,
        )
    def _preflight_task(self, task: Task, stages: list[StageConfig]) -> StageResult | None:
        missing_paths: list[str] = []
        for stage in stages:
            if stage.type not in COMMAND_STAGE_TYPES:
                continue
            for command in stage.commands:
                rendered = render_command_template(command, task.id)
                for path_text in extract_test_file_paths(rendered):
                    if not (self.config.project.root / path_text).exists():
                        missing_paths.append(path_text)
        if not missing_paths:
            return None
        unique_paths = tuple(dict.fromkeys(missing_paths))
        details = "\n".join(f"- `{path}`" for path in unique_paths)
        output_path = self.artifacts.write_stage_output(
            task.id,
            "preflight.md",
            "\n".join(
                [
                    "# Task Preflight",
                    "",
                    "Status: fail",
                    "Reason: configured task test file is missing.",
                    "",
                    "## Missing Files",
                    "",
                    details,
                    "",
                ]
            ),
        )
        return StageResult(
            "preflight",
            "fail",
            "Task preflight failed: configured task test file is missing: "
            + ", ".join(unique_paths),
            output_path=str(output_path.relative_to(self.config.project.root)),
        )
    def run_tasks(self, tasks: list[Task] | tuple[Task, ...]) -> MultiTaskResult:
        self.artifacts.initialize_run()
        self.logger.bind(self.artifacts)
@ -1428,6 +1480,18 @@ def _extract_exit_code(text: str) -> int | None:
        return None
 def _repeated_protected_path_violation(entries: tuple[RetryMemoryEntry, ...]) -> bool:
    recent = entries[-2:]
    if len(recent) < 2:
        return False
    return all(_is_protected_path_violation(entry.cause) for entry in recent)
 def _is_protected_path_violation(text: str) -> bool:
    lowered = text.lower()
    return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/")
 def format_aggregate_run_summary(results: list[PipelineResult], status: str, reason: str) -> str:
    lines = [
        "# Run Summary",
--- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md
+++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md
@ -1,9 +1,11 @@
 You are the debugger agent for the NightShift pastebin tutorial.
 Diagnose failed attempts without editing files.
-Distinguish inaccurate generated tests from implementation bugs.
+Distinguish fixed-test/template problems from implementation bugs.
-If tests are inaccurate for the current task, recommend retrying `write_tests`.
+This tutorial uses fixed task tests and task-specific pytest commands. Do not recommend `write_tests` unless the configured pipeline actually has a `write_tests` stage.
 If a current task appears to lack tests, report a template or test-selection problem.
 If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified.
 Implementation agents must not edit files under `tests/`.
 Return:
 - concise diagnosis
 - recommended next action
--- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md
+++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md
@ -7,8 +7,10 @@ Do not add behavior for future tasks unless needed to satisfy the current tests.
 Use Flask and `sqlite3` from the Python standard library. Do not use SQLAlchemy, Flask-SQLAlchemy, or undeclared dependencies.
 Keep the public package name `pastebin_app`.
 Keep the public app entry point `create_app(database_path: str | None = None)`.
 Respect `database_path`; do not hard-code `snippets.db` when a database path is supplied.
 Tests should interact through HTTP routes and `create_app`, not through ORM/session globals.
 Do not use `app.before_first_request`; recent Flask versions removed it. Initialize required database tables inside `create_app` or inside the route helper before use.
 When adding columns to an existing sqlite table, handle existing databases idempotently with `ALTER TABLE` checks or a simple migration helper. `CREATE TABLE IF NOT EXISTS` does not add columns to an existing table.
 Output only complete file content blocks.
 Use one fenced block per file:
--- a/nightshift/project_templates/tutorial-pastebin/README.md
+++ b/nightshift/project_templates/tutorial-pastebin/README.md
@ -14,6 +14,12 @@ Or create an isolated integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```
 To create, set up, validate, and run one task in a single command:
 ```bash
 python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
 ```
 To create the sandbox and set it up in one step:
 ```bash
@ -48,12 +54,8 @@ nightshift what-happened
 When running from an integration sandbox, the same commands are run inside `integ_runs/<timestamp>/project`.
-The pipeline uses model fallback ordering for implementation attempts:
-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+The default pastebin pipeline uses `qwen3-coder:30b` for planning, implementation, debugging, test review, and final review. It intentionally does not use multi-candidate fallback; pastebin is the deterministic reliability harness.
 Telemetry artifacts record which agent/model handled each stage and estimate token usage.
-This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them.
+This template uses fixed task-specific pytest files. The pipeline starts with a skeletal package, implements only the current task, runs `tests/test_{task_id_compact}.py`, and then reviews the result.
--- a/nightshift/project_templates/tutorial-pastebin/nightshift.yaml
+++ b/nightshift/project_templates/tutorial-pastebin/nightshift.yaml
@ -20,51 +20,49 @@ safety:
    - curl | bash
 experiment:
-  label: pastebin-model-fallback
+  label: pastebin-qwen3-coder
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1
 agents:
  planner:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.2
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/planner.md
-  implementer_qwen:
+  implementer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/implementer.md
  test_writer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/test-writer.md
-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
  debugger:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    role: debugger
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/debugger.md
  reviewer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
    num_ctx: 8192
    num_predict: 4096
    system_prompt: .nightshift/agents/reviewer.md
 pipeline:
@ -87,10 +85,7 @@ pipeline:
    - id: implement
      type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
      output: proposed.patch
    - id: normalize
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py
@ -16,6 +16,7 @@ def test_create_snippet_returns_created_snippet_id(tmp_path):
    assert response.status_code == 201
    data = response.get_json()
    assert isinstance(data["id"], int)
    assert (tmp_path / "snippets.db").exists()
 def test_view_snippet_returns_persisted_fields(tmp_path):
@ -38,6 +39,7 @@ def test_view_snippet_returns_persisted_fields(tmp_path):
        "title": "View me",
        "body": "stored body",
    }
    assert (tmp_path / "snippets.db").exists()
 def test_view_missing_snippet_returns_404(tmp_path):
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py
@ -0,0 +1,50 @@
 from pastebin_app.app import create_app
 def test_create_snippet_accepts_optional_metadata(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    response = client.post(
        "/snippets",
        json={
            "title": "Tagged",
            "body": "metadata body",
            "language": "python",
            "tags": ["alpha", "beta"],
            "expires_at": "2030-01-01T00:00:00",
        },
    )
    assert response.status_code == 201
    assert isinstance(response.get_json()["id"], int)
    assert (tmp_path / "snippets.db").exists()
 def test_view_snippet_returns_optional_metadata(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    created = client.post(
        "/snippets",
        json={
            "title": "Tagged",
            "body": "metadata body",
            "language": "python",
            "tags": ["alpha", "beta"],
            "expires_at": "2030-01-01T00:00:00",
        },
    ).get_json()
    response = client.get(f"/snippets/{created['id']}")
    assert response.status_code == 200
    assert response.get_json() == {
        "id": created["id"],
        "title": "Tagged",
        "body": "metadata body",
        "language": "python",
        "tags": ["alpha", "beta"],
        "expires_at": "2030-01-01T00:00:00",
    }
    assert (tmp_path / "snippets.db").exists()
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py
@ -0,0 +1,47 @@
 from pastebin_app.app import create_app
 def _create(client, title, body, **metadata):
    response = client.post("/snippets", json={"title": title, "body": body, **metadata})
    assert response.status_code == 201
    return response.get_json()["id"]
 def test_list_snippets_newest_first(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    first_id = _create(client, "First", "older")
    second_id = _create(client, "Second", "newer")
    response = client.get("/snippets")
    assert response.status_code == 200
    ids = [snippet["id"] for snippet in response.get_json()]
    assert ids[:2] == [second_id, first_id]
 def test_search_filters_by_title_or_body(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    _create(client, "Python note", "ordinary body")
    _create(client, "Other", "contains needle")
    response = client.get("/snippets?q=python")
    assert [snippet["title"] for snippet in response.get_json()] == ["Python note"]
    response = client.get("/snippets?q=needle")
    assert [snippet["title"] for snippet in response.get_json()] == ["Other"]
 def test_language_and_tag_filters(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    _create(client, "Python", "body", language="python", tags=["code", "demo"])
    _create(client, "Text", "body", language="text", tags=["notes"])
    response = client.get("/snippets?language=python")
    assert [snippet["title"] for snippet in response.get_json()] == ["Python"]
    response = client.get("/snippets?tag=notes")
    assert [snippet["title"] for snippet in response.get_json()] == ["Text"]
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py
@ -0,0 +1,43 @@
 from pastebin_app.app import create_app
 def test_expired_snippets_are_excluded_from_listing(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    client.post(
        "/snippets",
        json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
    )
    active = client.post(
        "/snippets",
        json={"title": "Active", "body": "new", "expires_at": "2999-01-01T00:00:00"},
    ).get_json()
    response = client.get("/snippets")
    assert response.status_code == 200
    assert [snippet["id"] for snippet in response.get_json()] == [active["id"]]
 def test_direct_lookup_of_expired_snippet_returns_410(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    expired = client.post(
        "/snippets",
        json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
    ).get_json()
    response = client.get(f"/snippets/{expired['id']}")
    assert response.status_code == 410
 def test_non_expiring_snippet_remains_visible(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    created = client.post("/snippets", json={"title": "Forever", "body": "body"}).get_json()
    response = client.get(f"/snippets/{created['id']}")
    assert response.status_code == 200
    assert response.get_json()["title"] == "Forever"
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py
@ -0,0 +1,46 @@
 from pastebin_app.app import create_app
 def test_root_shows_snippet_list_html(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    client.post("/snippets", json={"title": "Visible", "body": "body"})
    response = client.get("/")
    assert response.status_code == 200
    assert "Visible" in response.get_data(as_text=True)
 def test_new_snippet_form_loads(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    response = client.get("/new")
    assert response.status_code == 200
    html = response.get_data(as_text=True)
    assert 'name="title"' in html
    assert 'name="body"' in html
    assert 'name="language"' in html
    assert 'name="tags"' in html
    assert 'name="expires_at"' in html
 def test_form_post_redirects_to_snippet_view(tmp_path):
    app = create_app(database_path=str(tmp_path / "snippets.db"))
    client = app.test_client()
    response = client.post(
        "/new",
        data={
            "title": "Form title",
            "body": "Form body",
            "language": "text",
            "tags": "forms,html",
            "expires_at": "",
        },
    )
    assert response.status_code == 302
    assert response.headers["Location"].endswith("/snippets/1")
--- a/nightshift/task_tests.py
+++ b/nightshift/task_tests.py
@ -0,0 +1,48 @@
 """Task-specific test file validation."""
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 from .commands import extract_test_file_paths, render_command_template
 from .config import COMMAND_STAGE_TYPES, NightShiftConfig
 from .tasks import Task
@dataclass(frozen=True)
 class TaskTestCheck:
    task_id: str
    path: str
    exists: bool
 def check_task_test_files(config: NightShiftConfig, tasks: tuple[Task, ...] | list[Task]) -> tuple[TaskTestCheck, ...]:
    checks: list[TaskTestCheck] = []
    for task in tasks:
        seen: set[str] = set()
        for stage in config.pipeline.stages:
            if stage.type not in COMMAND_STAGE_TYPES:
                continue
            for command in stage.commands:
                rendered = render_command_template(command, task.id)
                for path_text in extract_test_file_paths(rendered):
                    if path_text in seen:
                        continue
                    seen.add(path_text)
                    checks.append(TaskTestCheck(task.id, path_text, (config.project.root / path_text).exists()))
    return tuple(checks)
 def format_task_test_checks(checks: tuple[TaskTestCheck, ...]) -> str:
    if not checks:
        return "Task test files: no task-specific test paths detected."
    lines = ["Task test files:"]
    for check in checks:
        status = "ok" if check.exists else "missing"
        lines.append(f"- {check.task_id}: {check.path} ({status})")
    return "\n".join(lines)
 def missing_task_test_paths(checks: tuple[TaskTestCheck, ...]) -> tuple[Path, ...]:
    return tuple(Path(check.path) for check in checks if not check.exists)
--- a/tests/test_commands.py
+++ b/tests/test_commands.py
@ -6,6 +6,7 @@ from nightshift.artifacts import ArtifactStore
 from nightshift.commands import CommandExecutor
 from nightshift.commands import CommandRun, format_command_runs
 from nightshift.commands import _command_env
 from nightshift.commands import render_command_template
 from nightshift.config import SafetyConfig, StageConfig
 from nightshift.errors import CommandError
 import sys
@ -16,6 +17,13 @@ FAILING_COMMAND = 'python -c "import sys; print(\'bad\'); sys.exit(7)"'
 class CommandExecutorTests(unittest.TestCase):
    def test_render_command_template_includes_task_id_variants(self) -> None:
        command = "python -m pytest -q tests/test_{task_id_compact}.py # {task_id_slug} {task_id}"
        rendered = render_command_template(command, "TASK-001")
        self.assertEqual(rendered, "python -m pytest -q tests/test_task001.py # task_001 TASK-001")
    def test_passing_command_stage_returns_pass_and_writes_output(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
@ -46,6 +54,33 @@ class CommandExecutorTests(unittest.TestCase):
            self.assertIn("Exit code: 0", output)
            self.assertIn("ok", output)
    def test_command_stage_renders_task_id_before_allowlist_check(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
            artifacts = ArtifactStore(root, ".nightshift", run_id="test-run")
            executor = CommandExecutor(
                root,
                SafetyConfig(
                    require_clean_worktree=False,
                    scoped_paths=(".",),
                    allowed_commands=('python -c "print(\'{task_id_compact}\')"',),
                    forbidden_commands=("rm -rf",),
                ),
                artifacts,
            )
            stage = StageConfig(
                id="test",
                type="command",
                commands=('python -c "print(\'{task_id_compact}\')"',),
                output="test-output.txt",
            )
            result = executor.run_stage(stage, "TASK-002")
            self.assertEqual(result.status, "pass")
            output = (root / result.output_path).read_text(encoding="utf-8")
            self.assertIn("task002", output)
    def test_failing_command_stage_returns_fail_and_writes_output(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
--- a/tests/test_config.py
+++ b/tests/test_config.py
@ -282,6 +282,27 @@ class ConfigTests(unittest.TestCase):
            self.assertEqual(config.agents["planner"].temperature, 0.2)
    def test_agent_ollama_options_load(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
            init_project(root)
            config_path = root / "nightshift.yaml"
            config_path.write_text(
                config_path.read_text(encoding="utf-8").replace(
                    "    system_prompt: agents/planner.md",
                    "    system_prompt: agents/planner.md\n    num_ctx: 8192\n    num_predict: 4096\n    seed: 1\n    stop:\n      - STOP",
                    1,
                ),
                encoding="utf-8",
            )
            config = load_config(config_path)
            self.assertEqual(config.agents["planner"].num_ctx, 8192)
            self.assertEqual(config.agents["planner"].num_predict, 4096)
            self.assertEqual(config.agents["planner"].seed, 1)
            self.assertEqual(config.agents["planner"].stop, ("STOP",))
    def test_agent_temperature_must_be_number(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
--- a/tests/test_init.py
+++ b/tests/test_init.py
@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase):
        self.assertIn("tutorial-imageboard", available_templates())
        self.assertIn("tutorial-pastebin", available_templates())
-    def test_init_pastebin_template_creates_skeleton_and_model_fallback_config(self) -> None:
+    def test_init_pastebin_template_creates_skeleton_and_qwen3_config(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
@ -78,11 +78,15 @@ class InitProjectTests(unittest.TestCase):
            self.assertIn("type: semantic_context", config)
            self.assertNotIn("id: write_tests", config)
            self.assertNotIn("id: review_tests", config)
-            self.assertIn("python -m pytest -q tests", config)
+            self.assertIn("python -m pytest -q tests/test_{task_id_compact}.py", config)
            self.assertIn("max_task_retries: 6", config)
-            self.assertIn("implementer_qwen", config)
+            self.assertIn("implementer:", config)
-            self.assertIn("carstenuhlig/omnicoder-9b", config)
+            self.assertIn("qwen3-coder:30b", config)
-            self.assertIn("deepseek-coder-v2:16b", config)
+            self.assertIn("num_ctx: 8192", config)
            self.assertIn("num_predict: 4096", config)
            self.assertNotIn("agent_pool:", config)
            self.assertNotIn("carstenuhlig/omnicoder-9b", config)
            self.assertNotIn("deepseek-coder-v2:16b", config)
    def test_pastebin_example_tutorial_docs_exist(self) -> None:
        root = Path(__file__).resolve().parents[1]
--- a/tests/test_integ_test.py
+++ b/tests/test_integ_test.py
@ -0,0 +1,51 @@
 from pathlib import Path
 import tempfile
 import unittest
 from nightshift.integ_report import build_integration_report, format_integration_report
 from nightshift.integ_test import format_integration_test_result, run_integration_test
 class IntegrationTestCommandTests(unittest.TestCase):
    def test_run_integration_test_dry_run_builds_task_command(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            result = run_integration_test(
                directory,
                template="tutorial-pastebin",
                task="TASK-001",
                dry_run=True,
            )
            rendered = format_integration_test_result(result)
            self.assertIn("Dry run: true", rendered)
            self.assertIn("TASK-001", " ".join(result.command))
            self.assertTrue((result.run.directory / "project" / "nightshift.yaml").exists())
    def test_build_integration_report_summarizes_latest_task_summary(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
            summary = root / "integ_runs" / "20260521T000000.000000Z" / "project" / ".nightshift" / "runs" / "run1" / "tasks" / "TASK-001" / "run-summary.md"
            summary.parent.mkdir(parents=True)
            summary.write_text(
                "\n".join(
                    [
                        "# Run Summary",
                        "",
                        "- Task: TASK-001",
                        "- Status: complete",
                        "- Retry count: 1",
                        "- Reason: Done.",
                    ]
                ),
                encoding="utf-8",
            )
            report = build_integration_report(root)
            rendered = format_integration_report(report)
            self.assertIn("TASK-001 complete after 1 retries", rendered)
            self.assertIn("Reason: Done.", rendered)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_pipeline.py
+++ b/tests/test_pipeline.py
@ -105,6 +105,29 @@ class PipelineRunnerTests(unittest.TestCase):
            )
            self.assertIn("Modified Files", (root / ".nightshift" / "runs" / "test-run" / "run-summary.md").read_text(encoding="utf-8"))
    def test_task_preflight_fails_when_task_specific_test_file_is_missing(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
            _write_common_files(root)
            stages = (
                StageConfig(
                    id="test",
                    type="command",
                    commands=("python -m pytest -q tests/test_{task_id_compact}.py",),
                    output="test-output.txt",
                ),
            )
            config = make_config(root, stages, max_retries=0)
            runner = PipelineRunner(config, ArtifactStore(root, ".nightshift", run_id="test-run"))
            task = parse_tasks(TASK_MD)[0]
            result = runner.run_task(task)
            self.assertEqual(result.status, "failed")
            self.assertIn("configured task test file is missing", result.reason)
            task_dir = root / ".nightshift" / "runs" / "test-run" / "tasks" / task.id
            self.assertIn("tests/test_task001.py", (task_dir / "preflight.md").read_text(encoding="utf-8"))
    def test_review_can_retry_implementation_until_limit(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
--- a/tests/test_task_tests.py
+++ b/tests/test_task_tests.py
@ -0,0 +1,77 @@
 from pathlib import Path
 import tempfile
 import unittest
 from nightshift.config import validate_config
 from nightshift.task_tests import check_task_test_files, missing_task_test_paths
 from nightshift.tasks import parse_task_file
 class TaskTestValidationTests(unittest.TestCase):
    def test_check_task_test_files_renders_task_placeholder(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
            (root / "agents").mkdir()
            (root / "agents" / "planner.md").write_text("Prompt", encoding="utf-8")
            (root / "tests").mkdir()
            (root / "tests" / "test_task001.py").write_text("def test_ok():\n    assert True\n", encoding="utf-8")
            (root / "nightshift.yaml").write_text(
                "\n".join(
                    [
                        "project:",
                        "  name: task-test-validation",
                        "  root: .",
                        "  task_file: tasks.md",
                        "  artifact_dir: .nightshift",
                        "",
                        "safety:",
                        "  require_clean_worktree: false",
                        "  scoped_paths:",
                        "    - .",
                        "  allowed_commands:",
                        "    - python -m pytest -q tests/test_{task_id_compact}.py",
                        "  forbidden_commands:",
                        "    - rm -rf",
                        "",
                        "agents:",
                        "  planner:",
                        "    backend: command",
                        "    command: python -c \"print('ok')\"",
                        "    system_prompt: agents/planner.md",
                        "",
                        "pipeline:",
                        "  stages:",
                        "    - id: test",
                        "      type: command",
                        "      commands:",
                        "        - python -m pytest -q tests/test_{task_id_compact}.py",
                    ]
                ),
                encoding="utf-8",
            )
            (root / "tasks.md").write_text(
                """# Tasks
 - [ ] TASK-001: One
 Acceptance Criteria:
 - passes
 - [ ] TASK-002: Two
 Acceptance Criteria:
 - reports missing test
 """,
                encoding="utf-8",
            )
            config = validate_config(root / "nightshift.yaml")
            tasks = parse_task_file(config.project.root, config.project.task_file)
            checks = check_task_test_files(config, tasks)
            self.assertEqual([check.path for check in checks], ["tests/test_task001.py", "tests/test_task002.py"])
            self.assertEqual(tuple(path.as_posix() for path in missing_task_test_paths(checks)), ("tests/test_task002.py",))
 if __name__ == "__main__":
    unittest.main()
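
The tests above rely on the `{task_id_compact}` placeholder turning `TASK-001` into `tests/test_task001.py` before the preflight checks that the file exists. The following is a minimal sketch of that idea, not NightShift's implementation: `compact_task_id`, `expected_test_paths`, and `missing_paths` are hypothetical names, and the compaction rule (lowercase, drop non-alphanumerics) is an assumption inferred only from the expected paths in the assertions.

```python
import re
import tempfile
from pathlib import Path

PLACEHOLDER = "{task_id_compact}"


def compact_task_id(task_id: str) -> str:
    """Assumed rule: lowercase the id and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", task_id.lower())


def expected_test_paths(commands, task_id):
    """Render each command token that contains the placeholder for one task."""
    compact = compact_task_id(task_id)
    paths = []
    for command in commands:
        for token in command.split():
            if PLACEHOLDER in token:
                paths.append(token.replace(PLACEHOLDER, compact))
    return paths


def missing_paths(root: Path, paths):
    """Return the relative paths that do not exist as files under root."""
    return [path for path in paths if not (root / path).is_file()]


if __name__ == "__main__":
    commands = ("python -m pytest -q tests/test_{task_id_compact}.py",)
    with tempfile.TemporaryDirectory() as directory:
        root = Path(directory)
        (root / "tests").mkdir()
        # Only TASK-001 gets a test file, mirroring the fixture in the test above.
        (root / "tests" / "test_task001.py").write_text(
            "def test_ok():\n    assert True\n", encoding="utf-8"
        )
        for task_id in ("TASK-001", "TASK-002"):
            paths = expected_test_paths(commands, task_id)
            print(task_id, "missing:", missing_paths(root, paths))
```

Under these assumptions only `TASK-002` reports a missing file, which matches the shape of the `missing_task_test_paths` assertion. Splitting on whitespace is a simplification; a command with quoted arguments would need `shlex.split` instead.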