Add tutorial integration workflow helpers

- Add `integ-test` to create, set up, validate, and run integration template tasks
- Add `integ-report` to summarize latest integration run artifacts
- Switch default pastebin template from model fallback to single `qwen3-coder:30b`
- Support optional Ollama fields: `num_ctx`, `num_predict`, `seed`, and `stop`
- Add `nightshift validate` preflight for task-specific test files
- Update pastebin docs, config reference, and ideas tracking
- Add tests for integration helpers, task-test validation, config parsing, and template expectations
2026-06-14 10:08:37 +00:00 · 2026-05-21 03:46:27 -07:00 · 2026-05-21 03:46:27 -07:00 · f7fed4535b
commit f7fed4535b
parent e3679296fd
29 changed files with 1251 additions and 280 deletions
--- a/docs/bugfix_todo.md
+++ b/docs/bugfix_todo.md
@ -1,195 +0,0 @@
-# Bugfix TODO
-
-## Some issues going with run --all
- reason=Stage 'review' requested unknown next stage 'None'. Not every time. I think there's a pattern that is out of place here. Maybe it's related to the last task success? Or the last run?
-
-
-
-## Going from individual tasks to --all fails
-
-If you do nightshift run --task TASK-001 and then that completes and then you go to nightshift run --all it fails on blocked by missing dependencies: TASK-001 . I think this is because the tasks get reset at the top of the run, but there is something marking completion of TASK-001 requiring manual reset.
-
-run --all should start at the first not done task (seems like it does)
-
-## Some kind of tool install feature
-
-Continually fails on flask_sqlalchemy until I install that.
-
-## Tutorial need to include . directory for imageboard
-
-## Git status artifacts are noisy for non-git repositories
-
-Observed artifact:
-
-```text
-# Git Status before
-
-Available: false
-Exit code: 128
-
-fatal: not a git repository (or any of the parent directories): .git
-```
-
-Current behavior:
-
- NightShift continues when `require_clean_worktree: false`.
- `git-status-before.txt`, `git-status-after.txt`, and `diff.patch` may contain git errors.
- This is technically safe, but confusing for users running quickstart/demo projects outside git.
-
-Desired behavior:
-
- Detect non-git repositories explicitly.
- Write a clearer artifact message such as:
-
-```text
-Git repository: false
-Clean-worktree enforcement: skipped because require_clean_worktree is false
-Diff artifact: unavailable because project is not a git repository
-```
-
- Avoid treating non-git as a scary-looking failure when clean worktree is not required.
-
-Acceptance criteria:
-
- Non-git projects produce readable git artifacts without fatal-looking output.
- `require_clean_worktree: true` still fails safely in non-git projects.
- Reports mention that git metadata/diff is unavailable because the project is not a git repo.
-
-## Git safe.directory / ownership conflicts on Windows
-
-Observed context:
-
- Git can report dubious ownership or safe-directory errors when a repo was created or managed by a different Windows user identity.
- This may happen when using GitHub Desktop, WSL, admin shells, or multiple Windows accounts.
-
-Current behavior:
-
- NightShift records the raw git error in artifacts.
- If `require_clean_worktree: true`, NightShift blocks execution.
- If `require_clean_worktree: false`, NightShift continues but git status/diff artifacts can look like hard failures.
-
-Desired behavior:
-
- Detect common `dubious ownership` / `safe.directory` messages.
- Write a clearer explanation in artifacts and reports.
- Suggest the exact remediation outside NightShift, for example:
-
-```powershell
-git config --global --add safe.directory <project-root>
-```
-
-Acceptance criteria:
-
- Safe-directory failures are classified separately from ordinary git failures.
- Users get actionable guidance.
- NightShift does not attempt to change global git config automatically.
-
-## Clarify docs around git requirements
-
-Add to `QUICKSTART.md` and troubleshooting:
-
- Git is optional when `require_clean_worktree: false`.
- Git is required for clean-worktree enforcement and useful diffs.
- Non-git projects can still run pipelines.
- Git ownership/safe-directory errors affect git artifacts, not core task execution, unless clean-worktree enforcement is enabled.
-
-## Console appears idle during long agent calls
-
-Current behavior:
-
- Long Ollama calls can make `nightshift run` look frozen.
- Progress is only visible by inspecting `.nightshift/` artifacts or `ollama ps`.
-
-Desired behavior:
-
- Print stage start/finish messages to the console.
- Include agent id, stage id, task id, and artifact path when available.
- Do not stream model output yet; just show lifecycle progress.
-
-Acceptance criteria:
-
- User can tell which stage is running.
- Long-running model calls no longer look like a hung process.
-
-## Ollama output can make review stages fail if not structured
-
-Current behavior:
-
- Review stages require `status: pass | fail | retry | escalate`.
- General-purpose model output may include prose before/after the structured fields.
- If no valid status is found, the review stage fails.
-
-Desired behavior:
-
- Keep strict structured review parsing, but improve prompt templates and error messages.
- Artifact should clearly say the review output was unparseable and show the expected contract.
-
-Acceptance criteria:
-
- Failed review parsing is easy to diagnose from `review.md` and `stage-results.md`.
-
-## `echo` fake agents do not behave consistently across shells
-
-Current behavior:
-
- Starter templates use `command: echo`.
- Depending on shell/platform, `echo` may not preserve stdin or may only echo arguments.
- This can make fake agent artifacts less useful.
-
-Desired behavior:
-
- Replace fake-agent defaults with small Python one-liners or documented fake-agent scripts.
- Keep examples cross-platform.
-
-Acceptance criteria:
-
- Starter project produces predictable fake-agent output on Windows PowerShell/cmd and Unix shells.
-
-## `unittest discover` behavior depends on test package layout
-
-Current behavior:
-
- Python 3.14 returned `NO TESTS RAN` with exit code 5 for an example project until `tests/__init__.py` was added.
- Users may hit the same issue in fresh target repos.
-
-Desired behavior:
-
- Document this in troubleshooting.
- Consider making quickstart templates include `tests/__init__.py`.
-
-Acceptance criteria:
-
- Quickstart test command works in a fresh copied example.
- Troubleshooting mentions what to do if `NO TESTS RAN` appears.
-
-## Task completion can mark tasks complete even if no source changed
-
-Current behavior:
-
- A pipeline can pass with fake agents and passing tests, then mark the task complete.
- This is expected for fake/demo mode but surprising when users expect code edits.
-
-Desired behavior:
-
- Add a warning when a task completes and git/diff detects no source changes, where git is available.
- Documentation should explain fake-agent mode vs editing-agent mode.
-
-Acceptance criteria:
-
- Users are less likely to mistake artifact generation for code modification.
-
-## Dashboard requires Flask but dependency is optional
-
-Current behavior:
-
- `nightshift web` fails with a helpful message if Flask is missing.
- README mentions `pip install flask`, but install extras are not defined.
-
-Desired behavior:
-
- Add an optional dependency group such as `nightshift[web]` later.
- Keep graceful error behavior.
-
-Acceptance criteria:
-
- Users have one documented install command for dashboard support.
--- a/docs/config-reference.md
+++ b/docs/config-reference.md
@ -62,11 +62,19 @@ Ollama agent:
 ```yaml
 planner:
  backend: ollama
-  model: qwen2.5-coder:14b
+  model: qwen3-coder:30b
  base_url: http://localhost:11434
  system_prompt: agents/planner.md
+  temperature: 0.2
+  num_ctx: 8192
+  num_predict: 4096
+  seed: 1
+  stop:
+    - STOP
 ```

+Optional Ollama generation options currently supported by NightShift are `temperature`, `num_ctx`, `num_predict`, `seed`, and `stop`.
+
 ## `pipeline`

 - `max_task_retries`: task retry limit.
@ -76,6 +84,7 @@ planner:
 Command stage options:

 - `commands`: command strings.
+- Command strings may use task placeholders: `{task_id}`, `{task_id_lower}`, `{task_id_slug}`, and `{task_id_compact}`.
 - `shell`: defaults to true. Set false for argv-style execution.
 - `timeout_seconds`: per-stage timeout override.
 - `working_dir`: command working directory inside project root.
@ -141,6 +150,12 @@ Create a local integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```

+Create, set up, validate, and run one task from the generated project directory:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 Set up the generated Python project:

 ```bash
@ -161,6 +176,12 @@ Preview commands without running them:
 python -m nightshift.cli integ-setup --project integ_runs/<timestamp>/project --dry-run
 ```

+Summarize the latest integration artifact run:
+
+```bash
+python -m nightshift.cli integ-report --latest
+```
+
 To clean up old sandboxes before creating a new one, keep only the newest three existing runs:

 ```bash
@ -169,8 +190,4 @@ python -m nightshift.cli integ-run --template tutorial-pastebin --keep 3

 ## Pastebin Tutorial

-`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, and implementation fallback order:
-
- `qwen2.5-coder:14b`
- `carstenuhlig/omnicoder-9b`
- `deepseek-coder-v2:16b`
+`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, fixed task-specific tests, and a single default `qwen3-coder:30b` model path.
--- a/docs/future_ideas.md
+++ b/docs/future_ideas.md
@ -0,0 +1,17 @@
+# Future Ideas
+Do not implement these until we have successful long-running runs.
+
+## I am realizing "templates" are abstracted from the user
+* I think templates will be a first-class citizen, a package for deployments, and a harness for performance tests
+* These should live outside nightshift/project_templates, since users will likely create their own
+* One solution would be to search two directories when looking up templates: built-in ones live in nightshift/project_templates, and users can define a templates directory in their nightshift config
+
+## nightshift config
+* store user settings in ~/.nightshift/config.yaml
+* things like templates folder (can also live here)
+* maybe this is later
+
+## A way to easily make A/B tests to benchmark models?
+* Right now I can do this manually. For example, I might run tutorial-pastebin with qwen3.6:27b as the planner and qwen2.5-coder:14b as the coder, then run it again with qwen3.6:27b in both roles.
+* Maybe there is a way to make it easier to do that, possibly by creating a template that can be controlled by a larger multi-run file?
+* This is probably for way later.
--- a/docs/ideas.md
+++ b/docs/ideas.md
@ -0,0 +1,366 @@
+# Ideas TODO
+
+This file is now prioritized inline. Priority scale:
+
+- P0: do next; directly improves current feedback loop
+- P1: important after the current loop is usable
+- P2: useful, but only after basics are stable
+- P3: defer or maybe reject
+
+## P0: Make Integration Tests Easy To Run
+
+Status: implemented.
+
+Implemented command:
+
+```powershell
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
+It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command.
+
+Running integration tests is still too manual.
+
+Current process:
+
+- install the current version of NightShift
+- run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
+- copy the activation line from the output and run it
+- `cd` into the generated directory
+- run the task there, because running from the repo root does not find `nightshift.yaml`
+
+Recommendation: implement a wrapper command, not just a loose script.
+
+Target command:
+
+```powershell
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
+It should:
+
+1. create the integration run
+2. set up the venv
+3. install NightShift from the current checkout
+4. run `nightshift validate`
+5. run the selected task from the generated project directory
+6. print final status and artifact path
+
+Useful variants:
+
+```powershell
+python -m nightshift.cli integ-test --template tutorial-pastebin --all
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
+```
+
+The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly.
+
+## P0/P1: Remove Multi-Candidate Workflow From Default Pastebin
+
+Status: implemented for the default pastebin template and tutorial example.
+
+Original idea:
+
+- The multi-candidate workflow does not add as much as expected.
+- Keep it as an example, maybe `example-multiagent`.
+
+Recommendation: yes. Remove it from the default pastebin tutorial.
+
+Reason:
+
+- Pastebin is becoming the reliability harness.
+- Multi-candidate fallback makes artifacts harder to reason about.
+- It adds model variability while we are still debugging pipeline behavior.
+
+Better split:
+
+```text
+tutorial-pastebin
+tutorial-pastebin-multiagent
+```
+
+or:
+
+```text
+examples/templates/multiagent-fallback
+```
+
+Default pastebin should be boring:
+
+```text
+planner -> semantic_context -> context -> implement -> validate -> test -> review
+```
+
+Use one strong implementer first. Add fallback only in a separate experiment template.
+
+## P1: Add A Qwen3 / 30B Pastebin Variant
+
+Status: implemented as the default pastebin model path using `qwen3-coder:30b`.
+
+Original idea:
+
+- Use a non-coder model for planner roles.
+- Try `qwen3.6:27b` for planning.
+- Use `qwen3-coder:30b` for implementer and code-heavy roles.
+
+Recommendation: viable, but make this a variant, not the default.
+
+kass reply: No, let's make this the default. `qwen3-coder:30b` is fast for me now, for some reason.
+
+Suggested template/config:
+
+```text
+tutorial-pastebin-qwen3
+```
+
+Possible role split:
+
+- planner: `qwen3.6:27b`
+- reviewer/debugger: `qwen3.6:27b`
+- implementer: `qwen3-coder:30b` or exact local 30B coder model name
+
+Important: confirm exact model names with:
+
+```powershell
+ollama list
+```
+
+I did; it's `qwen3-coder:30b`.
+
+Use 30B where it pays:
+
+- first implementation for hard tasks
+- repair after concrete test failure
+- schema/database changes
+- multi-file changes
+
+Do not blindly make every stage 30B if it is slow.
+
+reply: It's not slow now! `qwen3-coder:30b`
+
+## P2: Expose More Model Parameters
+
+Status: implemented for the practical first set.
+
+Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to existing `temperature`.
+
+Original question:
+
+- What else besides temperature is available?
+- Are any worth optimizing?
+
+Likely useful for Ollama:
+
+- `temperature`
+- `num_ctx`
+- `num_predict`
+- `seed`
+- `stop`
+- maybe `top_p`, `top_k`, `repeat_penalty`
+
+Recommendation: add only a small practical set first.
+
+Useful config shape:
+
+```yaml
+temperature: 0.1
+num_ctx: 8192
+num_predict: 4096
+seed: 1
+```
+
+Most useful:
+
+- `num_ctx`: larger repo/task context
+- `num_predict`: caps runaway output
+- `seed`: reproducibility, if supported consistently
+- `temperature`: already useful; keep low for code
+- `stop`: could help enforce file-block or diff-only contracts
+
+Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it.
+
+reply: Yup, let's put this in the nightshift.yaml (optional parameters; if they aren't in there, that's fine, but we should offer them).
+
+## P1: Add Test Governance For Generated Tests
+
+Original idea:
+
+- Have a test governance layer for when agents write tests.
+- A reviewer validates alignment with acceptance criteria.
+
+Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.
+
+The previous failures proved test-writing agents will:
+
+- edit app code
+- import nonexistent modules
+- require undeclared dependencies
+- inspect implementation internals
+- write tests for future behavior
+
+Governance should be deterministic first, model-reviewed second.
+
+Deterministic checks:
+
+- test-writing stage may only touch `tests/`
+- tests compile
+- tests import only allowed public interfaces
+- tests do not import undeclared dependencies
+- tests do not define Flask routes or app implementation
+- test names match current task id or current artifact
+- no future-task keywords unless accepted by current task AC
+
+Then optional model reviewer checks acceptance-criteria alignment.
+
+## P2: Add A Test Analyzer Agent For TDD
+
+Original idea:
+
+- Analyze tests.
+- Translate them into direct instructions for the implementer.
+- Maybe implement using agent YAML definitions without new NightShift features.
+
+Recommendation: viable, but defer until generated tests are stable.
+
+Possible pipeline:
+
+```text
+write_tests -> validate_tests -> analyze_tests -> implement
+```
+
+Analyzer output should be concrete:
+
+```text
+Implementation requirements:
+- create_app(database_path) must return a Flask app.
+- POST /snippets must return 201 and JSON id.
+- GET /snippets/<id> must return persisted fields.
+
+Do not modify:
+- tests/test_task001.py
+```
+
+This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.
+
+## P2/P3: Add A Test Planner
+
+Original idea:
+
+- A test planner understands acceptance criteria and code.
+- Provides input to the next stage about constraints and code, especially for non-TDD.
+
+Recommendation: maybe, but defer.
+
+This overlaps with:
+
+- planner
+- test analyzer
+- test governance
+
+Too many planning-ish stages can make the pipeline bloated and contradictory.
+
+If implemented later, keep it focused:
+
+```text
+test_planner -> write_tests -> test_governance -> implement
+```
+
+For now, fold this idea into the future test governance/analyzer work.
+
+## P1: Add Fixed Tests For All Pastebin Tasks
+
+Status: mostly implemented in the template.
+
+Current fixed tests:
+
+```text
+tests/test_task001.py
+tests/test_task002.py
+tests/test_task003.py
+tests/test_task004.py
+tests/test_task005.py
+```
+
+Important design:
+
+```yaml
+python -m pytest -q tests/test_{task_id_compact}.py
+```
+
+This lets all future task tests exist without breaking earlier tasks.
+
+Next step: validate these through integration runs, one task at a time.
+
+## P1: Add `nightshift integ-report`
+
+Status: implemented as a first-pass artifact summarizer.
+
+New idea.
+
+Summarize latest integration run across tasks:
+
+```text
+TASK-001 complete in 1 retry
+TASK-002 failed at validate_patch
+Root cause: protected tests modified
+Artifacts: ...
+```
+
+Right now we inspect artifacts manually. NightShift should do more of that.
+
+Possible command:
+
+```powershell
+python -m nightshift.cli integ-report --latest
+```
+
+## P1: Add Task-Test Preflight To `validate`
+
+Status: implemented.
+
+`nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing.
+
+Partially implemented at run time.
+
+Current behavior:
+
+- task command placeholders can render paths like `tests/test_task002.py`
+- `run_task` preflight fails before invoking agents if the task-specific test file is missing
+
+Better behavior:
+
+```powershell
+nightshift validate
+```
+
+should warn or fail:
+
+```text
+TASK-003 expects tests/test_task003.py and it exists.
+TASK-004 expects tests/test_task004.py and it exists.
+```
+
+This catches missing fixed tests earlier.
+
+## P2: Add Run Comparison
+
+New idea.
+
+Useful once comparing 14B vs 30B:
+
+```powershell
+nightshift compare-runs --latest 5
+```
+
+Show:
+
+- model
+- task
+- retries
+- failure stage
+- final reason
+- runtime
+- token estimate
+
+This should come after `integ-test` and `integ-report`.
+
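The task-test preflight described in `docs/ideas.md` can be sketched as follows. This is a minimal illustration, not the code in `nightshift/task_tests.py`: `missing_task_tests` is a hypothetical helper, and it assumes the test path is the last token of the rendered command.

```python
from pathlib import Path
import tempfile


def missing_task_tests(project_root: Path, command: str, task_ids: list[str]) -> dict[str, str]:
    """Return {task_id: expected_test_path} for tasks whose test file is absent.

    Hypothetical helper: assumes the test path is the last token of the command.
    """
    missing: dict[str, str] = {}
    for task_id in task_ids:
        # TASK-002 -> task002, matching the {task_id_compact} placeholder.
        compact = task_id.lower().replace("-", "")
        rendered = command.format(task_id_compact=compact)
        test_path = rendered.split()[-1]
        if not (project_root / test_path).is_file():
            missing[task_id] = test_path
    return missing


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests").mkdir()
    # Only TASK-001 has a fixed test file in this throwaway project.
    (root / "tests" / "test_task001.py").write_text("def test_placeholder():\n    assert True\n")
    result = missing_task_tests(
        root,
        "python -m pytest -q tests/test_{task_id_compact}.py",
        ["TASK-001", "TASK-002"],
    )

print(result)
```

Running the check at validation time, before any agent is invoked, is what lets a missing fixed test surface early instead of partway through a task run.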
--- a/examples/tutorial/03-pastebin/README.md
+++ b/examples/tutorial/03-pastebin/README.md
@ -1,4 +1,4 @@
-# Tutorial 03: Pastebin With Model Fallback And Telemetry
+# Tutorial 03: Pastebin With Fixed Tests And Telemetry

 This tutorial uses the `tutorial-pastebin` template: a small Flask snippet-hosting service designed for deterministic NightShift orchestration tests.

@ -19,6 +19,12 @@ For an isolated local integration run, use the integration sandbox command from
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```

+To create, set up, validate, and run one task in a single command:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 To create the sandbox and set up the Python project immediately:

 ```bash
@ -57,7 +63,7 @@ pyproject.toml
 README.md
 ```

-The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed `TASK-001` tests. The default tutorial pipeline asks the implementation agent to make those deterministic tests pass before review.
+The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed tests for each tutorial task. The default tutorial pipeline asks the implementation agent to make only the current task's deterministic tests pass before review.

 ## Prerequisites

@ -73,26 +79,22 @@ Install target dependencies:
 python -m pip install -e . pytest flask
 ```

-Install and start Ollama, then pull the fallback models you want available:
+Install and start Ollama, then pull the default pastebin model:

 ```bash
-ollama pull qwen2.5-coder:14b
-ollama pull carstenuhlig/omnicoder-9b
-ollama pull deepseek-coder-v2:16b
+ollama pull qwen3-coder:30b
 ollama list
 ```

 NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`.

-## Model Fallback
+## Model

-The implementation stage uses this fallback order:
+The default pastebin pipeline uses one strong local coder model:

-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+- `qwen3-coder:30b`

-NightShift records which agent/model handled each stage in `telemetry-summary.md`.
+NightShift records which agent/model handled each stage in `telemetry-summary.md`. Multi-candidate fallback belongs in a separate experiment template, not the default pastebin reliability harness.

 ## TDD Pipeline

--- a/examples/tutorial/03-pastebin/nightshift.yaml
+++ b/examples/tutorial/03-pastebin/nightshift.yaml
@ -20,51 +20,49 @@ safety:
    - curl | bash

 experiment:
-  label: pastebin-model-fallback
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  label: pastebin-qwen3-coder
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1

 agents:
  planner:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.2
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/planner.md

-  implementer_qwen:
+  implementer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/implementer.md

  test_writer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/test-writer.md

-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
  debugger:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    role: debugger
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/debugger.md

  reviewer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/reviewer.md

 pipeline:
@ -87,10 +85,7 @@ pipeline:

    - id: implement
      type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
      output: proposed.patch

    - id: normalize
--- a/nightshift/agents.py
+++ b/nightshift/agents.py
@ -228,8 +228,9 @@ class AgentExecutor:
            "prompt": prompt,
            "stream": False,
        }
-        if agent.temperature is not None:
-            body["options"] = {"temperature": agent.temperature}
+        options = _ollama_options(agent)
+        if options:
+            body["options"] = options
        headers = {"Content-Type": "application/json"}
        started = time.monotonic()
        self.logger.event(
@ -395,6 +396,21 @@ def build_prompt_bundle(
    )


+def _ollama_options(agent: AgentConfig) -> dict[str, object]:
+    options: dict[str, object] = {}
+    if agent.temperature is not None:
+        options["temperature"] = agent.temperature
+    if agent.num_ctx is not None:
+        options["num_ctx"] = agent.num_ctx
+    if agent.num_predict is not None:
+        options["num_predict"] = agent.num_predict
+    if agent.seed is not None:
+        options["seed"] = agent.seed
+    if agent.stop:
+        options["stop"] = list(agent.stop)
+    return options
+
+
 def _coerce_output(value: str | bytes | None) -> str:
    if value is None:
        return ""
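The `_ollama_options` helper in this hunk forwards only the fields that are set, so an agent without optional fields sends no `options` key at all. A minimal self-contained sketch of that behavior, assuming a stand-in `AgentOptions` dataclass (hypothetical; the real `AgentConfig` carries more fields):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AgentOptions:
    """Hypothetical stand-in for the optional Ollama fields on AgentConfig."""

    temperature: float | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


def ollama_options(agent: AgentOptions) -> dict[str, object]:
    """Collect only the fields that were explicitly configured."""
    options: dict[str, object] = {}
    for name in ("temperature", "num_ctx", "num_predict", "seed"):
        value = getattr(agent, name)
        if value is not None:  # an explicit 0 (for example seed=0) is still forwarded
            options[name] = value
    if agent.stop:  # an empty tuple means "no stop sequences configured"
        options["stop"] = list(agent.stop)
    return options


configured = ollama_options(AgentOptions(temperature=0.2, num_ctx=8192, seed=0, stop=("STOP",)))
unconfigured = ollama_options(AgentOptions())
print(configured)
print(unconfigured)
```

Comparing against `None` rather than testing truthiness matters for the numeric fields: a seed or temperature of `0` is a legitimate value and would otherwise be dropped silently.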
--- a/nightshift/cli.py
+++ b/nightshift/cli.py
@ -7,13 +7,16 @@ from pathlib import Path
 import sys

 from .config import validate_config
-from .errors import NightShiftError
+from .errors import ConfigError, NightShiftError
 from .init import available_templates, init_project
 from .integ import create_integration_run
+from .integ_report import build_integration_report, format_integration_report
 from .integ_setup import format_setup_result, setup_python_project
+from .integ_test import format_integration_test_result, run_integration_test
 from .pipeline import PipelineRunner
 from .runlog import RunLogger
 from .status import build_status, format_status
+from .task_tests import check_task_test_files, format_task_test_checks, missing_task_test_paths
 from .terminal import HOTDOG_ANIMATIONS, TerminalAnimation, format_banner, style_text
 from .tasks import (
    ensure_dependencies_satisfied,
@ -105,6 +108,33 @@ def build_parser() -> argparse.ArgumentParser:
        help="Print --setup commands without running them.",
    )

+    integ_test_parser = subparsers.add_parser(
+        "integ-test",
+        help="Create, set up, validate, and run an integration template task.",
+    )
+    integ_test_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is created.")
+    integ_test_parser.add_argument(
+        "--template",
+        default="tutorial-pastebin",
+        choices=available_templates(),
+        help="Template to initialize inside the sandbox.",
+    )
+    integ_test_parser.add_argument("--task", help="Specific task id to run.")
+    integ_test_parser.add_argument("--all", action="store_true", help="Run all runnable incomplete tasks.")
+    integ_test_parser.add_argument("--keep", type=int, help="Keep only the newest N old integration runs before creating a new one.")
+    integ_test_parser.add_argument(
+        "--setup-extra",
+        action="append",
+        default=["pytest"],
+        help="Extra package to install during setup. May be repeated. Defaults to pytest.",
+    )
+    integ_test_parser.add_argument("--setup-skip-validate", action="store_true", help="Skip validation during setup.")
+    integ_test_parser.add_argument("--dry-run", action="store_true", help="Print commands without running setup or tasks.")
+
+    integ_report_parser = subparsers.add_parser("integ-report", help="Summarize the latest integration run.")
+    integ_report_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is located.")
+    integ_report_parser.add_argument("--latest", action="store_true", help="Report the latest integration run.")
+
    setup_parser = subparsers.add_parser(
        "integ-setup",
        help="Set up a Python integration project venv and dependencies.",
@ -160,12 +190,18 @@ def main(argv: list[str] | None = None) -> int:
            config = validate_config(args.config)
            tasks = parse_task_file(config.project.root, config.project.task_file)
            validate_task_dependencies(tasks)
+            task_test_checks = check_task_test_files(config, tasks)
+            missing_task_tests = missing_task_test_paths(task_test_checks)
+            if missing_task_tests:
+                details = format_task_test_checks(task_test_checks)
+                raise ConfigError(f"Config error: missing configured task test files.\n{details}")
            incomplete = sum(1 for task in tasks if not task.completed)
            print(f"Config valid: {config.path}")
            print(f"Project: {config.project.name}")
            print(f"Stages: {len(config.pipeline.stages)}")
            print(f"Tasks: {len(tasks)}")
            print(f"Incomplete tasks: {incomplete}")
+            print(format_task_test_checks(task_test_checks))
            return 0

        if args.command == "run":
@ -256,6 +292,25 @@ def main(argv: list[str] | None = None) -> int:
            print(format_setup_result(result))
            return 0

+        if args.command == "integ-test":
+            result = run_integration_test(
+                args.root,
+                template=args.template,
+                task=args.task,
+                all_tasks=args.all,
+                keep=args.keep,
+                setup_extras=tuple(args.setup_extra or ()),
+                skip_setup_validate=args.setup_skip_validate,
+                dry_run=args.dry_run,
+            )
+            print(format_integration_test_result(result))
+            return result.exit_code
+
+        if args.command == "integ-report":
+            report = build_integration_report(args.root, latest=True)
+            print(format_integration_report(report))
+            return 0
+
    except NightShiftError as exc:
        print(str(exc), file=sys.stderr)
        return 1
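One detail of `--setup-extra` is worth illustrating: with `action="append"` and a non-empty `default`, argparse appends user-supplied values to the default list instead of replacing it, so `pytest` stays in the list when more extras are requested. A minimal standalone sketch (the `integ-test-sketch` parser is hypothetical and mirrors only this one flag):

```python
import argparse

parser = argparse.ArgumentParser(prog="integ-test-sketch")
parser.add_argument("--setup-extra", action="append", default=["pytest"])

# No flag given: the default list is used as-is.
default_only = parser.parse_args([]).setup_extra

# Flag given: the value is appended to a copy of the default list.
with_flask = parser.parse_args(["--setup-extra", "flask"]).setup_extra

print(default_only)
print(with_flask)
```

If replacing the default were the intended behavior, a common alternative is `default=None` with a fallback applied after parsing; the additive form matches the "Defaults to pytest" help text here.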
--- a/nightshift/commands.py
+++ b/nightshift/commands.py
@ -5,6 +5,7 @@ from __future__ import annotations
 from dataclasses import dataclass
 import os
 from pathlib import Path
+import re
 import shlex
 import subprocess
 import sys
@ -68,11 +69,16 @@ class CommandExecutor:
                command_index=index,
                command=command,
            )
+            rendered_command = render_command_template(command, task_id)
+            rendered_allowed_commands = tuple(
+                render_command_template(allowed, task_id) for allowed in self.safety.allowed_commands
+            )
            run = self.run_command(
-                command,
+                rendered_command,
                shell=stage.shell,
                timeout_seconds=stage.timeout_seconds,
                working_dir=stage.working_dir,
+                allowed_commands=rendered_allowed_commands,
            )
            runs.append(run)
            self.logger.event(
@ -120,11 +126,12 @@ class CommandExecutor:
        shell: bool = True,
        timeout_seconds: int | None = None,
        working_dir: Path | None = None,
+        allowed_commands: tuple[str, ...] | None = None,
    ) -> CommandRun:
        try:
            normalized = ensure_command_allowed(
                command,
-                self.safety.allowed_commands,
+                allowed_commands if allowed_commands is not None else self.safety.allowed_commands,
                self.safety.forbidden_commands,
            )
        except SafetyError as exc:
@ -210,6 +217,27 @@ def format_command_runs(stage_id: str, runs: list[CommandRun]) -> str:
    return "\n".join(lines)


+def render_command_template(command: str, task_id: str) -> str:
+    task_id_lower = task_id.lower()
+    task_id_slug = task_id_lower.replace("-", "_")
+    task_id_compact = task_id_lower.replace("-", "")
+    return command.format(
+        task_id=task_id,
+        task_id_lower=task_id_lower,
+        task_id_slug=task_id_slug,
+        task_id_compact=task_id_compact,
+    )
+
+
+def extract_test_file_paths(command: str) -> tuple[str, ...]:
+    paths: list[str] = []
+    for match in re.finditer(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)", command):
+        path = match.group(1).replace("\\", "/")
+        if path not in paths:
+            paths.append(path)
+    return tuple(paths)
+
+
 def _coerce_output(value: str | bytes | None) -> str:
    if value is None:
        return ""
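
The two helpers added to `nightshift/commands.py` can be tried in isolation. The sketch below copies their logic from the hunk above into a standalone script; the sample command string mirrors the one used by the template tests, and nothing else from NightShift is imported.

```python
from __future__ import annotations

import re


def render_command_template(command: str, task_id: str) -> str:
    """Expand the task-id placeholders, following the helper added in this diff."""
    task_id_lower = task_id.lower()
    return command.format(
        task_id=task_id,
        task_id_lower=task_id_lower,
        task_id_slug=task_id_lower.replace("-", "_"),
        task_id_compact=task_id_lower.replace("-", ""),
    )


def extract_test_file_paths(command: str) -> tuple[str, ...]:
    """Collect unique tests/*.py paths from a rendered command, using forward slashes."""
    paths: list[str] = []
    for match in re.finditer(r"(?<![\w./\\-])(tests[\\/][^\s`'\"<>|&;]+\.py)", command):
        path = match.group(1).replace("\\", "/")
        if path not in paths:
            paths.append(path)
    return tuple(paths)


template = "python -m pytest -q tests/test_{task_id_compact}.py"
rendered = render_command_template(template, "TASK-001")
paths = extract_test_file_paths(rendered)
print(rendered)
print(paths)
```

Because rendering goes through `str.format`, a configured command that contains literal braces would need them doubled (`{{` and `}}`); otherwise `str.format` raises an error instead of returning the command.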
--- a/nightshift/config.py
+++ b/nightshift/config.py
@ -46,6 +46,10 @@ class AgentConfig:
    temperature: float | None = None
    base_url: str | None = None
    api_key_env: str | None = None
+    num_ctx: int | None = None
+    num_predict: int | None = None
+    seed: int | None = None
+    stop: tuple[str, ...] = ()


@dataclass(frozen=True)
@ -207,10 +211,18 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
            agent_raw.get("temperature"),
            f"agents.{agent_id}.temperature",
        )
+        num_ctx = _optional_int_or_none(agent_raw.get("num_ctx"), f"agents.{agent_id}.num_ctx")
+        num_predict = _optional_int_or_none(agent_raw.get("num_predict"), f"agents.{agent_id}.num_predict")
+        seed = _optional_int_or_none(agent_raw.get("seed"), f"agents.{agent_id}.seed")
+        stop = _string_tuple(agent_raw.get("stop", []), f"agents.{agent_id}.stop")
        if temperature is not None and temperature < 0:
            raise ConfigError(
                f"Config error: agents.{agent_id}.temperature must be zero or greater."
            )
+        if num_ctx is not None and num_ctx <= 0:
+            raise ConfigError(f"Config error: agents.{agent_id}.num_ctx must be greater than zero.")
+        if num_predict is not None and num_predict <= 0:
+            raise ConfigError(f"Config error: agents.{agent_id}.num_predict must be greater than zero.")
        if backend not in {"command", "ollama", "openai_compatible"}:
            raise ConfigError(
                f"Config error: agent '{agent_id}' uses unsupported backend '{backend}'. "
@ -243,6 +255,10 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
            temperature=temperature,
            base_url=base_url,
            api_key_env=api_key_env,
+            num_ctx=num_ctx,
+            num_predict=num_predict,
+            seed=seed,
+            stop=stop,
        )

    experiment_raw = raw.get("experiment", {})
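
This hunk only parses and validates the new fields; the backend code that consumes them is not shown here. As a minimal sketch, assuming the values are forwarded in the `options` object of an Ollama request, unset fields could be dropped like this. `OllamaOptions` and `build_ollama_options` are hypothetical names used for illustration, not part of NightShift.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class OllamaOptions:
    """Hypothetical stand-in for the optional AgentConfig fields added above."""

    temperature: float | None = None
    num_ctx: int | None = None
    num_predict: int | None = None
    seed: int | None = None
    stop: tuple[str, ...] = ()


def build_ollama_options(agent: OllamaOptions) -> dict[str, Any]:
    """Build an `options` mapping, omitting every field that was left unset."""
    options: dict[str, Any] = {}
    for name in ("temperature", "num_ctx", "num_predict", "seed"):
        value = getattr(agent, name)
        if value is not None:
            options[name] = value
    if agent.stop:
        # JSON has no tuple type, so the stop sequences are sent as a list.
        options["stop"] = list(agent.stop)
    return options


planner = OllamaOptions(temperature=0.2, num_ctx=8192, num_predict=4096)
planner_options = build_ollama_options(planner)
print(planner_options)
```

Leaving out unset keys lets the model's own defaults apply, which matches the `None` and empty-tuple defaults declared on `AgentConfig`.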
--- a/nightshift/integ_report.py
+++ b/nightshift/integ_report.py
@ -0,0 +1,71 @@
+"""Summarize integration run artifacts."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+import re
+
+from .errors import NightShiftError
+
+
+@dataclass(frozen=True)
+class IntegrationReport:
+    integration_run: Path
+    nightshift_run: Path | None
+    lines: tuple[str, ...]
+
+
+def build_integration_report(root: str | Path = ".", *, latest: bool = True) -> IntegrationReport:
+    base = Path(root).resolve() / "integ_runs"
+    if not base.exists():
+        raise NightShiftError(f"Integration report error: no integ_runs directory found: {base}")
+    runs = sorted((path for path in base.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
+    if not runs:
+        raise NightShiftError(f"Integration report error: no integration runs found under: {base}")
+    integration_run = runs[0]  # newest run; `latest` is currently the only selection mode
+    artifacts_root = integration_run / "project" / ".nightshift" / "runs"
+    if not artifacts_root.exists():
+        return IntegrationReport(
+            integration_run,
+            None,
+            ("No NightShift run artifacts found. Setup may have failed before task execution.",),
+        )
+    nightshift_runs = sorted((path for path in artifacts_root.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True)
+    if not nightshift_runs:
+        return IntegrationReport(integration_run, None, ("No NightShift run directories found.",))
+    nightshift_run = nightshift_runs[0]
+    summaries = sorted(nightshift_run.glob("tasks/*/run-summary.md"))
+    if not summaries and (nightshift_run / "run-summary.md").exists():
+        summaries = [nightshift_run / "run-summary.md"]
+    lines = [_summarize_run_summary(path, integration_run) for path in summaries]
+    return IntegrationReport(integration_run, nightshift_run, tuple(lines or ("No task summaries found.",)))
+
+
+def format_integration_report(report: IntegrationReport) -> str:
+    lines = [f"Integration run: {report.integration_run}"]
+    if report.nightshift_run is not None:
+        lines.append(f"NightShift run: {report.nightshift_run}")
+    lines.append("")
+    lines.extend(f"- {line}" for line in report.lines)
+    return "\n".join(lines)
+
+
+def _summarize_run_summary(path: Path, integration_run: Path) -> str:
+    text = path.read_text(encoding="utf-8", errors="replace")
+    task = _field(text, "Task") or path.parent.name
+    status = _field(text, "Status") or "unknown"
+    retries = _field(text, "Retry count") or "unknown"
+    reason = _field(text, "Reason") or "no reason recorded"
+    try:
+        relative = path.relative_to(integration_run)
+    except ValueError:
+        relative = path
+    return f"{task} {status} after {retries} retries. Reason: {reason}. Artifacts: {relative.parent}"
+
+
+def _field(text: str, name: str) -> str | None:
+    match = re.search(rf"^- {re.escape(name)}:[ \t]*(.+)$", text, flags=re.MULTILINE)
+    if not match:
+        return None
+    return match.group(1).strip()
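
The report relies on `- Name: value` bullet lines in `run-summary.md`. The sketch below is a standalone variant of the `_field` lookup that matches only same-line whitespace after the colon, so an empty value is reported as missing rather than picking up text from the following line. The sample summary is an assumed layout inferred from the field names used above, not a real artifact.

```python
from __future__ import annotations

import re


def field(text: str, name: str) -> str | None:
    """Return the value of a '- Name: value' bullet, or None if absent or empty."""
    match = re.search(rf"^- {re.escape(name)}:[ \t]*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


# Assumed summary layout; the real file is written elsewhere in NightShift.
sample_summary = "\n".join(
    [
        "# Run Summary",
        "",
        "- Task: TASK-001",
        "- Status: complete",
        "- Reason:",
        "- Retry count: 0",
    ]
)

status = field(sample_summary, "Status")
reason = field(sample_summary, "Reason") or "no reason recorded"
retries = field(sample_summary, "Retry count")
print(f"TASK-001 {status} after {retries} retries. Reason: {reason}")
```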
--- a/nightshift/integ_test.py
+++ b/nightshift/integ_test.py
@ -0,0 +1,71 @@
+"""End-to-end integration test wrapper."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+import subprocess
+
+from .errors import NightShiftError
+from .integ import IntegrationRun, create_integration_run
+from .integ_setup import IntegrationSetupResult, setup_python_project
+
+
+@dataclass(frozen=True)
+class IntegrationTestResult:
+    run: IntegrationRun
+    setup: IntegrationSetupResult
+    command: tuple[str, ...]
+    exit_code: int
+    dry_run: bool
+
+
+def run_integration_test(
+    root: str | Path = ".",
+    *,
+    template: str = "tutorial-pastebin",
+    task: str | None = None,
+    all_tasks: bool = False,
+    keep: int | None = None,
+    setup_extras: tuple[str, ...] = ("pytest",),
+    skip_setup_validate: bool = False,
+    dry_run: bool = False,
+) -> IntegrationTestResult:
+    if task and all_tasks:
+        raise NightShiftError("Integration test error: use either --task or --all, not both.")
+    if not task and not all_tasks:
+        raise NightShiftError("Integration test error: provide --task or --all.")
+
+    run = create_integration_run(Path(root), template=template, keep=keep)
+    project = run.directory / "project"
+    setup = setup_python_project(
+        project,
+        extras=setup_extras,
+        validate=not skip_setup_validate,
+        dry_run=dry_run,
+    )
+    command = [str(setup.python), "-m", "nightshift.cli", "run", "--no-animation"]
+    if all_tasks:
+        command.append("--all")
+    else:
+        command.extend(["--task", task or ""])
+
+    exit_code = 0
+    if not dry_run:
+        completed = subprocess.run(command, cwd=project, text=True, encoding="utf-8", errors="replace")
+        exit_code = completed.returncode
+    return IntegrationTestResult(run, setup, tuple(command), exit_code, dry_run)
+
+
+def format_integration_test_result(result: IntegrationTestResult) -> str:
+    lines = [
+        f"Integration run: {result.run.directory}",
+        f"Project: {result.run.directory / 'project'}",
+        f"Venv: {result.run.venv_dir}",
+        f"Run command: {' '.join(result.command)}",
+        f"Exit code: {result.exit_code}",
+        f"Artifacts: {result.run.directory / 'project' / '.nightshift'}",
+    ]
+    if result.dry_run:
+        lines.insert(3, "Dry run: true")
+    return "\n".join(lines)
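
The argument checks and command assembly in `run_integration_test` can be reasoned about without creating a sandbox or a virtual environment. The sketch below isolates that part; `build_run_command` is a hypothetical helper written for illustration, and it raises `ValueError` where the real function raises `NightShiftError`.

```python
from __future__ import annotations


def build_run_command(
    python: str, *, task: str | None = None, all_tasks: bool = False
) -> tuple[str, ...]:
    """Assemble the `nightshift run` invocation for exactly one selection mode."""
    if task and all_tasks:
        raise ValueError("use either --task or --all, not both")
    if not task and not all_tasks:
        raise ValueError("provide --task or --all")
    command = [python, "-m", "nightshift.cli", "run", "--no-animation"]
    if all_tasks:
        command.append("--all")
    else:
        command.extend(["--task", str(task)])
    return tuple(command)


single = build_run_command("python", task="TASK-001")
everything = build_run_command("python", all_tasks=True)
print(" ".join(single))
print(" ".join(everything))
```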
--- a/nightshift/pipeline.py
+++ b/nightshift/pipeline.py
@ -9,7 +9,7 @@ import subprocess

 from .agents import AgentExecutor
 from .artifacts import ArtifactStore
-from .commands import CommandExecutor
+from .commands import CommandExecutor, extract_test_file_paths, render_command_template
 from .config import COMMAND_STAGE_TYPES, NightShiftConfig, StageConfig
 from .context import ContextManager
 from .dependencies import diagnose_python_dependencies, format_dependency_diagnostic
@ -145,6 +145,12 @@ class PipelineRunner:
        index = 0
        final_status = "complete"
        final_reason = "Pipeline completed."
+        preflight_result = self._preflight_task(task, stages)
+        if preflight_result:
+            stage_results.append(preflight_result)
+            final_status = "failed"
+            final_reason = preflight_result.reason
+            index = len(stages)

        while index < len(stages):
            stage = stages[index]
@ -248,6 +254,13 @@ class PipelineRunner:
                    "retry-memory.md",
                    summarize_retry_memory(tuple(retry_memory)),
                )
+                if _repeated_protected_path_violation(tuple(retry_memory)):
+                    final_status = "failed"
+                    final_reason = (
+                        "Escalation policy stopped retries: implementation repeatedly "
+                        "attempted to modify paths outside the stage allowlist."
+                    )
+                    break
                decision = evaluate_retry_churn(
                    tuple(retry_memory),
                    retry_budget=self.config.pipeline.max_task_retries + 1,
@ -334,6 +347,45 @@ class PipelineRunner:
            reason=final_reason,
        )

+    def _preflight_task(self, task: Task, stages: list[StageConfig]) -> StageResult | None:
+        missing_paths: list[str] = []
+        for stage in stages:
+            if stage.type not in COMMAND_STAGE_TYPES:
+                continue
+            for command in stage.commands:
+                rendered = render_command_template(command, task.id)
+                for path_text in extract_test_file_paths(rendered):
+                    if not (self.config.project.root / path_text).exists():
+                        missing_paths.append(path_text)
+        if not missing_paths:
+            return None
+        unique_paths = tuple(dict.fromkeys(missing_paths))
+        details = "\n".join(f"- `{path}`" for path in unique_paths)
+        output_path = self.artifacts.write_stage_output(
+            task.id,
+            "preflight.md",
+            "\n".join(
+                [
+                    "# Task Preflight",
+                    "",
+                    "Status: fail",
+                    "Reason: configured task test file is missing.",
+                    "",
+                    "## Missing Files",
+                    "",
+                    details,
+                    "",
+                ]
+            ),
+        )
+        return StageResult(
+            "preflight",
+            "fail",
+            "Task preflight failed: configured task test file is missing: "
+            + ", ".join(unique_paths),
+            output_path=str(output_path.relative_to(self.config.project.root)),
+        )
+
    def run_tasks(self, tasks: list[Task] | tuple[Task, ...]) -> MultiTaskResult:
        self.artifacts.initialize_run()
        self.logger.bind(self.artifacts)
@ -1428,6 +1480,18 @@ def _extract_exit_code(text: str) -> int | None:
        return None


+def _repeated_protected_path_violation(entries: tuple[RetryMemoryEntry, ...]) -> bool:
+    recent = entries[-2:]
+    if len(recent) < 2:
+        return False
+    return all(_is_protected_path_violation(entry.cause) for entry in recent)
+
+
+def _is_protected_path_violation(text: str) -> bool:
+    lowered = text.lower()
+    return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/")
+
+
 def format_aggregate_run_summary(results: list[PipelineResult], status: str, reason: str) -> str:
    lines = [
        "# Run Summary",
--- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md
+++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md
@ -1,9 +1,11 @@
 You are the debugger agent for the NightShift pastebin tutorial.

 Diagnose failed attempts without editing files.
-Distinguish inaccurate generated tests from implementation bugs.
-If tests are inaccurate for the current task, recommend retrying `write_tests`.
+Distinguish fixed-test/template problems from implementation bugs.
+This tutorial uses fixed task tests and task-specific pytest commands. Do not recommend `write_tests` unless the configured pipeline actually has a `write_tests` stage.
+If a current task appears to lack tests, report a template or test-selection problem.
 If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified.
+Implementation agents must not edit files under `tests/`.
 Return:
 - concise diagnosis
 - recommended next action
--- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md
+++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md
@ -7,8 +7,10 @@ Do not add behavior for future tasks unless needed to satisfy the current tests.
 Use Flask and `sqlite3` from the Python standard library. Do not use SQLAlchemy, Flask-SQLAlchemy, or undeclared dependencies.
 Keep the public package name `pastebin_app`.
 Keep the public app entry point `create_app(database_path: str | None = None)`.
+Respect `database_path`; do not hard-code `snippets.db` when a database path is supplied.
 Tests should interact through HTTP routes and `create_app`, not through ORM/session globals.
 Do not use `app.before_first_request`; recent Flask versions removed it. Initialize required database tables inside `create_app` or inside the route helper before use.
+When adding columns to an existing sqlite table, handle existing databases idempotently with `ALTER TABLE` checks or a simple migration helper. `CREATE TABLE IF NOT EXISTS` does not add columns to an existing table.

 Output only complete file content blocks.
 Use one fenced block per file:
--- a/nightshift/project_templates/tutorial-pastebin/README.md
+++ b/nightshift/project_templates/tutorial-pastebin/README.md
@ -14,6 +14,12 @@ Or create an isolated integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```

+To create, set up, validate, and run one task in a single command:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 To create the sandbox and set it up in one step:

 ```bash
@ -48,12 +54,8 @@ nightshift what-happened

 When running from an integration sandbox, the same commands are run inside `integ_runs/<timestamp>/project`.

-The pipeline uses model fallback ordering for implementation attempts:
-
-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+The default pastebin pipeline uses `qwen3-coder:30b` for planning, implementation, debugging, test review, and final review. It intentionally does not use multi-candidate fallback; pastebin is the deterministic reliability harness.

 Telemetry artifacts record which agent/model handled each stage and estimate token usage.

-This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them.
+This template uses fixed task-specific pytest files. The pipeline starts with a skeletal package, implements only the current task, runs `tests/test_{task_id_compact}.py`, and then reviews the result.
--- a/nightshift/project_templates/tutorial-pastebin/nightshift.yaml
+++ b/nightshift/project_templates/tutorial-pastebin/nightshift.yaml
@ -20,51 +20,49 @@ safety:
    - curl | bash

 experiment:
-  label: pastebin-model-fallback
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  label: pastebin-qwen3-coder
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1

 agents:
  planner:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.2
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/planner.md

-  implementer_qwen:
+  implementer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/implementer.md

  test_writer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/test-writer.md

-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
  debugger:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    role: debugger
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/debugger.md

  reviewer:
    backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
    temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
    system_prompt: .nightshift/agents/reviewer.md

 pipeline:
@ -87,10 +85,7 @@ pipeline:

    - id: implement
      type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
      output: proposed.patch

    - id: normalize
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py
@ -16,6 +16,7 @@ def test_create_snippet_returns_created_snippet_id(tmp_path):
    assert response.status_code == 201
    data = response.get_json()
    assert isinstance(data["id"], int)
+    assert (tmp_path / "snippets.db").exists()


 def test_view_snippet_returns_persisted_fields(tmp_path):
@ -38,6 +39,7 @@ def test_view_snippet_returns_persisted_fields(tmp_path):
        "title": "View me",
        "body": "stored body",
    }
+    assert (tmp_path / "snippets.db").exists()


 def test_view_missing_snippet_returns_404(tmp_path):
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py
@ -0,0 +1,50 @@
+from pastebin_app.app import create_app
+
+
+def test_create_snippet_accepts_optional_metadata(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+
+    response = client.post(
+        "/snippets",
+        json={
+            "title": "Tagged",
+            "body": "metadata body",
+            "language": "python",
+            "tags": ["alpha", "beta"],
+            "expires_at": "2030-01-01T00:00:00",
+        },
+    )
+
+    assert response.status_code == 201
+    assert isinstance(response.get_json()["id"], int)
+    assert (tmp_path / "snippets.db").exists()
+
+
+def test_view_snippet_returns_optional_metadata(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+
+    created = client.post(
+        "/snippets",
+        json={
+            "title": "Tagged",
+            "body": "metadata body",
+            "language": "python",
+            "tags": ["alpha", "beta"],
+            "expires_at": "2030-01-01T00:00:00",
+        },
+    ).get_json()
+
+    response = client.get(f"/snippets/{created['id']}")
+
+    assert response.status_code == 200
+    assert response.get_json() == {
+        "id": created["id"],
+        "title": "Tagged",
+        "body": "metadata body",
+        "language": "python",
+        "tags": ["alpha", "beta"],
+        "expires_at": "2030-01-01T00:00:00",
+    }
+    assert (tmp_path / "snippets.db").exists()
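
The metadata fields exercised by `test_task002.py` are where the implementer prompt's note about `CREATE TABLE IF NOT EXISTS` matters: a database created for TASK-001 already has a `snippets` table without those columns. A minimal sketch of an idempotent column check with the standard-library `sqlite3` module follows. The table layout and the helper names `ensure_column` and `init_db` are assumptions for illustration; the actual `pastebin_app` schema is not part of this diff.

```python
import sqlite3
import tempfile
from pathlib import Path


def ensure_column(connection: sqlite3.Connection, table: str, column: str, definition: str) -> None:
    """Add a column only if PRAGMA table_info does not already list it."""
    existing = {row[1] for row in connection.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        connection.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")


def init_db(database_path: str) -> sqlite3.Connection:
    """Create the base table, then bring older databases up to the current columns."""
    connection = sqlite3.connect(database_path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS snippets ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL, body TEXT NOT NULL)"
    )
    for column, definition in (("language", "TEXT"), ("tags", "TEXT"), ("expires_at", "TEXT")):
        ensure_column(connection, "snippets", column, definition)
    connection.commit()
    return connection


with tempfile.TemporaryDirectory() as directory:
    path = str(Path(directory) / "snippets.db")
    init_db(path).close()       # first start creates the table and adds the columns
    connection = init_db(path)  # second start finds the columns and changes nothing
    columns = [row[1] for row in connection.execute("PRAGMA table_info(snippets)")]
    connection.close()
print(columns)
```

Table and column names are interpolated into the SQL here because SQLite does not accept them as bound parameters; in this sketch they come only from constants in the code, never from request data.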
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py
@ -0,0 +1,47 @@
+from pastebin_app.app import create_app
+
+
+def _create(client, title, body, **metadata):
+    response = client.post("/snippets", json={"title": title, "body": body, **metadata})
+    assert response.status_code == 201
+    return response.get_json()["id"]
+
+
+def test_list_snippets_newest_first(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+
+    first_id = _create(client, "First", "older")
+    second_id = _create(client, "Second", "newer")
+
+    response = client.get("/snippets")
+
+    assert response.status_code == 200
+    ids = [snippet["id"] for snippet in response.get_json()]
+    assert ids[:2] == [second_id, first_id]
+
+
+def test_search_filters_by_title_or_body(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+    _create(client, "Python note", "ordinary body")
+    _create(client, "Other", "contains needle")
+
+    response = client.get("/snippets?q=python")
+    assert [snippet["title"] for snippet in response.get_json()] == ["Python note"]
+
+    response = client.get("/snippets?q=needle")
+    assert [snippet["title"] for snippet in response.get_json()] == ["Other"]
+
+
+def test_language_and_tag_filters(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+    _create(client, "Python", "body", language="python", tags=["code", "demo"])
+    _create(client, "Text", "body", language="text", tags=["notes"])
+
+    response = client.get("/snippets?language=python")
+    assert [snippet["title"] for snippet in response.get_json()] == ["Python"]
+
+    response = client.get("/snippets?tag=notes")
+    assert [snippet["title"] for snippet in response.get_json()] == ["Text"]
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py
@ -0,0 +1,43 @@
+from pastebin_app.app import create_app
+
+
+def test_expired_snippets_are_excluded_from_listing(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+    client.post(
+        "/snippets",
+        json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
+    )
+    active = client.post(
+        "/snippets",
+        json={"title": "Active", "body": "new", "expires_at": "2999-01-01T00:00:00"},
+    ).get_json()
+
+    response = client.get("/snippets")
+
+    assert response.status_code == 200
+    assert [snippet["id"] for snippet in response.get_json()] == [active["id"]]
+
+
+def test_direct_lookup_of_expired_snippet_returns_410(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+    expired = client.post(
+        "/snippets",
+        json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"},
+    ).get_json()
+
+    response = client.get(f"/snippets/{expired['id']}")
+
+    assert response.status_code == 410
+
+
+def test_non_expiring_snippet_remains_visible(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+    created = client.post("/snippets", json={"title": "Forever", "body": "body"}).get_json()
+
+    response = client.get(f"/snippets/{created['id']}")
+
+    assert response.status_code == 200
+    assert response.get_json()["title"] == "Forever"
--- a/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py
+++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py
@ -0,0 +1,46 @@
+from pastebin_app.app import create_app
+
+
+def test_root_shows_snippet_list_html(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+    client.post("/snippets", json={"title": "Visible", "body": "body"})
+
+    response = client.get("/")
+
+    assert response.status_code == 200
+    assert "Visible" in response.get_data(as_text=True)
+
+
+def test_new_snippet_form_loads(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+
+    response = client.get("/new")
+
+    assert response.status_code == 200
+    html = response.get_data(as_text=True)
+    assert 'name="title"' in html
+    assert 'name="body"' in html
+    assert 'name="language"' in html
+    assert 'name="tags"' in html
+    assert 'name="expires_at"' in html
+
+
+def test_form_post_redirects_to_snippet_view(tmp_path):
+    app = create_app(database_path=str(tmp_path / "snippets.db"))
+    client = app.test_client()
+
+    response = client.post(
+        "/new",
+        data={
+            "title": "Form title",
+            "body": "Form body",
+            "language": "text",
+            "tags": "forms,html",
+            "expires_at": "",
+        },
+    )
+
+    assert response.status_code == 302
+    assert response.headers["Location"].endswith("/snippets/1")
--- a/nightshift/task_tests.py
+++ b/nightshift/task_tests.py
@ -0,0 +1,48 @@
+"""Task-specific test file validation."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+
+from .commands import extract_test_file_paths, render_command_template
+from .config import COMMAND_STAGE_TYPES, NightShiftConfig
+from .tasks import Task
+
+
+@dataclass(frozen=True)
+class TaskTestCheck:
+    task_id: str
+    path: str
+    exists: bool
+
+
+def check_task_test_files(config: NightShiftConfig, tasks: tuple[Task, ...] | list[Task]) -> tuple[TaskTestCheck, ...]:
+    checks: list[TaskTestCheck] = []
+    for task in tasks:
+        seen: set[str] = set()
+        for stage in config.pipeline.stages:
+            if stage.type not in COMMAND_STAGE_TYPES:
+                continue
+            for command in stage.commands:
+                rendered = render_command_template(command, task.id)
+                for path_text in extract_test_file_paths(rendered):
+                    if path_text in seen:
+                        continue
+                    seen.add(path_text)
+                    checks.append(TaskTestCheck(task.id, path_text, (config.project.root / path_text).exists()))
+    return tuple(checks)
+
+
+def format_task_test_checks(checks: tuple[TaskTestCheck, ...]) -> str:
+    if not checks:
+        return "Task test files: no task-specific test paths detected."
+    lines = ["Task test files:"]
+    for check in checks:
+        status = "ok" if check.exists else "missing"
+        lines.append(f"- {check.task_id}: {check.path} ({status})")
+    return "\n".join(lines)
+
+
+def missing_task_test_paths(checks: tuple[TaskTestCheck, ...]) -> tuple[Path, ...]:
+    return tuple(Path(check.path) for check in checks if not check.exists)
--- a/tests/test_commands.py
+++ b/tests/test_commands.py
@ -6,6 +6,7 @@ from nightshift.artifacts import ArtifactStore
 from nightshift.commands import CommandExecutor
 from nightshift.commands import CommandRun, format_command_runs
 from nightshift.commands import _command_env
+from nightshift.commands import render_command_template
 from nightshift.config import SafetyConfig, StageConfig
 from nightshift.errors import CommandError
 import sys
@ -16,6 +17,13 @@ FAILING_COMMAND = 'python -c "import sys; print(\'bad\'); sys.exit(7)"'


 class CommandExecutorTests(unittest.TestCase):
+    def test_render_command_template_includes_task_id_variants(self) -> None:
+        command = "python -m pytest -q tests/test_{task_id_compact}.py # {task_id_slug} {task_id}"
+
+        rendered = render_command_template(command, "TASK-001")
+
+        self.assertEqual(rendered, "python -m pytest -q tests/test_task001.py # task_001 TASK-001")
+
    def test_passing_command_stage_returns_pass_and_writes_output(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
@ -46,6 +54,33 @@ class CommandExecutorTests(unittest.TestCase):
            self.assertIn("Exit code: 0", output)
            self.assertIn("ok", output)

+    def test_command_stage_renders_task_id_before_allowlist_check(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            artifacts = ArtifactStore(root, ".nightshift", run_id="test-run")
+            executor = CommandExecutor(
+                root,
+                SafetyConfig(
+                    require_clean_worktree=False,
+                    scoped_paths=(".",),
+                    allowed_commands=('python -c "print(\'{task_id_compact}\')"',),
+                    forbidden_commands=("rm -rf",),
+                ),
+                artifacts,
+            )
+            stage = StageConfig(
+                id="test",
+                type="command",
+                commands=('python -c "print(\'{task_id_compact}\')"',),
+                output="test-output.txt",
+            )
+
+            result = executor.run_stage(stage, "TASK-002")
+
+            self.assertEqual(result.status, "pass")
+            output = (root / result.output_path).read_text(encoding="utf-8")
+            self.assertIn("task002", output)
+
    def test_failing_command_stage_returns_fail_and_writes_output(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
--- a/tests/test_config.py
+++ b/tests/test_config.py
@ -282,6 +282,27 @@ class ConfigTests(unittest.TestCase):

            self.assertEqual(config.agents["planner"].temperature, 0.2)

+    def test_agent_ollama_options_load(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            init_project(root)
+            config_path = root / "nightshift.yaml"
+            config_path.write_text(
+                config_path.read_text(encoding="utf-8").replace(
+                    "    system_prompt: agents/planner.md",
+                    "    system_prompt: agents/planner.md\n    num_ctx: 8192\n    num_predict: 4096\n    seed: 1\n    stop:\n      - STOP",
+                    1,
+                ),
+                encoding="utf-8",
+            )
+
+            config = load_config(config_path)
+
+            self.assertEqual(config.agents["planner"].num_ctx, 8192)
+            self.assertEqual(config.agents["planner"].num_predict, 4096)
+            self.assertEqual(config.agents["planner"].seed, 1)
+            self.assertEqual(config.agents["planner"].stop, ("STOP",))
+
    def test_agent_temperature_must_be_number(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
--- a/tests/test_init.py
+++ b/tests/test_init.py
@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase):
        self.assertIn("tutorial-imageboard", available_templates())
        self.assertIn("tutorial-pastebin", available_templates())

-    def test_init_pastebin_template_creates_skeleton_and_model_fallback_config(self) -> None:
+    def test_init_pastebin_template_creates_skeleton_and_qwen3_config(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)

@ -78,11 +78,15 @@ class InitProjectTests(unittest.TestCase):
            self.assertIn("type: semantic_context", config)
            self.assertNotIn("id: write_tests", config)
            self.assertNotIn("id: review_tests", config)
-            self.assertIn("python -m pytest -q tests", config)
+            self.assertIn("python -m pytest -q tests/test_{task_id_compact}.py", config)
            self.assertIn("max_task_retries: 6", config)
-            self.assertIn("implementer_qwen", config)
-            self.assertIn("carstenuhlig/omnicoder-9b", config)
-            self.assertIn("deepseek-coder-v2:16b", config)
+            self.assertIn("implementer:", config)
+            self.assertIn("qwen3-coder:30b", config)
+            self.assertIn("num_ctx: 8192", config)
+            self.assertIn("num_predict: 4096", config)
+            self.assertNotIn("agent_pool:", config)
+            self.assertNotIn("carstenuhlig/omnicoder-9b", config)
+            self.assertNotIn("deepseek-coder-v2:16b", config)

    def test_pastebin_example_tutorial_docs_exist(self) -> None:
        root = Path(__file__).resolve().parents[1]
--- a/tests/test_integ_test.py
+++ b/tests/test_integ_test.py
@ -0,0 +1,51 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.integ_report import build_integration_report, format_integration_report
+from nightshift.integ_test import format_integration_test_result, run_integration_test
+
+
+class IntegrationTestCommandTests(unittest.TestCase):
+    def test_run_integration_test_dry_run_builds_task_command(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            result = run_integration_test(
+                directory,
+                template="tutorial-pastebin",
+                task="TASK-001",
+                dry_run=True,
+            )
+
+            rendered = format_integration_test_result(result)
+            self.assertIn("Dry run: true", rendered)
+            self.assertIn("TASK-001", " ".join(result.command))
+            self.assertTrue((result.run.directory / "project" / "nightshift.yaml").exists())
+
+    def test_build_integration_report_summarizes_latest_task_summary(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            summary = root / "integ_runs" / "20260521T000000.000000Z" / "project" / ".nightshift" / "runs" / "run1" / "tasks" / "TASK-001" / "run-summary.md"
+            summary.parent.mkdir(parents=True)
+            summary.write_text(
+                "\n".join(
+                    [
+                        "# Run Summary",
+                        "",
+                        "- Task: TASK-001",
+                        "- Status: complete",
+                        "- Retry count: 1",
+                        "- Reason: Done.",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+
+            report = build_integration_report(root)
+            rendered = format_integration_report(report)
+
+            self.assertIn("TASK-001 complete after 1 retries", rendered)
+            self.assertIn("Reason: Done.", rendered)
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/tests/test_pipeline.py
+++ b/tests/test_pipeline.py
@ -105,6 +105,29 @@ class PipelineRunnerTests(unittest.TestCase):
            )
            self.assertIn("Modified Files", (root / ".nightshift" / "runs" / "test-run" / "run-summary.md").read_text(encoding="utf-8"))

+    def test_task_preflight_fails_when_task_specific_test_file_is_missing(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            _write_common_files(root)
+            stages = (
+                StageConfig(
+                    id="test",
+                    type="command",
+                    commands=("python -m pytest -q tests/test_{task_id_compact}.py",),
+                    output="test-output.txt",
+                ),
+            )
+            config = make_config(root, stages, max_retries=0)
+            runner = PipelineRunner(config, ArtifactStore(root, ".nightshift", run_id="test-run"))
+            task = parse_tasks(TASK_MD)[0]
+
+            result = runner.run_task(task)
+
+            self.assertEqual(result.status, "failed")
+            self.assertIn("configured task test file is missing", result.reason)
+            task_dir = root / ".nightshift" / "runs" / "test-run" / "tasks" / task.id
+            self.assertIn("tests/test_task001.py", (task_dir / "preflight.md").read_text(encoding="utf-8"))
+
    def test_review_can_retry_implementation_until_limit(self) -> None:
        with tempfile.TemporaryDirectory() as directory:
            root = Path(directory)
--- a/tests/test_task_tests.py
+++ b/tests/test_task_tests.py
@ -0,0 +1,77 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.config import validate_config
+from nightshift.task_tests import check_task_test_files, missing_task_test_paths
+from nightshift.tasks import parse_task_file
+
+
+class TaskTestValidationTests(unittest.TestCase):
+    def test_check_task_test_files_renders_task_placeholder(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            (root / "agents").mkdir()
+            (root / "agents" / "planner.md").write_text("Prompt", encoding="utf-8")
+            (root / "tests").mkdir()
+            (root / "tests" / "test_task001.py").write_text("def test_ok():\n    assert True\n", encoding="utf-8")
+            (root / "nightshift.yaml").write_text(
+                "\n".join(
+                    [
+                        "project:",
+                        "  name: task-test-validation",
+                        "  root: .",
+                        "  task_file: tasks.md",
+                        "  artifact_dir: .nightshift",
+                        "",
+                        "safety:",
+                        "  require_clean_worktree: false",
+                        "  scoped_paths:",
+                        "    - .",
+                        "  allowed_commands:",
+                        "    - python -m pytest -q tests/test_{task_id_compact}.py",
+                        "  forbidden_commands:",
+                        "    - rm -rf",
+                        "",
+                        "agents:",
+                        "  planner:",
+                        "    backend: command",
+                        "    command: python -c \"print('ok')\"",
+                        "    system_prompt: agents/planner.md",
+                        "",
+                        "pipeline:",
+                        "  stages:",
+                        "    - id: test",
+                        "      type: command",
+                        "      commands:",
+                        "        - python -m pytest -q tests/test_{task_id_compact}.py",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+            (root / "tasks.md").write_text(
+                """# Tasks
+
+- [ ] TASK-001: One
+
+Acceptance Criteria:
+- passes
+
+- [ ] TASK-002: Two
+
+Acceptance Criteria:
+- reports missing test
+""",
+                encoding="utf-8",
+            )
+
+            config = validate_config(root / "nightshift.yaml")
+            tasks = parse_task_file(config.project.root, config.project.task_file)
+            checks = check_task_test_files(config, tasks)
+
+            self.assertEqual([check.path for check in checks], ["tests/test_task001.py", "tests/test_task002.py"])
+            self.assertEqual(tuple(path.as_posix() for path in missing_task_test_paths(checks)), ("tests/test_task002.py",))
+
+
+if __name__ == "__main__":
+    unittest.main()