diff --git a/docs/bugfix_todo.md b/docs/bugfix_todo.md
deleted file mode 100644
index c71d24b..0000000
--- a/docs/bugfix_todo.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# Bugfix TODO
-
-## Some issues going with run --all
- reason=Stage 'review' requested unknown next stage 'None'. Not every time. I think there's a pattern that is out of place here. Maybe it's related to the last task success? Or the last run?
-
-
-
-## Going from individual tasks to --all fails
-
-If you do nightshift run --task TASK-001 and then that completes and then you go to nightshift run --all it fails on blocked by missing dependencies: TASK-001 . I think this is because the tasks get reset at the top of the run, but there is something marking completion of TASK-001 requiring manual reset.
-
-run --all should start at the first not done task (seems like it does)
-
-## Some kind of tool install feature
-
-Continually fails on flask_sqlalchemy until I install that.
-
-## Tutorial need to include . directory for imageboard
-
-## Git status artifacts are noisy for non-git repositories
-
-Observed artifact:
-
-```text
-# Git Status before
-
-Available: false
-Exit code: 128
-
-fatal: not a git repository (or any of the parent directories): .git
-```
-
-Current behavior:
-
-- NightShift continues when `require_clean_worktree: false`.
-- `git-status-before.txt`, `git-status-after.txt`, and `diff.patch` may contain git errors.
-- This is technically safe, but confusing for users running quickstart/demo projects outside git.
-
-Desired behavior:
-
-- Detect non-git repositories explicitly.
-- Write a clearer artifact message such as:
-
-```text
-Git repository: false
-Clean-worktree enforcement: skipped because require_clean_worktree is false
-Diff artifact: unavailable because project is not a git repository
-```
-
-- Avoid treating non-git as a scary-looking failure when clean worktree is not required.
-
-Acceptance criteria:
-
-- Non-git projects produce readable git artifacts without fatal-looking output.
-- `require_clean_worktree: true` still fails safely in non-git projects.
-- Reports mention that git metadata/diff is unavailable because the project is not a git repo.
-
-## Git safe.directory / ownership conflicts on Windows
-
-Observed context:
-
-- Git can report dubious ownership or safe-directory errors when a repo was created or managed by a different Windows user identity.
-- This may happen when using GitHub Desktop, WSL, admin shells, or multiple Windows accounts.
-
-Current behavior:
-
-- NightShift records the raw git error in artifacts.
-- If `require_clean_worktree: true`, NightShift blocks execution.
-- If `require_clean_worktree: false`, NightShift continues but git status/diff artifacts can look like hard failures.
-
-Desired behavior:
-
-- Detect common `dubious ownership` / `safe.directory` messages.
-- Write a clearer explanation in artifacts and reports.
-- Suggest the exact remediation outside NightShift, for example:
-
-```powershell
-git config --global --add safe.directory
-```
-
-Acceptance criteria:
-
-- Safe-directory failures are classified separately from ordinary git failures.
-- Users get actionable guidance.
-- NightShift does not attempt to change global git config automatically.
-
-## Clarify docs around git requirements
-
-Add to `QUICKSTART.md` and troubleshooting:
-
-- Git is optional when `require_clean_worktree: false`.
-- Git is required for clean-worktree enforcement and useful diffs.
-- Non-git projects can still run pipelines.
-- Git ownership/safe-directory errors affect git artifacts, not core task execution, unless clean-worktree enforcement is enabled.
-
-## Console appears idle during long agent calls
-
-Current behavior:
-
-- Long Ollama calls can make `nightshift run` look frozen.
-- Progress is only visible by inspecting `.nightshift/` artifacts or `ollama ps`.
-
-Desired behavior:
-
-- Print stage start/finish messages to the console.
-- Include agent id, stage id, task id, and artifact path when available.
-- Do not stream model output yet; just show lifecycle progress.
-
-Acceptance criteria:
-
-- User can tell which stage is running.
-- Long-running model calls no longer look like a hung process.
-
-## Ollama output can make review stages fail if not structured
-
-Current behavior:
-
-- Review stages require `status: pass | fail | retry | escalate`.
-- General-purpose model output may include prose before/after the structured fields.
-- If no valid status is found, the review stage fails.
-
-Desired behavior:
-
-- Keep strict structured review parsing, but improve prompt templates and error messages.
-- Artifact should clearly say the review output was unparseable and show the expected contract.
-
-Acceptance criteria:
-
-- Failed review parsing is easy to diagnose from `review.md` and `stage-results.md`.
-
-## `echo` fake agents do not behave consistently across shells
-
-Current behavior:
-
-- Starter templates use `command: echo`.
-- Depending on shell/platform, `echo` may not preserve stdin or may only echo arguments.
-- This can make fake agent artifacts less useful.
-
-Desired behavior:
-
-- Replace fake-agent defaults with small Python one-liners or documented fake-agent scripts.
-- Keep examples cross-platform.
-
-Acceptance criteria:
-
-- Starter project produces predictable fake-agent output on Windows PowerShell/cmd and Unix shells.
-
-## `unittest discover` behavior depends on test package layout
-
-Current behavior:
-
-- Python 3.14 returned `NO TESTS RAN` with exit code 5 for an example project until `tests/__init__.py` was added.
-- Users may hit the same issue in fresh target repos.
-
-Desired behavior:
-
-- Document this in troubleshooting.
-- Consider making quickstart templates include `tests/__init__.py`.
-
-Acceptance criteria:
-
-- Quickstart test command works in a fresh copied example.
-- Troubleshooting mentions what to do if `NO TESTS RAN` appears.
-
-## Task completion can mark tasks complete even if no source changed
-
-Current behavior:
-
-- A pipeline can pass with fake agents and passing tests, then mark the task complete.
-- This is expected for fake/demo mode but surprising when users expect code edits.
-
-Desired behavior:
-
-- Add a warning when a task completes and git/diff detects no source changes, where git is available.
-- Documentation should explain fake-agent mode vs editing-agent mode.
-
-Acceptance criteria:
-
-- Users are less likely to mistake artifact generation for code modification.
-
-## Dashboard requires Flask but dependency is optional
-
-Current behavior:
-
-- `nightshift web` fails with a helpful message if Flask is missing.
-- README mentions `pip install flask`, but install extras are not defined.
-
-Desired behavior:
-
-- Add an optional dependency group such as `nightshift[web]` later.
-- Keep graceful error behavior.
-
-Acceptance criteria:
-
-- Users have one documented install command for dashboard support.
diff --git a/docs/config-reference.md b/docs/config-reference.md
index b8faefa..97d787d 100644
--- a/docs/config-reference.md
+++ b/docs/config-reference.md
@@ -62,11 +62,19 @@ Ollama agent:
 ```yaml
 planner:
   backend: ollama
-  model: qwen2.5-coder:14b
+  model: qwen3-coder:30b
   base_url: http://localhost:11434
   system_prompt: agents/planner.md
+  temperature: 0.2
+  num_ctx: 8192
+  num_predict: 4096
+  seed: 1
+  stop:
+    - STOP
 ```
+
+Optional Ollama generation options currently supported by NightShift are `temperature`, `num_ctx`, `num_predict`, `seed`, and `stop`.
+
 ## `pipeline`
 
 - `max_task_retries`: task retry limit.
@@ -76,6 +84,7 @@ planner:
 Command stage options:
 
 - `commands`: command strings.
+- Command strings may use task placeholders: `{task_id}`, `{task_id_lower}`, `{task_id_slug}`, and `{task_id_compact}`.
 - `shell`: defaults to true. Set false for argv-style execution.
 - `timeout_seconds`: per-stage timeout override.
 - `working_dir`: command working directory inside project root.
@@ -141,6 +150,12 @@ Create a local integration sandbox from the NightShift repository root:
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```
 
+Create, set up, validate, and run one task from the generated project directory:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 Set up the generated Python project:
 
 ```bash
@@ -161,6 +176,12 @@ Preview commands without running them:
 python -m nightshift.cli integ-setup --project integ_runs//project --dry-run
 ```
 
+Summarize the latest integration artifact run:
+
+```bash
+python -m nightshift.cli integ-report --latest
+```
+
 To clean up old sandboxes before creating a new one, keep only the newest three existing runs:
 
 ```bash
@@ -169,8 +190,4 @@ python -m nightshift.cli integ-run --template tutorial-pastebin --keep 3
 
 ## Pastebin Tutorial
 
-`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, and implementation fallback order:
-
-- `qwen2.5-coder:14b`
-- `carstenuhlig/omnicoder-9b`
-- `deepseek-coder-v2:16b`
+`nightshift init --template tutorial-pastebin` creates a small Flask snippet-hosting target with deterministic tests and incremental NightShift tasks. Its pipeline includes semantic context retrieval, telemetry, debugger support, fixed task-specific tests, and a single default `qwen3-coder:30b` model path.
diff --git a/docs/future_ideas.md b/docs/future_ideas.md
new file mode 100644
index 0000000..342bc81
--- /dev/null
+++ b/docs/future_ideas.md
@@ -0,0 +1,17 @@
+### Future Ideas
+Not to implement until we get successful long running runs.
+
+## I am realizing "templates" are abstracted from the user
+* I think templates will be a first class citizen, a package for deployments, and a harness for performance tests
+* These should live external to nightshift/project_templates as users will likely create their own
+* one solution would be to reference two directories when looking up templates, builtin ones will be in nightshift/project_templates or users can define a templates directory in their nightshift config
+
+## nightshift config
+* store user settings in ~/.nightshift/config.yaml
+* things like templates folder (can also live here)
+* maybe this is later
+
+## A way to easily make A/B tests to benchmark models?
+* Right now I can do this manually, for example I want to run the tutorial-pastebin with qwen3.6:27b as the planner and qwen2.5-coder:14b as the coder, and another with qwen3.6:27b as both, etc.
+* Maybe there is a way to make it easier to do that, possibly by creating a template that can be controlled by a larger multi-run file?
+* This is probably for way later.
diff --git a/docs/ideas.md b/docs/ideas.md
new file mode 100644
index 0000000..6015575
--- /dev/null
+++ b/docs/ideas.md
@@ -0,0 +1,366 @@
+# Ideas TODO
+
+This file is now prioritized inline. Priority scale:
+
+- P0: do next; directly improves current feedback loop
+- P1: important after the current loop is usable
+- P2: useful, but only after basics are stable
+- P3: defer or maybe reject
+
+## P0: Make Integration Tests Easy To Run
+
+Status: implemented.
+
+Implemented command:
+
+```powershell
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
+It creates the integration sandbox, sets up the venv, runs validation through setup, runs the task from the generated project directory, and prints the artifact root. Use `--dry-run` to preview the setup and task command.
+
+Running integration tests is still too manual.
+
+Current process:
+
+- install the current version of NightShift
+- run `python -m nightshift.cli integ-run --template tutorial-pastebin --setup`
+- copy the activation line from the output and run it
+- `cd` into the generated directory
+- run the task there, because running from the repo root does not find `nightshift.yaml`
+
+Recommendation: implement a wrapper command, not just a loose script.
+
+Target command:
+
+```powershell
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
+It should:
+
+1. create the integration run
+2. set up the venv
+3. install NightShift from the current checkout
+4. run `nightshift validate`
+5. run the selected task from the generated project directory
+6. print final status and artifact path
+
+Useful variants:
+
+```powershell
+python -m nightshift.cli integ-test --template tutorial-pastebin --all
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-002 --keep 3
+```
+
+The base-directory config issue may not be a core bug, but it is bad UX. The wrapper should handle `cwd` correctly.
+
+## P0/P1: Remove Multi-Candidate Workflow From Default Pastebin
+
+Status: implemented for the default pastebin template and tutorial example.
+
+Original idea:
+
+- The multi-candidate workflow does not add as much as expected.
+- Keep it as an example, maybe `example-multiagent`.
+
+Recommendation: yes. Remove it from the default pastebin tutorial.
+
+Reason:
+
+- Pastebin is becoming the reliability harness.
+- Multi-candidate fallback makes artifacts harder to reason about.
+- It adds model variability while we are still debugging pipeline behavior.
+
+Better split:
+
+```text
+tutorial-pastebin
+tutorial-pastebin-multiagent
+```
+
+or:
+
+```text
+examples/templates/multiagent-fallback
+```
+
+Default pastebin should be boring:
+
+```text
+planner -> semantic_context -> context -> implement -> validate -> test -> review
+```
+
+Use one strong implementer first. Add fallback only in a separate experiment template.
+
+## P1: Add A Qwen3 / 30B Pastebin Variant
+
+Status: implemented as the default pastebin model path using `qwen3-coder:30b`.
+
+Original idea:
+
+- Use a non-coder model for planner roles.
+- Try `qwen3.6:27b` for planning.
+- Use `qwen3-coder:30b` for implementer and code-heavy roles.
+
+Recommendation: viable, but make this a variant, not the default.
+
+kass reply- No lets make this the default. the qwen3-coder:30b is fast now for me for some reason.
+
+Suggested template/config:
+
+```text
+tutorial-pastebin-qwen3
+```
+
+Possible role split:
+
+- planner: `qwen3.6:27b`
+- reviewer/debugger: `qwen3.6:27b`
+- implementer: `qwen3-coder:30b` or exact local 30B coder model name
+
+Important: confirm exact model names with:
+
+```powershell
+ollama list
+```
+
+i did its `qwen3-coder:30b`
+
+Use 30B where it pays:
+
+- first implementation for hard tasks
+- repair after concrete test failure
+- schema/database changes
+- multi-file changes
+
+Do not blindly make every stage 30B if it is slow.
+
+reply: Its not slow now!`qwen3-coder:30b`
+
+## P2: Expose More Model Parameters
+
+Status: implemented for the practical first set.
+
+Supported optional Ollama fields now include `num_ctx`, `num_predict`, `seed`, and `stop`, in addition to existing `temperature`.
+
+Original question:
+
+- What else besides temperature is available?
+- Are any worth optimizing?
+
+Likely useful for Ollama:
+
+- `temperature`
+- `num_ctx`
+- `num_predict`
+- `seed`
+- `stop`
+- maybe `top_p`, `top_k`, `repeat_penalty`
+
+Recommendation: add only a small practical set first.
+
+Useful config shape:
+
+```yaml
+temperature: 0.1
+num_ctx: 8192
+num_predict: 4096
+seed: 1
+```
+
+Most useful:
+
+- `num_ctx`: larger repo/task context
+- `num_predict`: caps runaway output
+- `seed`: reproducibility, if supported consistently
+- `temperature`: already useful; keep low for code
+- `stop`: could help enforce file-block or diff-only contracts
+
+Defer tuning `top_p`, `top_k`, and `repeat_penalty` unless a specific model needs it.
+
+reply: yup lets put this in the nightshift.yaml (optional parameters, if they arent in there that's fine, but we should offer them.)
+
+## P1: Add Test Governance For Generated Tests
+
+Original idea:
+
+- Have a test governance layer for when agents write tests.
+- A reviewer validates alignment with acceptance criteria.
+
+Recommendation: yes, but only for generated-test mode. Do not put generated tests back into default pastebin yet.
+
+The previous failures proved test-writing agents will:
+
+- edit app code
+- import nonexistent modules
+- require undeclared dependencies
+- inspect implementation internals
+- write tests for future behavior
+
+Governance should be deterministic first, model-reviewed second.
+
+Deterministic checks:
+
+- test-writing stage may only touch `tests/`
+- tests compile
+- tests import only allowed public interfaces
+- tests do not import undeclared dependencies
+- tests do not define Flask routes or app implementation
+- test names match current task id or current artifact
+- no future-task keywords unless accepted by current task AC
+
+Then optional model reviewer checks acceptance-criteria alignment.
+
+## P2: Add A Test Analyzer Agent For TDD
+
+Original idea:
+
+- Analyze tests.
+- Translate them into direct instructions for the implementer.
+- Maybe implement using agent YAML definitions without new NightShift features.
+
+Recommendation: viable, but defer until generated tests are stable.
+
+Possible pipeline:
+
+```text
+write_tests -> validate_tests -> analyze_tests -> implement
+```
+
+Analyzer output should be concrete:
+
+```text
+Implementation requirements:
+- create_app(database_path) must return a Flask app.
+- POST /snippets must return 201 and JSON id.
+- GET /snippets/ must return persisted fields.
+
+Do not modify:
+- tests/test_task001.py
+```
+
+This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all pastebin tasks.
+
+## P2/P3: Add A Test Planner
+
+Original idea:
+
+- A test planner understands acceptance criteria and code.
+- Provides input to the next stage about constraints and code, especially for non-TDD.
+
+Recommendation: maybe, but defer.
+
+This overlaps with:
+
+- planner
+- test analyzer
+- test governance
+
+Too many planning-ish stages can make the pipeline bloated and contradictory.
+
+If implemented later, keep it focused:
+
+```text
+test_planner -> write_tests -> test_governance -> implement
+```
+
+For now, fold this idea into the future test governance/analyzer work.
+
+## P1: Add Fixed Tests For All Pastebin Tasks
+
+Status: mostly implemented in the template.
+
+Current fixed tests:
+
+```text
+tests/test_task001.py
+tests/test_task002.py
+tests/test_task003.py
+tests/test_task004.py
+tests/test_task005.py
+```
+
+Important design:
+
+```yaml
+python -m pytest -q tests/test_{task_id_compact}.py
+```
+
+This lets all future task tests exist without breaking earlier tasks.
+
+Next step: validate these through integration runs, one task at a time.
+
+## P1: Add `nightshift integ-report`
+
+Status: implemented as a first-pass artifact summarizer.
+
+New idea.
+
+Summarize latest integration run across tasks:
+
+```text
+TASK-001 complete in 1 retry
+TASK-002 failed at validate_patch
+Root cause: protected tests modified
+Artifacts: ...
+```
+
+Right now we inspect artifacts manually. NightShift should do more of that.
+
+Possible command:
+
+```powershell
+python -m nightshift.cli integ-report --latest
+```
+
+## P1: Add Task-Test Preflight To `validate`
+
+Status: implemented.
+
+`nightshift validate` now renders task command placeholders for every task and fails early if a configured `tests/test_*.py` path is missing.
+
+Partially implemented at run time.
+
+Current behavior:
+
+- task command placeholders can render paths like `tests/test_task002.py`
+- `run_task` preflight fails before invoking agents if the task-specific test file is missing
+
+Better behavior:
+
+```powershell
+nightshift validate
+```
+
+should warn or fail:
+
+```text
+TASK-003 expects tests/test_task003.py and it exists.
+TASK-004 expects tests/test_task004.py and it exists.
+```
+
+This catches missing fixed tests earlier.
+
+## P2: Add Run Comparison
+
+New idea.
+
+Useful once comparing 14B vs 30B:
+
+```powershell
+nightshift compare-runs --latest 5
+```
+
+Show:
+
+- model
+- task
+- retries
+- failure stage
+- final reason
+- runtime
+- token estimate
+
+This should come after `integ-test` and `integ-report`.
+
diff --git a/examples/tutorial/03-pastebin/README.md b/examples/tutorial/03-pastebin/README.md
index fdd9c92..ec07888 100644
--- a/examples/tutorial/03-pastebin/README.md
+++ b/examples/tutorial/03-pastebin/README.md
@@ -1,4 +1,4 @@
-# Tutorial 03: Pastebin With Model Fallback And Telemetry
+# Tutorial 03: Pastebin With Fixed Tests And Telemetry
 
 This tutorial uses the `tutorial-pastebin` template: a small Flask snippet-hosting service designed for deterministic NightShift orchestration tests.
@@ -19,6 +19,12 @@ For an isolated local integration run, use the integration sandbox command from
 python -m nightshift.cli integ-run --template tutorial-pastebin
 ```
 
+To create, set up, validate, and run one task in a single command:
+
+```bash
+python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001
+```
+
 To create the sandbox and set up the Python project immediately:
 
 ```bash
@@ -57,7 +63,7 @@ pyproject.toml
 README.md
 ```
 
-The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed `TASK-001` tests. The default tutorial pipeline asks the implementation agent to make those deterministic tests pass before review.
+The template includes a tiny Flask `create_app(database_path=None)` scaffold and fixed tests for each tutorial task. The default tutorial pipeline asks the implementation agent to make only the current task's deterministic tests pass before review.
 
 ## Prerequisites
 
@@ -73,26 +79,22 @@ Install target dependencies:
 python -m pip install -e . pytest flask
 ```
 
-Install and start Ollama, then pull the fallback models you want available:
+Install and start Ollama, then pull the default pastebin model:
 
 ```bash
-ollama pull qwen2.5-coder:14b
-ollama pull carstenuhlig/omnicoder-9b
-ollama pull deepseek-coder-v2:16b
+ollama pull qwen3-coder:30b
 ollama list
 ```
 
 NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`.
 
-## Model Fallback
+## Model
 
-The implementation stage uses this fallback order:
+The default pastebin pipeline uses one strong local coder model:
 
-1. `qwen2.5-coder:14b`
-2. `carstenuhlig/omnicoder-9b`
-3. `deepseek-coder-v2:16b`
+- `qwen3-coder:30b`
 
-NightShift records which agent/model handled each stage in `telemetry-summary.md`.
+NightShift records which agent/model handled each stage in `telemetry-summary.md`. Multi-candidate fallback belongs in a separate experiment template, not the default pastebin reliability harness.
 
 ## TDD Pipeline
 
diff --git a/examples/tutorial/03-pastebin/nightshift.yaml b/examples/tutorial/03-pastebin/nightshift.yaml
index 76a8dfc..871ebf5 100644
--- a/examples/tutorial/03-pastebin/nightshift.yaml
+++ b/examples/tutorial/03-pastebin/nightshift.yaml
@@ -20,51 +20,49 @@ safety:
     - curl | bash
 
 experiment:
-  label: pastebin-model-fallback
-  prompt_variant: tdd-qwen-omnicoder-deepseek-v2
+  label: pastebin-qwen3-coder
+  prompt_variant: fixed-tests-qwen3-coder-30b-v1
 
 agents:
   planner:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.2
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/planner.md
 
-  implementer_qwen:
+  implementer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/implementer.md
 
   test_writer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/test-writer.md
 
-  implementer_omnicoder:
-    backend: ollama
-    model: carstenuhlig/omnicoder-9b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
-  implementer_deepseek:
-    backend: ollama
-    model: deepseek-coder-v2:16b
-    temperature: 0.1
-    system_prompt: .nightshift/agents/implementer.md
-
   debugger:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     role: debugger
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/debugger.md
 
   reviewer:
     backend: ollama
-    model: qwen2.5-coder:14b
+    model: qwen3-coder:30b
     temperature: 0.1
+    num_ctx: 8192
+    num_predict: 4096
     system_prompt: .nightshift/agents/reviewer.md
 
 pipeline:
@@ -87,10 +85,7 @@ pipeline:
 
     - id: implement
       type: file_writer
-      agent_pool:
-        - implementer_qwen
-        - implementer_omnicoder
-        - implementer_deepseek
+      agent: implementer
       output: proposed.patch
 
     - id: normalize
diff --git a/nightshift/agents.py b/nightshift/agents.py
index d27031c..0c09ab7 100644
--- a/nightshift/agents.py
+++ b/nightshift/agents.py
@@ -228,8 +228,9 @@ class AgentExecutor:
             "prompt": prompt,
             "stream": False,
         }
-        if agent.temperature is not None:
-            body["options"] = {"temperature": agent.temperature}
+        options = _ollama_options(agent)
+        if options:
+            body["options"] = options
         headers = {"Content-Type": "application/json"}
         started = time.monotonic()
         self.logger.event(
@@ -395,6 +396,21 @@ def build_prompt_bundle(
     )
 
 
+def _ollama_options(agent: AgentConfig) -> dict[str, object]:
+    options: dict[str, object] = {}
+    if agent.temperature is not None:
+        options["temperature"] = agent.temperature
+    if agent.num_ctx is not None:
+        options["num_ctx"] = agent.num_ctx
+    if agent.num_predict is not None:
+        options["num_predict"] = agent.num_predict
+    if agent.seed is not None:
+        options["seed"] = agent.seed
+    if agent.stop:
+        options["stop"] = list(agent.stop)
+    return options
+
+
 def _coerce_output(value: str | bytes | None) -> str:
     if value is None:
         return ""
diff --git a/nightshift/cli.py b/nightshift/cli.py
index 2114661..859d0e6 100644
--- a/nightshift/cli.py
+++ b/nightshift/cli.py
@@ -7,13 +7,16 @@ from pathlib import Path
 import sys
 
 from .config import validate_config
-from .errors import NightShiftError
+from .errors import ConfigError, NightShiftError
 from .init import available_templates, init_project
 from .integ import create_integration_run
+from .integ_report import build_integration_report, format_integration_report
 from .integ_setup import format_setup_result, setup_python_project
+from .integ_test import format_integration_test_result, run_integration_test
 from .pipeline import PipelineRunner
 from .runlog import RunLogger
 from .status import build_status, format_status
+from .task_tests import check_task_test_files, format_task_test_checks, missing_task_test_paths
 from .terminal import HOTDOG_ANIMATIONS, TerminalAnimation, format_banner, style_text
 from .tasks import (
     ensure_dependencies_satisfied,
@@ -105,6 +108,33 @@ def build_parser() -> argparse.ArgumentParser:
         help="Print --setup commands without running them.",
     )
 
+    integ_test_parser = subparsers.add_parser(
+        "integ-test",
+        help="Create, set up, validate, and run an integration template task.",
+    )
+    integ_test_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is created.")
+    integ_test_parser.add_argument(
+        "--template",
+        default="tutorial-pastebin",
+        choices=available_templates(),
+        help="Template to initialize inside the sandbox.",
+    )
+    integ_test_parser.add_argument("--task", help="Specific task id to run.")
+    integ_test_parser.add_argument("--all", action="store_true", help="Run all runnable incomplete tasks.")
+    integ_test_parser.add_argument("--keep", type=int, help="Keep only the newest N old integration runs before creating a new one.")
+    integ_test_parser.add_argument(
+        "--setup-extra",
+        action="append",
+        default=["pytest"],
+        help="Extra package to install during setup. May be repeated. Defaults to pytest.",
+    )
+    integ_test_parser.add_argument("--setup-skip-validate", action="store_true", help="Skip validation during setup.")
+    integ_test_parser.add_argument("--dry-run", action="store_true", help="Print commands without running setup or tasks.")
+
+    integ_report_parser = subparsers.add_parser("integ-report", help="Summarize the latest integration run.")
+    integ_report_parser.add_argument("--root", default=".", help="Repository root where integ_runs/ is located.")
+    integ_report_parser.add_argument("--latest", action="store_true", help="Report the latest integration run.")
+
     setup_parser = subparsers.add_parser(
         "integ-setup",
         help="Set up a Python integration project venv and dependencies.",
@@ -160,12 +190,18 @@ def main(argv: list[str] | None = None) -> int:
             config = validate_config(args.config)
             tasks = parse_task_file(config.project.root, config.project.task_file)
             validate_task_dependencies(tasks)
+            task_test_checks = check_task_test_files(config, tasks)
+            missing_task_tests = missing_task_test_paths(task_test_checks)
+            if missing_task_tests:
+                details = format_task_test_checks(task_test_checks)
+                raise ConfigError(f"Config error: missing configured task test files.\n{details}")
             incomplete = sum(1 for task in tasks if not task.completed)
             print(f"Config valid: {config.path}")
             print(f"Project: {config.project.name}")
             print(f"Stages: {len(config.pipeline.stages)}")
             print(f"Tasks: {len(tasks)}")
             print(f"Incomplete tasks: {incomplete}")
+            print(format_task_test_checks(task_test_checks))
             return 0
 
         if args.command == "run":
@@ -256,6 +292,25 @@ def main(argv: list[str] | None = None) -> int:
             print(format_setup_result(result))
             return 0
 
+        if args.command == "integ-test":
+            result = run_integration_test(
+                args.root,
+                template=args.template,
+                task=args.task,
+                all_tasks=args.all,
+                keep=args.keep,
+                setup_extras=tuple(args.setup_extra or ()),
+                skip_setup_validate=args.setup_skip_validate,
+                dry_run=args.dry_run,
+            )
+            print(format_integration_test_result(result))
+            return result.exit_code
+
+        if args.command == "integ-report":
+            report = build_integration_report(args.root, latest=True)
+            print(format_integration_report(report))
+            return 0
+
     except NightShiftError as exc:
         print(str(exc), file=sys.stderr)
         return 1
diff --git a/nightshift/commands.py b/nightshift/commands.py
index ef9312d..eacf05d 100644
--- a/nightshift/commands.py
+++ b/nightshift/commands.py
@@ -5,6 +5,7 @@ from __future__ import annotations
 from dataclasses import dataclass
 import os
 from pathlib import Path
+import re
 import shlex
 import subprocess
 import sys
@@ -68,11 +69,16 @@ class CommandExecutor:
                 command_index=index,
                 command=command,
             )
+            rendered_command = render_command_template(command, task_id)
+            rendered_allowed_commands = tuple(
+                render_command_template(allowed, task_id) for allowed in self.safety.allowed_commands
+            )
             run = self.run_command(
-                command,
+                rendered_command,
                 shell=stage.shell,
                 timeout_seconds=stage.timeout_seconds,
                 working_dir=stage.working_dir,
+                allowed_commands=rendered_allowed_commands,
             )
             runs.append(run)
             self.logger.event(
@@ -120,11 +126,12 @@ class CommandExecutor:
         shell: bool = True,
         timeout_seconds: int | None = None,
         working_dir: Path | None = None,
+        allowed_commands: tuple[str, ...] | None = None,
     ) -> CommandRun:
         try:
             normalized = ensure_command_allowed(
                 command,
-                self.safety.allowed_commands,
+                allowed_commands if allowed_commands is not None else self.safety.allowed_commands,
                 self.safety.forbidden_commands,
             )
         except SafetyError as exc:
@@ -210,6 +217,27 @@ def format_command_runs(stage_id: str, runs: list[CommandRun]) -> str:
     return "\n".join(lines)
 
 
+def render_command_template(command: str, task_id: str) -> str:
+    task_id_lower = task_id.lower()
+    task_id_slug = task_id_lower.replace("-", "_")
+    task_id_compact = task_id_lower.replace("-", "")
+    return command.format(
+        task_id=task_id,
+        task_id_lower=task_id_lower,
+        task_id_slug=task_id_slug,
+        task_id_compact=task_id_compact,
+    )
+
+
+def extract_test_file_paths(command: str) -> tuple[str, ...]:
+    paths: list[str] = []
+    for match in re.finditer(r"(?|&;]+\.py)", command):
+        path = match.group(1).replace("\\", "/")
+        if path not in paths:
+            paths.append(path)
+    return tuple(paths)
+
+
 def _coerce_output(value: str | bytes | None) -> str:
     if value is None:
         return ""
diff --git a/nightshift/config.py b/nightshift/config.py
index 274a167..ade36d2 100644
--- a/nightshift/config.py
+++ b/nightshift/config.py
@@ -46,6 +46,10 @@ class AgentConfig:
     temperature: float | None = None
     base_url: str | None = None
     api_key_env: str | None = None
+    num_ctx: int | None = None
+    num_predict: int | None = None
+    seed: int | None = None
+    stop: tuple[str, ...] = ()
 
 
 @dataclass(frozen=True)
@@ -207,10 +211,18 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
             agent_raw.get("temperature"),
             f"agents.{agent_id}.temperature",
         )
+        num_ctx = _optional_int_or_none(agent_raw.get("num_ctx"), f"agents.{agent_id}.num_ctx")
+        num_predict = _optional_int_or_none(agent_raw.get("num_predict"), f"agents.{agent_id}.num_predict")
+        seed = _optional_int_or_none(agent_raw.get("seed"), f"agents.{agent_id}.seed")
+        stop = _string_tuple(agent_raw.get("stop", []), f"agents.{agent_id}.stop")
         if temperature is not None and temperature < 0:
             raise ConfigError(
                 f"Config error: agents.{agent_id}.temperature must be zero or greater."
             )
+        if num_ctx is not None and num_ctx <= 0:
+            raise ConfigError(f"Config error: agents.{agent_id}.num_ctx must be greater than zero.")
+        if num_predict is not None and num_predict <= 0:
+            raise ConfigError(f"Config error: agents.{agent_id}.num_predict must be greater than zero.")
         if backend not in {"command", "ollama", "openai_compatible"}:
             raise ConfigError(
                 f"Config error: agent '{agent_id}' uses unsupported backend '{backend}'. "
@@ -243,6 +255,10 @@ def parse_config(raw: dict[str, Any], config_path: Path) -> NightShiftConfig:
             temperature=temperature,
             base_url=base_url,
             api_key_env=api_key_env,
+            num_ctx=num_ctx,
+            num_predict=num_predict,
+            seed=seed,
+            stop=stop,
         )
 
     experiment_raw = raw.get("experiment", {})
diff --git a/nightshift/integ_report.py b/nightshift/integ_report.py
new file mode 100644
index 0000000..88e3959
--- /dev/null
+++ b/nightshift/integ_report.py
@@ -0,0 +1,71 @@
+"""Summarize integration run artifacts."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+import re
+
+from .errors import NightShiftError
+
+
+@dataclass(frozen=True)
+class IntegrationReport:
+    integration_run: Path
+    nightshift_run: Path | None
+    lines: tuple[str, ...]
+ + +def build_integration_report(root: str | Path = ".", *, latest: bool = True) -> IntegrationReport: + base = Path(root).resolve() / "integ_runs" + if not base.exists(): + raise NightShiftError(f"Integration report error: no integ_runs directory found: {base}") + runs = sorted((path for path in base.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True) + if not runs: + raise NightShiftError(f"Integration report error: no integration runs found under: {base}") + integration_run = runs[0] if latest else runs[0] + artifacts_root = integration_run / "project" / ".nightshift" / "runs" + if not artifacts_root.exists(): + return IntegrationReport( + integration_run, + None, + ("No NightShift run artifacts found. Setup may have failed before task execution.",), + ) + nightshift_runs = sorted((path for path in artifacts_root.iterdir() if path.is_dir()), key=lambda path: path.name, reverse=True) + if not nightshift_runs: + return IntegrationReport(integration_run, None, ("No NightShift run directories found.",)) + nightshift_run = nightshift_runs[0] + summaries = sorted(nightshift_run.glob("tasks/*/run-summary.md")) + if not summaries and (nightshift_run / "run-summary.md").exists(): + summaries = [nightshift_run / "run-summary.md"] + lines = [_summarize_run_summary(path, integration_run) for path in summaries] + return IntegrationReport(integration_run, nightshift_run, tuple(lines or ("No task summaries found.",))) + + +def format_integration_report(report: IntegrationReport) -> str: + lines = [f"Integration run: {report.integration_run}"] + if report.nightshift_run is not None: + lines.append(f"NightShift run: {report.nightshift_run}") + lines.append("") + lines.extend(f"- {line}" for line in report.lines) + return "\n".join(lines) + + +def _summarize_run_summary(path: Path, integration_run: Path) -> str: + text = path.read_text(encoding="utf-8", errors="replace") + task = _field(text, "Task") or path.parent.name + status = _field(text, "Status") or 
"unknown" + retries = _field(text, "Retry count") or "unknown" + reason = _field(text, "Reason") or "no reason recorded" + try: + relative = path.relative_to(integration_run) + except ValueError: + relative = path + return f"{task} {status} after {retries} retries. Reason: {reason}. Artifacts: {relative.parent}" + + +def _field(text: str, name: str) -> str | None: + match = re.search(rf"^- {re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE) + if not match: + return None + return match.group(1).strip() diff --git a/nightshift/integ_test.py b/nightshift/integ_test.py new file mode 100644 index 0000000..d9b355e --- /dev/null +++ b/nightshift/integ_test.py @@ -0,0 +1,71 @@ +"""End-to-end integration test wrapper.""" + +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path +import subprocess + +from .errors import NightShiftError +from .integ import IntegrationRun, create_integration_run +from .integ_setup import IntegrationSetupResult, setup_python_project + + +@dataclass(frozen=True) +class IntegrationTestResult: + run: IntegrationRun + setup: IntegrationSetupResult + command: tuple[str, ...] + exit_code: int + dry_run: bool + + +def run_integration_test( + root: str | Path = ".", + *, + template: str = "tutorial-pastebin", + task: str | None = None, + all_tasks: bool = False, + keep: int | None = None, + setup_extras: tuple[str, ...] 
= ("pytest",), + skip_setup_validate: bool = False, + dry_run: bool = False, +) -> IntegrationTestResult: + if task and all_tasks: + raise NightShiftError("Integration test error: use either --task or --all, not both.") + if not task and not all_tasks: + raise NightShiftError("Integration test error: provide --task or --all.") + + run = create_integration_run(Path(root), template=template, keep=keep) + project = run.directory / "project" + setup = setup_python_project( + project, + extras=setup_extras, + validate=not skip_setup_validate, + dry_run=dry_run, + ) + command = [str(setup.python), "-m", "nightshift.cli", "run", "--no-animation"] + if all_tasks: + command.append("--all") + else: + command.extend(["--task", task or ""]) + + exit_code = 0 + if not dry_run: + completed = subprocess.run(command, cwd=project, text=True, encoding="utf-8", errors="replace") + exit_code = completed.returncode + return IntegrationTestResult(run, setup, tuple(command), exit_code, dry_run) + + +def format_integration_test_result(result: IntegrationTestResult) -> str: + lines = [ + f"Integration run: {result.run.directory}", + f"Project: {result.run.directory / 'project'}", + f"Venv: {result.run.venv_dir}", + f"Run command: {' '.join(result.command)}", + f"Exit code: {result.exit_code}", + f"Artifacts: {result.run.directory / 'project' / '.nightshift'}", + ] + if result.dry_run: + lines.insert(3, "Dry run: true") + return "\n".join(lines) diff --git a/nightshift/pipeline.py b/nightshift/pipeline.py index 78f9f22..409052e 100644 --- a/nightshift/pipeline.py +++ b/nightshift/pipeline.py @@ -9,7 +9,7 @@ import subprocess from .agents import AgentExecutor from .artifacts import ArtifactStore -from .commands import CommandExecutor +from .commands import CommandExecutor, extract_test_file_paths, render_command_template from .config import COMMAND_STAGE_TYPES, NightShiftConfig, StageConfig from .context import ContextManager from .dependencies import diagnose_python_dependencies, 
format_dependency_diagnostic @@ -145,6 +145,12 @@ class PipelineRunner: index = 0 final_status = "complete" final_reason = "Pipeline completed." + preflight_result = self._preflight_task(task, stages) + if preflight_result: + stage_results.append(preflight_result) + final_status = "failed" + final_reason = preflight_result.reason + index = len(stages) while index < len(stages): stage = stages[index] @@ -248,6 +254,13 @@ class PipelineRunner: "retry-memory.md", summarize_retry_memory(tuple(retry_memory)), ) + if _repeated_protected_path_violation(tuple(retry_memory)): + final_status = "failed" + final_reason = ( + "Escalation policy stopped retries: implementation repeatedly " + "attempted to modify paths outside the stage allowlist." + ) + break decision = evaluate_retry_churn( tuple(retry_memory), retry_budget=self.config.pipeline.max_task_retries + 1, @@ -334,6 +347,45 @@ class PipelineRunner: reason=final_reason, ) + def _preflight_task(self, task: Task, stages: list[StageConfig]) -> StageResult | None: + missing_paths: list[str] = [] + for stage in stages: + if stage.type not in COMMAND_STAGE_TYPES: + continue + for command in stage.commands: + rendered = render_command_template(command, task.id) + for path_text in extract_test_file_paths(rendered): + if not (self.config.project.root / path_text).exists(): + missing_paths.append(path_text) + if not missing_paths: + return None + unique_paths = tuple(dict.fromkeys(missing_paths)) + details = "\n".join(f"- `{path}`" for path in unique_paths) + output_path = self.artifacts.write_stage_output( + task.id, + "preflight.md", + "\n".join( + [ + "# Task Preflight", + "", + "Status: fail", + "Reason: configured task test file is missing.", + "", + "## Missing Files", + "", + details, + "", + ] + ), + ) + return StageResult( + "preflight", + "fail", + "Task preflight failed: configured task test file is missing: " + + ", ".join(unique_paths), + output_path=str(output_path.relative_to(self.config.project.root)), + ) + def 
run_tasks(self, tasks: list[Task] | tuple[Task, ...]) -> MultiTaskResult: self.artifacts.initialize_run() self.logger.bind(self.artifacts) @@ -1428,6 +1480,18 @@ def _extract_exit_code(text: str) -> int | None: return None +def _repeated_protected_path_violation(entries: tuple[RetryMemoryEntry, ...]) -> bool: + recent = entries[-2:] + if len(recent) < 2: + return False + return all(_is_protected_path_violation(entry.cause) for entry in recent) + + +def _is_protected_path_violation(text: str) -> bool: + lowered = text.lower() + return "not allowed for this stage" in lowered and "tests/" in lowered.replace("\\", "/") + + def format_aggregate_run_summary(results: list[PipelineResult], status: str, reason: str) -> str: lines = [ "# Run Summary", diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md index 634b1b1..8d6a542 100644 --- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md @@ -1,9 +1,11 @@ You are the debugger agent for the NightShift pastebin tutorial. Diagnose failed attempts without editing files. -Distinguish inaccurate generated tests from implementation bugs. -If tests are inaccurate for the current task, recommend retrying `write_tests`. +Distinguish fixed-test/template problems from implementation bugs. +This tutorial uses fixed task tests and task-specific pytest commands. Do not recommend `write_tests` unless the configured pipeline actually has a `write_tests` stage. +If a current task appears to lack tests, report a template or test-selection problem. If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified. +Implementation agents must not edit files under `tests/`. 
Return: - concise diagnosis - recommended next action diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md index c466ddc..e7416fe 100644 --- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md @@ -7,8 +7,10 @@ Do not add behavior for future tasks unless needed to satisfy the current tests. Use Flask and `sqlite3` from the Python standard library. Do not use SQLAlchemy, Flask-SQLAlchemy, or undeclared dependencies. Keep the public package name `pastebin_app`. Keep the public app entry point `create_app(database_path: str | None = None)`. +Respect `database_path`; do not hard-code `snippets.db` when a database path is supplied. Tests should interact through HTTP routes and `create_app`, not through ORM/session globals. Do not use `app.before_first_request`; recent Flask versions removed it. Initialize required database tables inside `create_app` or inside the route helper before use. +When adding columns to an existing sqlite table, handle existing databases idempotently with `ALTER TABLE` checks or a simple migration helper. `CREATE TABLE IF NOT EXISTS` does not add columns to an existing table. Output only complete file content blocks. 
Use one fenced block per file: diff --git a/nightshift/project_templates/tutorial-pastebin/README.md b/nightshift/project_templates/tutorial-pastebin/README.md index b18df22..3999b9c 100644 --- a/nightshift/project_templates/tutorial-pastebin/README.md +++ b/nightshift/project_templates/tutorial-pastebin/README.md @@ -14,6 +14,12 @@ Or create an isolated integration sandbox from the NightShift repository root: python -m nightshift.cli integ-run --template tutorial-pastebin ``` +To create, set up, validate, and run one task in a single command: + +```bash +python -m nightshift.cli integ-test --template tutorial-pastebin --task TASK-001 +``` + To create the sandbox and set it up in one step: ```bash @@ -48,12 +54,8 @@ nightshift what-happened When running from an integration sandbox, the same commands are run inside `integ_runs//project`. -The pipeline uses model fallback ordering for implementation attempts: - -1. `qwen2.5-coder:14b` -2. `carstenuhlig/omnicoder-9b` -3. `deepseek-coder-v2:16b` +The default pastebin pipeline uses `qwen3-coder:30b` for planning, implementation, debugging, test review, and final review. It intentionally does not use multi-candidate fallback; pastebin is the deterministic reliability harness. Telemetry artifacts record which agent/model handled each stage and estimate token usage. -This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them. +This template uses fixed task-specific pytest files. The pipeline starts with a skeletal package, implements only the current task, runs `tests/test_{task_id_compact}.py`, and then reviews the result. 
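Note on the README change above: the `tests/test_{task_id_compact}.py` placeholder is expanded per task before the command runs. The sketch below mirrors the `render_command_template` helper added to `nightshift/commands.py` in this diff, so the placeholder names come from the source. The demo command strings are illustrative assumptions only.

```python
def render_command_template(command: str, task_id: str) -> str:
    """Expand task-id placeholders in a configured command string."""
    task_id_lower = task_id.lower()
    return command.format(
        task_id=task_id,                                   # TASK-001
        task_id_lower=task_id_lower,                       # task-001
        task_id_slug=task_id_lower.replace("-", "_"),      # task_001
        task_id_compact=task_id_lower.replace("-", ""),    # task001
    )


if __name__ == "__main__":
    # The per-task pytest command used by the pastebin template.
    print(render_command_template("python -m pytest -q tests/test_{task_id_compact}.py", "TASK-001"))
    # Because str.format is used, literal braces in a command must be doubled.
    print(render_command_template("echo {{literal}} {task_id}", "TASK-001"))
```

Because rendering relies on `str.format`, a configured command that contains unescaped literal braces (for example an inline Python dict) would raise rather than pass through, so such commands likely need `{{` and `}}`.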
diff --git a/nightshift/project_templates/tutorial-pastebin/nightshift.yaml b/nightshift/project_templates/tutorial-pastebin/nightshift.yaml index 76a8dfc..871ebf5 100644 --- a/nightshift/project_templates/tutorial-pastebin/nightshift.yaml +++ b/nightshift/project_templates/tutorial-pastebin/nightshift.yaml @@ -20,51 +20,49 @@ safety: - curl | bash experiment: - label: pastebin-model-fallback - prompt_variant: tdd-qwen-omnicoder-deepseek-v2 + label: pastebin-qwen3-coder + prompt_variant: fixed-tests-qwen3-coder-30b-v1 agents: planner: backend: ollama - model: qwen2.5-coder:14b + model: qwen3-coder:30b temperature: 0.2 + num_ctx: 8192 + num_predict: 4096 system_prompt: .nightshift/agents/planner.md - implementer_qwen: + implementer: backend: ollama - model: qwen2.5-coder:14b + model: qwen3-coder:30b temperature: 0.1 + num_ctx: 8192 + num_predict: 4096 system_prompt: .nightshift/agents/implementer.md test_writer: backend: ollama - model: qwen2.5-coder:14b + model: qwen3-coder:30b temperature: 0.1 + num_ctx: 8192 + num_predict: 4096 system_prompt: .nightshift/agents/test-writer.md - implementer_omnicoder: - backend: ollama - model: carstenuhlig/omnicoder-9b - temperature: 0.1 - system_prompt: .nightshift/agents/implementer.md - - implementer_deepseek: - backend: ollama - model: deepseek-coder-v2:16b - temperature: 0.1 - system_prompt: .nightshift/agents/implementer.md - debugger: backend: ollama - model: qwen2.5-coder:14b + model: qwen3-coder:30b role: debugger temperature: 0.1 + num_ctx: 8192 + num_predict: 4096 system_prompt: .nightshift/agents/debugger.md reviewer: backend: ollama - model: qwen2.5-coder:14b + model: qwen3-coder:30b temperature: 0.1 + num_ctx: 8192 + num_predict: 4096 system_prompt: .nightshift/agents/reviewer.md pipeline: @@ -87,10 +85,7 @@ pipeline: - id: implement type: file_writer - agent_pool: - - implementer_qwen - - implementer_omnicoder - - implementer_deepseek + agent: implementer output: proposed.patch - id: normalize diff --git 
a/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py b/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py index dd2fd92..b72ec4b 100644 --- a/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py +++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task001.py @@ -16,6 +16,7 @@ def test_create_snippet_returns_created_snippet_id(tmp_path): assert response.status_code == 201 data = response.get_json() assert isinstance(data["id"], int) + assert (tmp_path / "snippets.db").exists() def test_view_snippet_returns_persisted_fields(tmp_path): @@ -38,6 +39,7 @@ def test_view_snippet_returns_persisted_fields(tmp_path): "title": "View me", "body": "stored body", } + assert (tmp_path / "snippets.db").exists() def test_view_missing_snippet_returns_404(tmp_path): diff --git a/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py b/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py new file mode 100644 index 0000000..dfe6950 --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task002.py @@ -0,0 +1,50 @@ +from pastebin_app.app import create_app + + +def test_create_snippet_accepts_optional_metadata(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + + response = client.post( + "/snippets", + json={ + "title": "Tagged", + "body": "metadata body", + "language": "python", + "tags": ["alpha", "beta"], + "expires_at": "2030-01-01T00:00:00", + }, + ) + + assert response.status_code == 201 + assert isinstance(response.get_json()["id"], int) + assert (tmp_path / "snippets.db").exists() + + +def test_view_snippet_returns_optional_metadata(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + + created = client.post( + "/snippets", + json={ + "title": "Tagged", + "body": "metadata body", + "language": "python", + "tags": ["alpha", "beta"], + "expires_at": 
"2030-01-01T00:00:00", + }, + ).get_json() + + response = client.get(f"/snippets/{created['id']}") + + assert response.status_code == 200 + assert response.get_json() == { + "id": created["id"], + "title": "Tagged", + "body": "metadata body", + "language": "python", + "tags": ["alpha", "beta"], + "expires_at": "2030-01-01T00:00:00", + } + assert (tmp_path / "snippets.db").exists() diff --git a/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py b/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py new file mode 100644 index 0000000..33656f1 --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task003.py @@ -0,0 +1,47 @@ +from pastebin_app.app import create_app + + +def _create(client, title, body, **metadata): + response = client.post("/snippets", json={"title": title, "body": body, **metadata}) + assert response.status_code == 201 + return response.get_json()["id"] + + +def test_list_snippets_newest_first(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + + first_id = _create(client, "First", "older") + second_id = _create(client, "Second", "newer") + + response = client.get("/snippets") + + assert response.status_code == 200 + ids = [snippet["id"] for snippet in response.get_json()] + assert ids[:2] == [second_id, first_id] + + +def test_search_filters_by_title_or_body(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + _create(client, "Python note", "ordinary body") + _create(client, "Other", "contains needle") + + response = client.get("/snippets?q=python") + assert [snippet["title"] for snippet in response.get_json()] == ["Python note"] + + response = client.get("/snippets?q=needle") + assert [snippet["title"] for snippet in response.get_json()] == ["Other"] + + +def test_language_and_tag_filters(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = 
app.test_client() + _create(client, "Python", "body", language="python", tags=["code", "demo"]) + _create(client, "Text", "body", language="text", tags=["notes"]) + + response = client.get("/snippets?language=python") + assert [snippet["title"] for snippet in response.get_json()] == ["Python"] + + response = client.get("/snippets?tag=notes") + assert [snippet["title"] for snippet in response.get_json()] == ["Text"] diff --git a/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py b/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py new file mode 100644 index 0000000..e6471bf --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task004.py @@ -0,0 +1,43 @@ +from pastebin_app.app import create_app + + +def test_expired_snippets_are_excluded_from_listing(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + client.post( + "/snippets", + json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"}, + ) + active = client.post( + "/snippets", + json={"title": "Active", "body": "new", "expires_at": "2999-01-01T00:00:00"}, + ).get_json() + + response = client.get("/snippets") + + assert response.status_code == 200 + assert [snippet["id"] for snippet in response.get_json()] == [active["id"]] + + +def test_direct_lookup_of_expired_snippet_returns_410(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + expired = client.post( + "/snippets", + json={"title": "Expired", "body": "old", "expires_at": "2000-01-01T00:00:00"}, + ).get_json() + + response = client.get(f"/snippets/{expired['id']}") + + assert response.status_code == 410 + + +def test_non_expiring_snippet_remains_visible(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + created = client.post("/snippets", json={"title": "Forever", "body": "body"}).get_json() + + response = 
client.get(f"/snippets/{created['id']}") + + assert response.status_code == 200 + assert response.get_json()["title"] == "Forever" diff --git a/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py b/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py new file mode 100644 index 0000000..dba7a8f --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/tests/test_task005.py @@ -0,0 +1,46 @@ +from pastebin_app.app import create_app + + +def test_root_shows_snippet_list_html(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + client.post("/snippets", json={"title": "Visible", "body": "body"}) + + response = client.get("/") + + assert response.status_code == 200 + assert "Visible" in response.get_data(as_text=True) + + +def test_new_snippet_form_loads(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + + response = client.get("/new") + + assert response.status_code == 200 + html = response.get_data(as_text=True) + assert 'name="title"' in html + assert 'name="body"' in html + assert 'name="language"' in html + assert 'name="tags"' in html + assert 'name="expires_at"' in html + + +def test_form_post_redirects_to_snippet_view(tmp_path): + app = create_app(database_path=str(tmp_path / "snippets.db")) + client = app.test_client() + + response = client.post( + "/new", + data={ + "title": "Form title", + "body": "Form body", + "language": "text", + "tags": "forms,html", + "expires_at": "", + }, + ) + + assert response.status_code == 302 + assert response.headers["Location"].endswith("/snippets/1") diff --git a/nightshift/task_tests.py b/nightshift/task_tests.py new file mode 100644 index 0000000..16f4ed3 --- /dev/null +++ b/nightshift/task_tests.py @@ -0,0 +1,48 @@ +"""Task-specific test file validation.""" + +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path + +from .commands 
import extract_test_file_paths, render_command_template +from .config import COMMAND_STAGE_TYPES, NightShiftConfig +from .tasks import Task + + +@dataclass(frozen=True) +class TaskTestCheck: + task_id: str + path: str + exists: bool + + +def check_task_test_files(config: NightShiftConfig, tasks: tuple[Task, ...] | list[Task]) -> tuple[TaskTestCheck, ...]: + checks: list[TaskTestCheck] = [] + for task in tasks: + seen: set[str] = set() + for stage in config.pipeline.stages: + if stage.type not in COMMAND_STAGE_TYPES: + continue + for command in stage.commands: + rendered = render_command_template(command, task.id) + for path_text in extract_test_file_paths(rendered): + if path_text in seen: + continue + seen.add(path_text) + checks.append(TaskTestCheck(task.id, path_text, (config.project.root / path_text).exists())) + return tuple(checks) + + +def format_task_test_checks(checks: tuple[TaskTestCheck, ...]) -> str: + if not checks: + return "Task test files: no task-specific test paths detected." 
+ lines = ["Task test files:"] + for check in checks: + status = "ok" if check.exists else "missing" + lines.append(f"- {check.task_id}: {check.path} ({status})") + return "\n".join(lines) + + +def missing_task_test_paths(checks: tuple[TaskTestCheck, ...]) -> tuple[Path, ...]: + return tuple(Path(check.path) for check in checks if not check.exists) diff --git a/tests/test_commands.py b/tests/test_commands.py index f17e231..e6c2b81 100644 --- a/tests/test_commands.py +++ b/tests/test_commands.py @@ -6,6 +6,7 @@ from nightshift.artifacts import ArtifactStore from nightshift.commands import CommandExecutor from nightshift.commands import CommandRun, format_command_runs from nightshift.commands import _command_env +from nightshift.commands import render_command_template from nightshift.config import SafetyConfig, StageConfig from nightshift.errors import CommandError import sys @@ -16,6 +17,13 @@ FAILING_COMMAND = 'python -c "import sys; print(\'bad\'); sys.exit(7)"' class CommandExecutorTests(unittest.TestCase): + def test_render_command_template_includes_task_id_variants(self) -> None: + command = "python -m pytest -q tests/test_{task_id_compact}.py # {task_id_slug} {task_id}" + + rendered = render_command_template(command, "TASK-001") + + self.assertEqual(rendered, "python -m pytest -q tests/test_task001.py # task_001 TASK-001") + def test_passing_command_stage_returns_pass_and_writes_output(self) -> None: with tempfile.TemporaryDirectory() as directory: root = Path(directory) @@ -46,6 +54,33 @@ class CommandExecutorTests(unittest.TestCase): self.assertIn("Exit code: 0", output) self.assertIn("ok", output) + def test_command_stage_renders_task_id_before_allowlist_check(self) -> None: + with tempfile.TemporaryDirectory() as directory: + root = Path(directory) + artifacts = ArtifactStore(root, ".nightshift", run_id="test-run") + executor = CommandExecutor( + root, + SafetyConfig( + require_clean_worktree=False, + scoped_paths=(".",), + allowed_commands=('python -c 
"print(\'{task_id_compact}\')"',), + forbidden_commands=("rm -rf",), + ), + artifacts, + ) + stage = StageConfig( + id="test", + type="command", + commands=('python -c "print(\'{task_id_compact}\')"',), + output="test-output.txt", + ) + + result = executor.run_stage(stage, "TASK-002") + + self.assertEqual(result.status, "pass") + output = (root / result.output_path).read_text(encoding="utf-8") + self.assertIn("task002", output) + def test_failing_command_stage_returns_fail_and_writes_output(self) -> None: with tempfile.TemporaryDirectory() as directory: root = Path(directory) diff --git a/tests/test_config.py b/tests/test_config.py index e43e1a9..865219e 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -282,6 +282,27 @@ class ConfigTests(unittest.TestCase): self.assertEqual(config.agents["planner"].temperature, 0.2) + def test_agent_ollama_options_load(self) -> None: + with tempfile.TemporaryDirectory() as directory: + root = Path(directory) + init_project(root) + config_path = root / "nightshift.yaml" + config_path.write_text( + config_path.read_text(encoding="utf-8").replace( + " system_prompt: agents/planner.md", + " system_prompt: agents/planner.md\n num_ctx: 8192\n num_predict: 4096\n seed: 1\n stop:\n - STOP", + 1, + ), + encoding="utf-8", + ) + + config = load_config(config_path) + + self.assertEqual(config.agents["planner"].num_ctx, 8192) + self.assertEqual(config.agents["planner"].num_predict, 4096) + self.assertEqual(config.agents["planner"].seed, 1) + self.assertEqual(config.agents["planner"].stop, ("STOP",)) + def test_agent_temperature_must_be_number(self) -> None: with tempfile.TemporaryDirectory() as directory: root = Path(directory) diff --git a/tests/test_init.py b/tests/test_init.py index db1d2d9..1724050 100644 --- a/tests/test_init.py +++ b/tests/test_init.py @@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase): self.assertIn("tutorial-imageboard", available_templates()) self.assertIn("tutorial-pastebin", 
available_templates())
 
-    def test_init_pastebin_template_creates_skeleton_and_model_fallback_config(self) -> None:
+    def test_init_pastebin_template_creates_skeleton_and_qwen3_config(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)
 
@@ -78,11 +78,15 @@ class InitProjectTests(unittest.TestCase):
             self.assertIn("type: semantic_context", config)
             self.assertNotIn("id: write_tests", config)
             self.assertNotIn("id: review_tests", config)
-            self.assertIn("python -m pytest -q tests", config)
+            self.assertIn("python -m pytest -q tests/test_{task_id_compact}.py", config)
             self.assertIn("max_task_retries: 6", config)
-            self.assertIn("implementer_qwen", config)
-            self.assertIn("carstenuhlig/omnicoder-9b", config)
-            self.assertIn("deepseek-coder-v2:16b", config)
+            self.assertIn("implementer:", config)
+            self.assertIn("qwen3-coder:30b", config)
+            self.assertIn("num_ctx: 8192", config)
+            self.assertIn("num_predict: 4096", config)
+            self.assertNotIn("agent_pool:", config)
+            self.assertNotIn("carstenuhlig/omnicoder-9b", config)
+            self.assertNotIn("deepseek-coder-v2:16b", config)
 
     def test_pastebin_example_tutorial_docs_exist(self) -> None:
         root = Path(__file__).resolve().parents[1]
diff --git a/tests/test_integ_test.py b/tests/test_integ_test.py
new file mode 100644
index 0000000..803c8f2
--- /dev/null
+++ b/tests/test_integ_test.py
@@ -0,0 +1,51 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.integ_report import build_integration_report, format_integration_report
+from nightshift.integ_test import format_integration_test_result, run_integration_test
+
+
+class IntegrationTestCommandTests(unittest.TestCase):
+    def test_run_integration_test_dry_run_builds_task_command(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            result = run_integration_test(
+                directory,
+                template="tutorial-pastebin",
+                task="TASK-001",
+                dry_run=True,
+            )
+
+            rendered = format_integration_test_result(result)
+            self.assertIn("Dry run: true", rendered)
+            self.assertIn("TASK-001", " ".join(result.command))
+            self.assertTrue((result.run.directory / "project" / "nightshift.yaml").exists())
+
+    def test_build_integration_report_summarizes_latest_task_summary(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            summary = root / "integ_runs" / "20260521T000000.000000Z" / "project" / ".nightshift" / "runs" / "run1" / "tasks" / "TASK-001" / "run-summary.md"
+            summary.parent.mkdir(parents=True)
+            summary.write_text(
+                "\n".join(
+                    [
+                        "# Run Summary",
+                        "",
+                        "- Task: TASK-001",
+                        "- Status: complete",
+                        "- Retry count: 1",
+                        "- Reason: Done.",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+
+            report = build_integration_report(root)
+            rendered = format_integration_report(report)
+
+            self.assertIn("TASK-001 complete after 1 retries", rendered)
+            self.assertIn("Reason: Done.", rendered)
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py
index 113c703..8d6ed32 100644
--- a/tests/test_pipeline.py
+++ b/tests/test_pipeline.py
@@ -105,6 +105,29 @@ class PipelineRunnerTests(unittest.TestCase):
             )
             self.assertIn("Modified Files", (root / ".nightshift" / "runs" / "test-run" / "run-summary.md").read_text(encoding="utf-8"))
 
+    def test_task_preflight_fails_when_task_specific_test_file_is_missing(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            _write_common_files(root)
+            stages = (
+                StageConfig(
+                    id="test",
+                    type="command",
+                    commands=("python -m pytest -q tests/test_{task_id_compact}.py",),
+                    output="test-output.txt",
+                ),
+            )
+            config = make_config(root, stages, max_retries=0)
+            runner = PipelineRunner(config, ArtifactStore(root, ".nightshift", run_id="test-run"))
+            task = parse_tasks(TASK_MD)[0]
+
+            result = runner.run_task(task)
+
+            self.assertEqual(result.status, "failed")
+            self.assertIn("configured task test file is missing", result.reason)
+            task_dir = root / ".nightshift" / "runs" / "test-run" / "tasks" / task.id
+            self.assertIn("tests/test_task001.py", (task_dir / "preflight.md").read_text(encoding="utf-8"))
+
     def test_review_can_retry_implementation_until_limit(self) -> None:
         with tempfile.TemporaryDirectory() as directory:
             root = Path(directory)
diff --git a/tests/test_task_tests.py b/tests/test_task_tests.py
new file mode 100644
index 0000000..71fd9a5
--- /dev/null
+++ b/tests/test_task_tests.py
@@ -0,0 +1,77 @@
+from pathlib import Path
+import tempfile
+import unittest
+
+from nightshift.config import validate_config
+from nightshift.task_tests import check_task_test_files, missing_task_test_paths
+from nightshift.tasks import parse_task_file
+
+
+class TaskTestValidationTests(unittest.TestCase):
+    def test_check_task_test_files_renders_task_placeholder(self) -> None:
+        with tempfile.TemporaryDirectory() as directory:
+            root = Path(directory)
+            (root / "agents").mkdir()
+            (root / "agents" / "planner.md").write_text("Prompt", encoding="utf-8")
+            (root / "tests").mkdir()
+            (root / "tests" / "test_task001.py").write_text("def test_ok():\n assert True\n", encoding="utf-8")
+            (root / "nightshift.yaml").write_text(
+                "\n".join(
+                    [
+                        "project:",
+                        "  name: task-test-validation",
+                        "  root: .",
+                        "  task_file: tasks.md",
+                        "  artifact_dir: .nightshift",
+                        "",
+                        "safety:",
+                        "  require_clean_worktree: false",
+                        "  scoped_paths:",
+                        "    - .",
+                        "  allowed_commands:",
+                        "    - python -m pytest -q tests/test_{task_id_compact}.py",
+                        "  forbidden_commands:",
+                        "    - rm -rf",
+                        "",
+                        "agents:",
+                        "  planner:",
+                        "    backend: command",
+                        "    command: python -c \"print('ok')\"",
+                        "    system_prompt: agents/planner.md",
+                        "",
+                        "pipeline:",
+                        "  stages:",
+                        "    - id: test",
+                        "      type: command",
+                        "      commands:",
+                        "        - python -m pytest -q tests/test_{task_id_compact}.py",
+                    ]
+                ),
+                encoding="utf-8",
+            )
+            (root / "tasks.md").write_text(
+                """# Tasks
+
+- [ ] TASK-001: One
+
+Acceptance Criteria:
+- passes
+
+- [ ] TASK-002: Two
+
+Acceptance Criteria:
+- reports missing test
+""",
+                encoding="utf-8",
+            )
+
+            config = validate_config(root / "nightshift.yaml")
+            tasks = parse_task_file(config.project.root, config.project.task_file)
+            checks = check_task_test_files(config, tasks)
+
+            self.assertEqual([check.path for check in checks], ["tests/test_task001.py", "tests/test_task002.py"])
+            self.assertEqual(tuple(path.as_posix() for path in missing_task_test_paths(checks)), ("tests/test_task002.py",))
+
+
+if __name__ == "__main__":
+    unittest.main()
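
Note on the `{task_id_compact}` placeholder: the tests in this diff expect `TASK-001` to render as `tests/test_task001.py` and `TASK-002` as `tests/test_task002.py`. The sketch below shows one way such a rendering could work. It is an assumption for illustration only, not NightShift's actual implementation; `compact_task_id` and `render_test_command` are hypothetical names, and the rule "lowercase, then drop non-alphanumeric characters" is inferred solely from those two expected paths.

```python
import re


def compact_task_id(task_id: str) -> str:
    """Hypothetical helper: lowercase the ID and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", task_id.lower())


def render_test_command(template: str, task_id: str) -> str:
    """Hypothetical helper: fill the {task_id_compact} placeholder in a command template."""
    return template.replace("{task_id_compact}", compact_task_id(task_id))


if __name__ == "__main__":
    template = "python -m pytest -q tests/test_{task_id_compact}.py"
    print(render_test_command(template, "TASK-001"))
```

Using plain `str.replace` rather than `str.format` in this sketch avoids errors if a command template happens to contain other braces.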