diff --git a/docs/codex/20260520-203827.md b/docs/codex/20260520-203827.md new file mode 100644 index 0000000..8b055ed --- /dev/null +++ b/docs/codex/20260520-203827.md @@ -0,0 +1,73 @@ +# NightShift Integration Failure Analysis + +## Immediate Causes + +I would separate the failures into four buckets: + +1. The pastebin template is not truly incremental. + `tests/test_pastebin.py` already tests listing/filtering and expiration, even though `TASK-001` only asks for create/view. The stock app also already has a fairly complete `create_app` implementation. So the task is not "build feature 1"; it is "modify an already-complete app without breaking future-task behavior." + +2. The retry stop policy is harsher than the config implies. + Even with `stop_on_repeated_failure_signature_after: 6`, `nightshift/escalation.py` unconditionally stops after the last 3 entries have the same stage and cause. That explains the "same stage same reason" stop before the configured repeated-signature threshold. + +3. The model got bad or insufficient context early. + In the run artifacts, the planner asked for `app/models.py` and `app/routes.py`, both outside the actual scoped repo. That pushed it toward a hallucinated Flask/SQLAlchemy architecture. Later repairs added `tests/test_snippets.py` importing nonexistent `app`, then tried to repair by deleting large amounts of code, which patch validation correctly rejected. + +4. The template and manual deletion created contradictory state. + In the latest project, `src/pastebin_app/__init__.py` imports `create_app`, but `src/pastebin_app/app.py` no longer defines it. `tests/test_pastebin.py` is now empty, while generated `tests/test_snippets.py` expects a different app shape. That is exactly the kind of broken intermediate state a local model will churn on unless the orchestrator gives it a very explicit recovery path. 
+ +## On Pre-Generated Code + +I agree with your instinct: for this tutorial, pre-generated app code is hurting more than helping. + +A better template would include: + +- `pyproject.toml` +- package directories and empty `__init__.py` +- minimal templates if the task needs HTML later +- no complete app logic +- no future-task tests active during `TASK-001` +- a small `tests/test_task001.py` for only create/view + +Then `TASK-002` adds list/filter tests, `TASK-003` adds expiration tests, etc. The AI should build forward, not preserve a hidden completed app. + +## Why Claude/Codex Feel Different + +Production coding agents usually have an inner loop: + +- inspect files +- edit narrowly +- run targeted tests +- read exact failure +- inspect more files +- edit again +- rerun + +NightShift currently has a coarser loop: generate one patch, normalize, apply, run tests, summarize, retry. That is auditable, but it means each retry is another sampled patch rather than an interactive repair session. Swapping models does not fix bad task shape, bad context, or contradictory repo state. + +## Best Options + +Option A: fix the current design conservatively. + +- Remove pre-generated pastebin app logic. +- Split tests by task. +- Run only task-relevant tests during the task, then full suite after success. +- Move deterministic repo context before planning, or at least always include file tree plus full contents of likely target files. +- Make churn stopping obey config; do not hard-stop after 3 same-stage failures unless configured. +- Improve retry signatures to ignore pytest cache warnings and prefer project traceback lines. + +Option B: add a real repair micro-loop. + +For command/test failures, run a bounded repair loop before consuming another global retry: + +```text +failure -> classify -> inspect exact files -> produce small patch -> run targeted test -> repeat 2-4 times +``` + +That would make NightShift behave more like Codex/Claude while preserving artifacts. 
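Reviewer sketch of the Option B flow above, under stated assumptions: `bounded_repair_loop`, `RepairOutcome`, and both callables are hypothetical names for illustration and are not part of NightShift. The point is only that the inner loop has its own small budget and returns the last failure to the caller, so a global retry is consumed only after the bounded loop gives up.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepairOutcome:
    fixed: bool
    attempts: int
    last_failure: Optional[str]


def bounded_repair_loop(
    run_targeted_test: Callable[[], Optional[str]],
    propose_and_apply_patch: Callable[[str], None],
    max_attempts: int = 3,
) -> RepairOutcome:
    """Hypothetical inner loop: test, patch narrowly, retest, within a fixed budget.

    run_targeted_test returns None on success or a failure description.
    propose_and_apply_patch receives that description and applies one small patch.
    """
    failure = run_targeted_test()
    attempts = 0
    while failure is not None and attempts < max_attempts:
        propose_and_apply_patch(failure)  # one narrow edit per iteration
        attempts += 1
        failure = run_targeted_test()  # rerun only the targeted test
    return RepairOutcome(fixed=failure is None, attempts=attempts, last_failure=failure)
```

If the loop returns `fixed=False`, the orchestrator could record `last_failure` in its retry memory and only then spend a global retry.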
+ +Option C: delegate hard repairs to production agent backends. + +Add a `codex`/`claude-code` backend stage for implementation/repair. NightShift still owns task selection, safety, artifacts, tests, and reports, but lets a stronger tool run the inner edit/test loop. + +My recommendation: do A first, then B. The template/task mismatch is the largest avoidable failure source, and the unconditional churn stop is a real policy bug. Once those are fixed, the remaining failures will be much more informative. diff --git a/examples/tutorial/03-pastebin/README.md b/examples/tutorial/03-pastebin/README.md index 20c6abd..bec55ed 100644 --- a/examples/tutorial/03-pastebin/README.md +++ b/examples/tutorial/03-pastebin/README.md @@ -44,6 +44,7 @@ nightshift.yaml .nightshift/ agents/ planner.md + test-writer.md implementer.md debugger.md reviewer.md @@ -56,7 +57,7 @@ pyproject.toml README.md ``` -The template includes a working baseline Flask app and deterministic pytest suite. NightShift tasks then extend or verify app behavior in small increments. +The template intentionally does not include a working Flask app or pre-generated task tests. For each task, NightShift first generates acceptance tests from the current task's acceptance criteria, reviews those tests for scope, and then asks the implementation agent to make them pass. ## Prerequisites @@ -85,7 +86,7 @@ NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`. ## Model Fallback -The template's implementation stage uses this fallback order: +The template writes tests with `qwen2.5-coder:14b`. The implementation stage uses this fallback order: 1. `qwen2.5-coder:14b` 2. `carstenuhlig/omnicoder-9b` @@ -93,6 +94,16 @@ The template's implementation stage uses this fallback order: NightShift records which agent/model handled each stage in `telemetry-summary.md`. 
+## TDD Pipeline + +The task pipeline runs in this shape: + +```text +plan -> semantic_context -> context -> write_tests -> review_tests -> implement -> pytest -> review +``` + +Generated tests should cover only the current task. They are expected to fail before implementation, so the pipeline reviews the test patch but does not run pytest until after the implementation patch is applied. + ## Task Plan The template writes the full task list to `.nightshift/tasks.md`. A copy is included here as [tasks.md](tasks.md). diff --git a/examples/tutorial/03-pastebin/nightshift.yaml b/examples/tutorial/03-pastebin/nightshift.yaml index d4f3172..431d783 100644 --- a/examples/tutorial/03-pastebin/nightshift.yaml +++ b/examples/tutorial/03-pastebin/nightshift.yaml @@ -21,7 +21,7 @@ safety: experiment: label: pastebin-model-fallback - prompt_variant: qwen-omnicoder-deepseek-v1 + prompt_variant: tdd-qwen-omnicoder-deepseek-v2 agents: planner: @@ -36,6 +36,12 @@ agents: temperature: 0.1 system_prompt: .nightshift/agents/implementer.md + test_writer: + backend: ollama + model: qwen2.5-coder:14b + temperature: 0.1 + system_prompt: .nightshift/agents/test-writer.md + implementer_omnicoder: backend: ollama model: carstenuhlig/omnicoder-9b @@ -62,7 +68,8 @@ agents: system_prompt: .nightshift/agents/reviewer.md pipeline: - max_task_retries: 3 + max_task_retries: 6 + stop_on_repeated_failure_signature_after: 6 continue_on_task_failure: false stages: - id: plan @@ -78,6 +85,35 @@ pipeline: type: repo_context output: context-pack.md + - id: write_tests + type: file_writer + agent: test_writer + output: proposed-tests.patch + + - id: normalize_tests + type: patch_normalizer + output: normalized-tests.patch + + - id: validate_tests_patch + type: patch_validator + output: test-patch-validation.md + max_files: 6 + max_lines: 500 + max_delete_ratio: 0.70 + on_fail: write_tests + + - id: apply_tests_patch + type: patch_apply + mode: apply + output: test-patch-apply-output.txt + on_fail: 
write_tests + + - id: review_tests + type: agent_review + agent: reviewer + output: test-review.md + on_fail: write_tests + - id: implement type: file_writer agent_pool: diff --git a/nightshift/escalation.py b/nightshift/escalation.py index 23d5157..c7d4ad6 100644 --- a/nightshift/escalation.py +++ b/nightshift/escalation.py @@ -22,17 +22,17 @@ def evaluate_retry_churn( ) -> EscalationDecision: if len(entries) < 2: return EscalationDecision(False, "continue", "Not enough retry history for churn detection.") - recent = entries[-3:] - same_stage = len({entry.stage_id for entry in recent}) == 1 - same_cause = len({entry.cause for entry in recent}) == 1 - recent_signatures = [entry.failure_signature for entry in entries[-2:] if entry.failure_signature] - same_signature = len(recent_signatures) == 2 and len(set(recent_signatures)) == 1 + churn_threshold = repeated_signature_after if repeated_signature_after and repeated_signature_after > 0 else 3 + signature_window = entries[-churn_threshold:] if len(entries) >= churn_threshold else () + recent_signatures = [entry.failure_signature for entry in signature_window if entry.failure_signature] + same_signature = len(recent_signatures) == churn_threshold and len(set(recent_signatures)) == 1 + stage_cause_window = entries[-churn_threshold:] if len(entries) >= churn_threshold else () + same_stage = bool(stage_cause_window) and len({entry.stage_id for entry in stage_cause_window}) == 1 + same_cause = bool(stage_cause_window) and len({entry.cause for entry in stage_cause_window}) == 1 if len(entries) >= retry_budget and retry_budget > 0: return EscalationDecision(True, "human review", "Configured retry budget is exhausted.") if ( - repeated_signature_after is not None - and repeated_signature_after > 0 - and len(entries) >= repeated_signature_after + len(entries) >= churn_threshold and same_signature ): return EscalationDecision( @@ -40,7 +40,7 @@ def evaluate_retry_churn( "debugger review or larger model", "The same failure 
signature repeated on consecutive retries.", ) - if len(recent) == 3 and same_stage and same_cause: + if len(entries) >= churn_threshold and same_stage and same_cause: return EscalationDecision(True, "debugger review or larger model", "The same stage is failing with the same reason repeatedly.") return EscalationDecision(False, "continue", "No retry churn detected.") diff --git a/nightshift/pipeline.py b/nightshift/pipeline.py index 863d213..98fab13 100644 --- a/nightshift/pipeline.py +++ b/nightshift/pipeline.py @@ -251,8 +251,7 @@ class PipelineRunner: decision = evaluate_retry_churn( tuple(retry_memory), retry_budget=self.config.pipeline.max_task_retries + 1, - repeated_signature_after=self.config.pipeline.stop_on_repeated_failure_signature_after - or self.config.pipeline.max_task_retries, + repeated_signature_after=self.config.pipeline.stop_on_repeated_failure_signature_after, ) self.artifacts.write_stage_output( task.id, @@ -592,8 +591,8 @@ class PipelineRunner: f"# Implementation Summary\n\nStatus: fail\nReason: {exc}\n", ) return StageResult(stage.id, "fail", str(exc), output_path=result.output_path) - patch_filename = "repair-{0}.patch".format(retry_count) if retry_count else (stage.output or "proposed.patch") - summary_filename = "implementation-summary.md" if retry_count == 0 else f"repair-summary-{retry_count}.md" + patch_filename = _writer_patch_filename(stage, retry_count) + summary_filename = _writer_summary_filename(stage, retry_count) proposed_path = self.artifacts.write_stage_output(task.id, patch_filename, patch) summary_path = self.artifacts.write_stage_output( task.id, @@ -728,7 +727,7 @@ class PipelineRunner: try: patch = normalize_patch_text(stdout) except PipelineError: - summary_filename = "implementation-summary.md" if retry_count == 0 else f"repair-summary-{retry_count}.md" + summary_filename = _writer_summary_filename(stage, retry_count) reason = str(exc) if "generated patch has no changes" in reason: next_stage = 
self._stage_after_patch_flow(stage.id) @@ -758,8 +757,8 @@ class PipelineRunner: patch_reason = "Fallback patch written from unified diff output." log_message = "Wrote fallback patch from unified diff output" break - patch_filename = "repair-{0}.patch".format(retry_count) if retry_count else (stage.output or "proposed.patch") - summary_filename = "implementation-summary.md" if retry_count == 0 else f"repair-summary-{retry_count}.md" + patch_filename = _writer_patch_filename(stage, retry_count) + summary_filename = _writer_summary_filename(stage, retry_count) proposed_path = self.artifacts.write_stage_output(task.id, patch_filename, patch) summary_path = self.artifacts.write_stage_output( task.id, @@ -1381,6 +1380,21 @@ def _latest_patch_like_output(previous_outputs: dict[str, str]) -> str: raise PipelineError("Patch error: no previous patch output found.") +def _writer_patch_filename(stage: StageConfig, retry_count: int) -> str: + if retry_count <= 0: + return stage.output or "proposed.patch" + if stage.type == "code_writer" or stage.id == "implement": + return f"repair-{retry_count}.patch" + return _attempt_filename(stage.output or f"{stage.id}.patch", retry_count) + + +def _writer_summary_filename(stage: StageConfig, retry_count: int) -> str: + if stage.type == "code_writer" or stage.id == "implement": + return "implementation-summary.md" if retry_count <= 0 else f"repair-summary-{retry_count}.md" + base = f"{stage.id}-summary.md" + return base if retry_count <= 0 else _attempt_filename(base, retry_count) + + def _attempt_filename(filename: str, retry_count: int) -> str: if retry_count <= 0: return filename diff --git a/nightshift/project_templates/basic/nightshift.yaml b/nightshift/project_templates/basic/nightshift.yaml index 13f5f86..616681a 100644 --- a/nightshift/project_templates/basic/nightshift.yaml +++ b/nightshift/project_templates/basic/nightshift.yaml @@ -38,7 +38,7 @@ agents: system_prompt: agents/debugger.md pipeline: - max_task_retries: 3 + 
max_task_retries: 6 stages: - id: plan type: agent diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md index 1b58041..634b1b1 100644 --- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/debugger.md @@ -1,6 +1,9 @@ You are the debugger agent for the NightShift pastebin tutorial. Diagnose failed attempts without editing files. +Distinguish inaccurate generated tests from implementation bugs. +If tests are inaccurate for the current task, recommend retrying `write_tests`. +If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified. Return: - concise diagnosis - recommended next action diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md index 7002c42..818eb9d 100644 --- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/implementer.md @@ -1,5 +1,11 @@ You are the implementation agent for the NightShift pastebin tutorial. +Implement the smallest application change that satisfies the current task and the generated tests. +Do not rewrite generated tests unless the retry context explicitly says they are inaccurate. +Do not add behavior for future tasks unless needed to satisfy the current tests. +Use Flask and sqlite from the standard library unless existing project files already introduce another framework. +Keep the public package name `pastebin_app`. + Output only complete file content blocks. 
Use one fenced block per file: ```file:relative/path.py diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/planner.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/planner.md index a6d8658..47c0f67 100644 --- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/planner.md +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/planner.md @@ -1,7 +1,14 @@ You are the planning agent for the NightShift pastebin tutorial. -Create a concise implementation plan for the current task. +Create a concise TDD implementation plan for the current task. + +Plan in this order: +1. Which acceptance tests should be generated for only this task. +2. Which application files likely need to change. +3. The smallest implementation slice that should make those tests pass. If repository context is needed, request it with lookup_requests. Prefer small edits and deterministic tests. +Do not assume files outside the configured scoped paths exist. +Do not propose SQLAlchemy unless existing repository files already use it. Do not write code. diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/reviewer.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/reviewer.md index 39606f4..72c292a 100644 --- a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/reviewer.md +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/reviewer.md @@ -1,5 +1,8 @@ You are the review agent for the NightShift pastebin tutorial. +When reviewing generated tests, check that they map only to the current task acceptance criteria and do not require future-task behavior. +When reviewing implementation, check that the change is small, deterministic, and satisfies the generated tests without unrelated rewrites. 
+ Output exactly: status: pass | fail | retry | escalate diff --git a/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md new file mode 100644 index 0000000..1516bc6 --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md @@ -0,0 +1,16 @@ +You are the test-writing agent for the NightShift pastebin tutorial. + +Write only tests for the current task's acceptance criteria. +Do not implement application code. +Do not add tests for future tasks or behavior not named in the current task. + +Output only complete file content blocks. +Use one fenced block per file: +```file:relative/path.py + +``` + +Prefer pytest tests that describe the public behavior from the task. +Keep tests deterministic and isolated with temporary databases or temporary paths. +Use the existing package name `pastebin_app`. +If the app factory does not exist yet, write tests for the expected public interface that the implementer should create. diff --git a/nightshift/project_templates/tutorial-pastebin/README.md b/nightshift/project_templates/tutorial-pastebin/README.md index 3123b61..b18df22 100644 --- a/nightshift/project_templates/tutorial-pastebin/README.md +++ b/nightshift/project_templates/tutorial-pastebin/README.md @@ -55,3 +55,5 @@ The pipeline uses model fallback ordering for implementation attempts: 3. `deepseek-coder-v2:16b` Telemetry artifacts record which agent/model handled each stage and estimate token usage. + +This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them. 
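Reviewer sketch of how the test-first stages might route on failure, under stated assumptions: the stage ids and `on_fail` targets are copied from the `nightshift.yaml` changes in this diff, but `run_stages` and its retry accounting are hypothetical and do not reflect how `PipelineRunner` is implemented.

```python
from typing import Callable, List, Optional, Tuple

# (stage id, on_fail target) pairs mirroring the test-generation stages in nightshift.yaml.
STAGES: List[Tuple[str, Optional[str]]] = [
    ("write_tests", None),
    ("normalize_tests", None),
    ("validate_tests_patch", "write_tests"),
    ("apply_tests_patch", "write_tests"),
    ("review_tests", "write_tests"),
    ("implement", None),
]


def run_stages(run_stage: Callable[[str], bool], max_retries: int = 6) -> List[str]:
    """Hypothetical runner: on failure, jump back to the stage named by on_fail."""
    order = [stage_id for stage_id, _ in STAGES]
    on_fail = dict(STAGES)
    trace: List[str] = []
    retries = 0
    index = 0
    while index < len(order):
        stage_id = order[index]
        trace.append(stage_id)
        if run_stage(stage_id):
            index += 1
            continue
        target = on_fail[stage_id]
        if target is None or retries >= max_retries:
            trace.append("stop")  # no recovery route, or retry budget exhausted
            break
        retries += 1
        index = order.index(target)  # regenerate tests and replay the test stages
    return trace
```

With this shape, a rejected test review sends the run back to `write_tests` instead of letting the implementer code against out-of-scope tests.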
diff --git a/nightshift/project_templates/tutorial-pastebin/nightshift.yaml b/nightshift/project_templates/tutorial-pastebin/nightshift.yaml index d4f3172..431d783 100644 --- a/nightshift/project_templates/tutorial-pastebin/nightshift.yaml +++ b/nightshift/project_templates/tutorial-pastebin/nightshift.yaml @@ -21,7 +21,7 @@ safety: experiment: label: pastebin-model-fallback - prompt_variant: qwen-omnicoder-deepseek-v1 + prompt_variant: tdd-qwen-omnicoder-deepseek-v2 agents: planner: @@ -36,6 +36,12 @@ agents: temperature: 0.1 system_prompt: .nightshift/agents/implementer.md + test_writer: + backend: ollama + model: qwen2.5-coder:14b + temperature: 0.1 + system_prompt: .nightshift/agents/test-writer.md + implementer_omnicoder: backend: ollama model: carstenuhlig/omnicoder-9b @@ -62,7 +68,8 @@ agents: system_prompt: .nightshift/agents/reviewer.md pipeline: - max_task_retries: 3 + max_task_retries: 6 + stop_on_repeated_failure_signature_after: 6 continue_on_task_failure: false stages: - id: plan @@ -78,6 +85,35 @@ pipeline: type: repo_context output: context-pack.md + - id: write_tests + type: file_writer + agent: test_writer + output: proposed-tests.patch + + - id: normalize_tests + type: patch_normalizer + output: normalized-tests.patch + + - id: validate_tests_patch + type: patch_validator + output: test-patch-validation.md + max_files: 6 + max_lines: 500 + max_delete_ratio: 0.70 + on_fail: write_tests + + - id: apply_tests_patch + type: patch_apply + mode: apply + output: test-patch-apply-output.txt + on_fail: write_tests + + - id: review_tests + type: agent_review + agent: reviewer + output: test-review.md + on_fail: write_tests + - id: implement type: file_writer agent_pool: diff --git a/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/__init__.py b/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/__init__.py index b94a1e8..8a89423 100644 --- a/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/__init__.py +++ 
b/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/__init__.py @@ -1,3 +1 @@ -from .app import create_app - -__all__ = ["create_app"] +"""Pastebin package for the NightShift tutorial.""" diff --git a/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/app.py b/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/app.py index 351edcd..f686bb5 100644 --- a/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/app.py +++ b/nightshift/project_templates/tutorial-pastebin/src/pastebin_app/app.py @@ -1,128 +1 @@ -from __future__ import annotations - -from datetime import datetime, timezone -import sqlite3 -from pathlib import Path - -from flask import Flask, abort, g, jsonify, redirect, render_template, request, url_for - - -SCHEMA = """ -create table if not exists snippets ( - id integer primary key autoincrement, - title text not null, - body text not null, - language text default '', - tags text default '', - expires_at text default '', - created_at text not null -); -""" - - -def create_app(database_path: str | Path | None = None) -> Flask: - app = Flask(__name__, template_folder=str(Path(__file__).resolve().parents[2] / "templates")) - app.config["DATABASE"] = str(database_path or Path(app.instance_path) / "pastebin.sqlite3") - - @app.before_request - def _open_db() -> None: - Path(app.config["DATABASE"]).parent.mkdir(parents=True, exist_ok=True) - g.db = sqlite3.connect(app.config["DATABASE"]) - g.db.row_factory = sqlite3.Row - g.db.execute(SCHEMA) - - @app.teardown_request - def _close_db(exc) -> None: - db = g.pop("db", None) - if db is not None: - db.close() - - @app.get("/") - def index(): - snippets = list_snippets(g.db, request.args) - return render_template("index.html", snippets=snippets) - - @app.get("/new") - def new_snippet(): - return render_template("new.html") - - @app.post("/snippets") - def create_snippet_route(): - snippet_id = create_snippet(g.db, request.form or request.json or {}) - wants_json = 
request.is_json or "application/json" in request.headers.get("Accept", "") - if wants_json: - return jsonify(get_snippet(g.db, snippet_id)), 201 - return redirect(url_for("view_snippet", snippet_id=snippet_id)) - - @app.get("/snippets") - def list_snippets_route(): - snippets = list_snippets(g.db, request.args) - if "application/json" in request.headers.get("Accept", ""): - return jsonify(snippets) - return render_template("index.html", snippets=snippets) - - @app.get("/snippets/<int:snippet_id>") - def view_snippet(snippet_id: int): - snippet = get_snippet(g.db, snippet_id) - if snippet is None: - abort(404) - if is_expired(snippet): - abort(410) - if "application/json" in request.headers.get("Accept", ""): - return jsonify(snippet) - return render_template("view.html", snippet=snippet) - - return app - - -def create_snippet(db: sqlite3.Connection, data) -> int: - title = str(data.get("title", "")).strip() - body = str(data.get("body", "")).strip() - if not title or not body: - raise ValueError("title and body are required") - cursor = db.execute( - "insert into snippets(title, body, language, tags, expires_at, created_at) values (?, ?, ?, ?, ?, ?)", - ( - title, - body, - str(data.get("language", "")).strip(), - str(data.get("tags", "")).strip(), - str(data.get("expires_at", "")).strip(), - datetime.now(timezone.utc).isoformat(), - ), - ) - db.commit() - return int(cursor.lastrowid) - - -def get_snippet(db: sqlite3.Connection, snippet_id: int) -> dict | None: - row = db.execute("select * from snippets where id = ?", (snippet_id,)).fetchone() - return dict(row) if row else None - - -def list_snippets(db: sqlite3.Connection, args) -> list[dict]: - rows = db.execute("select * from snippets order by id desc").fetchall() - snippets = [dict(row) for row in rows if not is_expired(dict(row))] - query = str(args.get("q", "")).lower() - language = str(args.get("language", "")).lower() - tag = str(args.get("tag", "")).lower() - if query: - snippets = [item for item in snippets if query in 
item["title"].lower() or query in item["body"].lower()] - if language: - snippets = [item for item in snippets if item["language"].lower() == language] - if tag: - snippets = [item for item in snippets if tag in [part.strip().lower() for part in item["tags"].split(",")]] - return snippets - - -def is_expired(snippet: dict) -> bool: - value = snippet.get("expires_at") or "" - if not value: - return False - try: - expires = datetime.fromisoformat(value) - except ValueError: - return False - if expires.tzinfo is None: - expires = expires.replace(tzinfo=timezone.utc) - return expires <= datetime.now(timezone.utc) +"""Application code is generated by the NightShift tutorial tasks.""" diff --git a/nightshift/project_templates/tutorial-pastebin/templates/.gitkeep b/nightshift/project_templates/tutorial-pastebin/templates/.gitkeep new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/templates/.gitkeep @@ -0,0 +1 @@ + diff --git a/nightshift/project_templates/tutorial-pastebin/templates/index.html b/nightshift/project_templates/tutorial-pastebin/templates/index.html deleted file mode 100644 index f844ca3..0000000 --- a/nightshift/project_templates/tutorial-pastebin/templates/index.html +++ /dev/null @@ -1,18 +0,0 @@ - - - -

Snippets

- New snippet -
- - - - -
- - - diff --git a/nightshift/project_templates/tutorial-pastebin/templates/new.html b/nightshift/project_templates/tutorial-pastebin/templates/new.html deleted file mode 100644 index c7327d3..0000000 --- a/nightshift/project_templates/tutorial-pastebin/templates/new.html +++ /dev/null @@ -1,14 +0,0 @@ - - - -

New Snippet

-
- - - - - - -
- - diff --git a/nightshift/project_templates/tutorial-pastebin/templates/view.html b/nightshift/project_templates/tutorial-pastebin/templates/view.html deleted file mode 100644 index 131c519..0000000 --- a/nightshift/project_templates/tutorial-pastebin/templates/view.html +++ /dev/null @@ -1,8 +0,0 @@ - - - -

{{ snippet.title }}

-

{{ snippet.language }} {{ snippet.tags }}

-
{{ snippet.body }}
- - diff --git a/nightshift/project_templates/tutorial-pastebin/tests/.gitkeep b/nightshift/project_templates/tutorial-pastebin/tests/.gitkeep new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/nightshift/project_templates/tutorial-pastebin/tests/.gitkeep @@ -0,0 +1 @@ + diff --git a/nightshift/project_templates/tutorial-pastebin/tests/test_pastebin.py b/nightshift/project_templates/tutorial-pastebin/tests/test_pastebin.py deleted file mode 100644 index fe8f22a..0000000 --- a/nightshift/project_templates/tutorial-pastebin/tests/test_pastebin.py +++ /dev/null @@ -1,51 +0,0 @@ -from datetime import datetime, timedelta, timezone - -from pastebin_app import create_app - - -def client(tmp_path): - app = create_app(tmp_path / "pastebin.sqlite3") - app.config["TESTING"] = True - return app.test_client() - - -def test_create_and_view_snippet(tmp_path): - test_client = client(tmp_path) - response = test_client.post( - "/snippets", - json={"title": "Hello", "body": "print('hi')", "language": "python", "tags": "demo,test"}, - headers={"Accept": "application/json"}, - ) - - assert response.status_code == 201 - snippet_id = response.get_json()["id"] - view = test_client.get(f"/snippets/{snippet_id}", headers={"Accept": "application/json"}) - assert view.status_code == 200 - assert view.get_json()["language"] == "python" - - -def test_list_search_and_filters(tmp_path): - test_client = client(tmp_path) - test_client.post("/snippets", json={"title": "Python note", "body": "flask route", "language": "python", "tags": "web"}) - test_client.post("/snippets", json={"title": "SQL note", "body": "select", "language": "sql", "tags": "data"}) - - search = test_client.get("/snippets?q=flask", headers={"Accept": "application/json"}).get_json() - language = test_client.get("/snippets?language=sql", headers={"Accept": "application/json"}).get_json() - tag = test_client.get("/snippets?tag=web", headers={"Accept": "application/json"}).get_json() - - assert [item["title"] for item 
in search] == ["Python note"] - assert [item["title"] for item in language] == ["SQL note"] - assert [item["title"] for item in tag] == ["Python note"] - - -def test_expired_snippet_hidden_and_direct_lookup_gone(tmp_path): - test_client = client(tmp_path) - expired = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat() - response = test_client.post("/snippets", json={"title": "Old", "body": "gone", "expires_at": expired}, headers={"Accept": "application/json"}) - snippet_id = response.get_json()["id"] - - listed = test_client.get("/snippets", headers={"Accept": "application/json"}).get_json() - direct = test_client.get(f"/snippets/{snippet_id}", headers={"Accept": "application/json"}) - - assert listed == [] - assert direct.status_code == 410 diff --git a/nightshift/templates.py b/nightshift/templates.py index 115dbf0..d5ed071 100644 --- a/nightshift/templates.py +++ b/nightshift/templates.py @@ -40,7 +40,7 @@ agents: system_prompt: agents/debugger.md pipeline: - max_task_retries: 3 + max_task_retries: 6 stages: - id: plan type: agent @@ -195,7 +195,7 @@ agents: system_prompt: .nightshift/agents/debugger.md pipeline: - max_task_retries: 3 + max_task_retries: 6 continue_on_task_failure: false stages: - id: plan diff --git a/tests/test_config.py b/tests/test_config.py index 9526d69..186a3db 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -17,7 +17,7 @@ class ConfigTests(unittest.TestCase): self.assertEqual(config.project.name, "example-project") self.assertIn("planner", config.agents) - self.assertEqual(config.pipeline.max_task_retries, 3) + self.assertEqual(config.pipeline.max_task_retries, 6) self.assertEqual(config.pipeline.stages[0].id, "plan") def test_missing_required_section_fails_clearly(self) -> None: @@ -86,7 +86,7 @@ class ConfigTests(unittest.TestCase): config_path = root / "nightshift.yaml" config_path.write_text( config_path.read_text(encoding="utf-8").replace( - "max_task_retries: 3", + "max_task_retries: 6", "max_task_retries: 
three", ), encoding="utf-8", diff --git a/tests/test_init.py b/tests/test_init.py index 34bab1d..306fdfc 100644 --- a/tests/test_init.py +++ b/tests/test_init.py @@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase): self.assertIn("tutorial-imageboard", available_templates()) self.assertIn("tutorial-pastebin", available_templates()) - def test_init_pastebin_template_creates_app_and_model_fallback_config(self) -> None: + def test_init_pastebin_template_creates_skeleton_and_tdd_model_fallback_config(self) -> None: with tempfile.TemporaryDirectory() as directory: root = Path(directory) @@ -69,9 +69,14 @@ class InitProjectTests(unittest.TestCase): config = (root / "nightshift.yaml").read_text(encoding="utf-8") self.assertTrue((root / ".nightshift" / "tasks.md").exists()) + self.assertTrue((root / ".nightshift" / "agents" / "test-writer.md").exists()) self.assertTrue((root / "src" / "pastebin_app" / "app.py").exists()) - self.assertTrue((root / "tests" / "test_pastebin.py").exists()) + self.assertTrue((root / "tests" / ".gitkeep").exists()) + self.assertFalse((root / "tests" / "test_pastebin.py").exists()) self.assertIn("type: semantic_context", config) + self.assertIn("id: write_tests", config) + self.assertIn("id: review_tests", config) + self.assertIn("max_task_retries: 6", config) self.assertIn("implementer_qwen", config) self.assertIn("carstenuhlig/omnicoder-9b", config) self.assertIn("deepseek-coder-v2:16b", config) diff --git a/tests/test_reliability_features.py b/tests/test_reliability_features.py index d676391..b52983f 100644 --- a/tests/test_reliability_features.py +++ b/tests/test_reliability_features.py @@ -78,6 +78,23 @@ class ReliabilityFeatureTests(unittest.TestCase): self.assertTrue(decision.should_stop) self.assertIn("same failure signature", decision.reason) + def test_retry_churn_honors_configured_repeated_failure_threshold(self) -> None: + entries = tuple( + RetryMemoryEntry( + attempt=attempt, + stage_id="test", + status="fail", + 
cause="Command exited with code 1: python -m pytest -q", + next_stage="implement", + failure_signature="NameError | src/pastebin_app/app.py | 31 | python -m pytest -q", + ) + for attempt in range(1, 4) + ) + + decision = evaluate_retry_churn(entries, retry_budget=7, repeated_signature_after=6) + + self.assertFalse(decision.should_stop) + def test_build_failure_signature_prefers_project_traceback_over_pytest_cache(self) -> None: signature = build_failure_signature( "\n".join(