mirror of
https://github.com/khodges42/nightShift.git
synced 2026-06-14 18:18:36 +00:00
Fixes based on tests, do tdd
Changed the pastebin tutorial so it now starts skeletal: no prebuilt Flask behavior, no pre-generated task tests, and .gitkeep placeholders under templates/ and tests/. The new pipeline in nightshift/project_templates/tutorial-pastebin/nightshift.yaml:1 now runs:
plan -> semantic_context -> context -> write_tests -> review_tests -> implement -> pytest -> review
Added nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md:1, tightened the planner/implementer/reviewer/debugger prompts, mirrored the pipeline docs/example, and raised default retries to 6 for the basic starter plus pastebin.
I also fixed the retry policy issue in nightshift/escalation.py:17 and nightshift/pipeline.py:251: configured repeated-failure thresholds are now respected instead of hard-stopping early after three same-stage/same-cause failures. Non-implementation file_writer stages now get stage-specific retry artifacts, so test generation does not collide with implementation repair artifacts.
This commit is contained in:
parent
c12493a248
commit
3bb5bd4157
73
docs/codex/20260520-203827.md
Normal file
@ -0,0 +1,73 @@
# NightShift Integration Failure Analysis

## Immediate Causes

I would separate the failures into four buckets:

1. The pastebin template is not truly incremental.
   `tests/test_pastebin.py` already tests listing/filtering and expiration, even though `TASK-001` only asks for create/view. The stock app also already has a fairly complete `create_app` implementation. So the task is not "build feature 1"; it is "modify an already-complete app without breaking future-task behavior."

2. The retry stop policy is harsher than the config implies.
   Even with `stop_on_repeated_failure_signature_after: 6`, `nightshift/escalation.py` unconditionally stops after the last 3 entries have the same stage and cause. That explains the "same stage same reason" stop before the configured repeated-signature threshold.

3. The model got bad or insufficient context early.
   In the run artifacts, the planner asked for `app/models.py` and `app/routes.py`, both outside the actual scoped repo. That pushed it toward a hallucinated Flask/SQLAlchemy architecture. Later repairs added `tests/test_snippets.py` importing nonexistent `app`, then tried to repair by deleting large amounts of code, which patch validation correctly rejected.

4. The template and manual deletion created contradictory state.
   In the latest project, `src/pastebin_app/__init__.py` imports `create_app`, but `src/pastebin_app/app.py` no longer defines it. `tests/test_pastebin.py` is now empty, while generated `tests/test_snippets.py` expects a different app shape. That is exactly the kind of broken intermediate state a local model will churn on unless the orchestrator gives it a very explicit recovery path.

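The stop behavior described in item 2 can be illustrated with a small sketch. It is an illustration under stated assumptions, not the code in `nightshift/escalation.py`: `RetryEntry` and `should_stop_for_churn` are hypothetical names, and the fallback window of three only mirrors the hard-coded behavior described above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RetryEntry:
    """Hypothetical record of one failed attempt (not NightShift's real type)."""
    stage_id: str
    cause: str


def should_stop_for_churn(entries: Sequence[RetryEntry], threshold: Optional[int]) -> bool:
    """Stop only when the last `threshold` entries share one stage and one cause."""
    # Fall back to a window of 3 only when no positive threshold is configured.
    window_size = threshold if threshold and threshold > 0 else 3
    if len(entries) < window_size:
        return False
    window = entries[-window_size:]
    same_stage = len({entry.stage_id for entry in window}) == 1
    same_cause = len({entry.cause for entry in window}) == 1
    return same_stage and same_cause


# Three identical failures: a configured threshold of 6 keeps retrying,
# while the unconfigured fallback of 3 stops.
history = [RetryEntry("pytest", "test_failure")] * 3
stops_with_six = should_stop_for_churn(history, 6)
stops_with_default = should_stop_for_churn(history, None)
```

With a configured threshold of 6, three identical failures are not enough to stop; the run only stops once six consecutive entries match.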
## On Pre-Generated Code

I agree with your instinct: for this tutorial, pre-generated app code is hurting more than helping.

A better template would include:

- `pyproject.toml`
- package directories and empty `__init__.py`
- minimal templates if the task needs HTML later
- no complete app logic
- no future-task tests active during `TASK-001`
- a small `tests/test_task001.py` for only create/view

Then `TASK-002` adds list/filter tests, `TASK-003` adds expiration tests, etc. The AI should build forward, not preserve a hidden completed app.

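One way to keep future-task tests inactive is to select the test file from the task id. The sketch below assumes the `tests/test_task001.py` naming shown in the list above extends to later tasks; `task_test_path` and `pytest_command` are hypothetical helpers, not part of NightShift.

```python
import re
from pathlib import Path
from typing import List


def task_test_path(task_id: str, tests_dir: Path = Path("tests")) -> Path:
    """Map 'TASK-001' to tests/test_task001.py (assumed naming convention)."""
    match = re.fullmatch(r"TASK-(\d+)", task_id)
    if match is None:
        raise ValueError(f"unexpected task id: {task_id!r}")
    return tests_dir / f"test_task{match.group(1)}.py"


def pytest_command(task_id: str, full_suite: bool = False) -> List[str]:
    """Build a pytest command for the current task, or for the whole suite."""
    if full_suite:
        # Intended for the final check after the task's own tests pass.
        return ["pytest", "-q"]
    return ["pytest", "-q", task_test_path(task_id).as_posix()]


task_command = pytest_command("TASK-001")
suite_command = pytest_command("TASK-001", full_suite=True)
```

The orchestrator would run `task_command` while the task is in progress and `suite_command` once it succeeds.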
## Why Claude/Codex Feel Different

Production coding agents usually have an inner loop:

- inspect files
- edit narrowly
- run targeted tests
- read exact failure
- inspect more files
- edit again
- rerun

NightShift currently has a coarser loop: generate one patch, normalize, apply, run tests, summarize, retry. That is auditable, but it means each retry is another sampled patch rather than an interactive repair session. Swapping models does not fix bad task shape, bad context, or contradictory repo state.

## Best Options

Option A: fix the current design conservatively.

- Remove pre-generated pastebin app logic.
- Split tests by task.
- Run only task-relevant tests during the task, then full suite after success.
- Move deterministic repo context before planning, or at least always include file tree plus full contents of likely target files.
- Make churn stopping obey config; do not hard-stop after 3 same-stage failures unless configured.
- Improve retry signatures to ignore pytest cache warnings and prefer project traceback lines.

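The last bullet, on retry signatures, could look roughly like the sketch below. It assumes failure locations appear as lines shaped like `path.py:LINE: ErrorName`, that project code lives under `src/` or `tests/`, and that cache noise can be recognized by the markers listed; `failure_signature` and both constants are hypothetical, not NightShift's implementation.

```python
import re
from typing import Iterable

# Assumed noise markers; a real run may need a different list.
NOISE_MARKERS = ("PytestCacheWarning", ".pytest_cache")
# Assumed shape of a failure location line: "path.py:LINE: ErrorName".
TRACEBACK_LINE = re.compile(r"^(?P<path>[^\s:]+\.py):\d+: (?P<error>[A-Z]\w*)")


def failure_signature(output: str, project_roots: Iterable[str] = ("src/", "tests/")) -> str:
    """Return a stable signature, preferring traceback lines inside the project."""
    roots = tuple(project_roots)
    fallback = ""
    for raw in output.splitlines():
        line = raw.strip()
        if not line or any(marker in line for marker in NOISE_MARKERS):
            continue  # skip blank lines and cache warnings
        match = TRACEBACK_LINE.match(line)
        if match and match.group("path").startswith(roots):
            # Drop the line number so edits that shift lines keep the same signature.
            return f"{match.group('path')}: {match.group('error')}"
        if not fallback:
            fallback = line  # first meaningful line, used if no project line is found
    return fallback


sample = (
    "PytestCacheWarning: could not create cache path .pytest_cache\n"
    "E   ImportError: cannot import name 'create_app'\n"
    "src/pastebin_app/__init__.py:1: ImportError\n"
)
signature = failure_signature(sample)
```

Two runs that fail at the same project file with the same error type then compare equal even if the cache warning appears in only one of them.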
Option B: add a real repair micro-loop.

For command/test failures, run a bounded repair loop before consuming another global retry:

```text
failure -> classify -> inspect exact files -> produce small patch -> run targeted test -> repeat 2-4 times
```

That would make NightShift behave more like Codex/Claude while preserving artifacts.

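A minimal sketch of such a bounded loop follows. The two callables are hypothetical hooks that an orchestrator would supply; nothing here is taken from NightShift's code, and the classify and inspect steps are folded into the patch hook for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepairOutcome:
    fixed: bool
    attempts: int
    last_failure: Optional[str]


def repair_micro_loop(
    run_targeted_test: Callable[[], Optional[str]],
    propose_and_apply_patch: Callable[[str], None],
    max_attempts: int = 3,
) -> RepairOutcome:
    """Try small repairs against one targeted test before spending a global retry.

    `run_targeted_test` returns None on success or a failure message.
    `propose_and_apply_patch` receives that message and edits the working tree.
    """
    failure = run_targeted_test()
    attempts = 0
    while failure is not None and attempts < max_attempts:
        propose_and_apply_patch(failure)
        attempts += 1
        failure = run_targeted_test()
    return RepairOutcome(fixed=failure is None, attempts=attempts, last_failure=failure)


# Toy stand-ins: each "patch" removes one of two pretend bugs.
state = {"bugs": 2}


def fake_test() -> Optional[str]:
    return None if state["bugs"] == 0 else f"{state['bugs']} assertions failing"


def fake_patch(message: str) -> None:
    state["bugs"] -= 1


outcome = repair_micro_loop(fake_test, fake_patch, max_attempts=4)
```

Because the loop is bounded by `max_attempts`, a repair that never converges still returns control to the outer retry policy, and each attempt can be recorded as its own artifact.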
Option C: delegate hard repairs to production agent backends.

Add a `codex`/`claude-code` backend stage for implementation/repair. NightShift still owns task selection, safety, artifacts, tests, and reports, but lets a stronger tool run the inner edit/test loop.

My recommendation: do A first, then B. The template/task mismatch is the largest avoidable failure source, and the unconditional churn stop is a real policy bug. Once those are fixed, the remaining failures will be much more informative.

@ -44,6 +44,7 @@ nightshift.yaml
|
||||||
.nightshift/
|
.nightshift/
|
||||||
agents/
|
agents/
|
||||||
planner.md
|
planner.md
|
||||||
|
test-writer.md
|
||||||
implementer.md
|
implementer.md
|
||||||
debugger.md
|
debugger.md
|
||||||
reviewer.md
|
reviewer.md
|
||||||
|
|
@ -56,7 +57,7 @@ pyproject.toml
|
||||||
README.md
|
README.md
|
||||||
```
|
```
|
||||||
|
|
||||||
The template includes a working baseline Flask app and deterministic pytest suite. NightShift tasks then extend or verify app behavior in small increments.
|
The template intentionally does not include a working Flask app or pre-generated task tests. For each task, NightShift first generates acceptance tests from the current task's acceptance criteria, reviews those tests for scope, and then asks the implementation agent to make them pass.
|
||||||
|
|
||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
|
|
@ -85,7 +86,7 @@ NightShift uses Ollama's local HTTP API, normally at `http://localhost:11434`.
|
||||||
|
|
||||||
## Model Fallback
|
## Model Fallback
|
||||||
|
|
||||||
The template's implementation stage uses this fallback order:
|
The template writes tests with `qwen2.5-coder:14b`. The implementation stage uses this fallback order:
|
||||||
|
|
||||||
1. `qwen2.5-coder:14b`
|
1. `qwen2.5-coder:14b`
|
||||||
2. `carstenuhlig/omnicoder-9b`
|
2. `carstenuhlig/omnicoder-9b`
|
||||||
|
|
@ -93,6 +94,16 @@ The template's implementation stage uses this fallback order:
|
||||||
|
|
||||||
NightShift records which agent/model handled each stage in `telemetry-summary.md`.
|
NightShift records which agent/model handled each stage in `telemetry-summary.md`.
|
||||||
|
|
||||||
|
## TDD Pipeline
|
||||||
|
|
||||||
|
The task pipeline runs in this shape:
|
||||||
|
|
||||||
|
```text
|
||||||
|
plan -> semantic_context -> context -> write_tests -> review_tests -> implement -> pytest -> review
|
||||||
|
```
|
||||||
|
|
||||||
|
Generated tests should cover only the current task. They are expected to fail before implementation, so the pipeline reviews the test patch but does not run pytest until after the implementation patch is applied.
|
||||||
|
|
||||||
## Task Plan
|
## Task Plan
|
||||||
|
|
||||||
The template writes the full task list to `.nightshift/tasks.md`. A copy is included here as [tasks.md](tasks.md).
|
The template writes the full task list to `.nightshift/tasks.md`. A copy is included here as [tasks.md](tasks.md).
|
||||||
|
|
|
||||||
|
|
@ -21,7 +21,7 @@ safety:
|
||||||
|
|
||||||
experiment:
|
experiment:
|
||||||
label: pastebin-model-fallback
|
label: pastebin-model-fallback
|
||||||
prompt_variant: qwen-omnicoder-deepseek-v1
|
prompt_variant: tdd-qwen-omnicoder-deepseek-v2
|
||||||
|
|
||||||
agents:
|
agents:
|
||||||
planner:
|
planner:
|
||||||
|
|
@ -36,6 +36,12 @@ agents:
|
||||||
temperature: 0.1
|
temperature: 0.1
|
||||||
system_prompt: .nightshift/agents/implementer.md
|
system_prompt: .nightshift/agents/implementer.md
|
||||||
|
|
||||||
|
test_writer:
|
||||||
|
backend: ollama
|
||||||
|
model: qwen2.5-coder:14b
|
||||||
|
temperature: 0.1
|
||||||
|
system_prompt: .nightshift/agents/test-writer.md
|
||||||
|
|
||||||
implementer_omnicoder:
|
implementer_omnicoder:
|
||||||
backend: ollama
|
backend: ollama
|
||||||
model: carstenuhlig/omnicoder-9b
|
model: carstenuhlig/omnicoder-9b
|
||||||
|
|
@ -62,7 +68,8 @@ agents:
|
||||||
system_prompt: .nightshift/agents/reviewer.md
|
system_prompt: .nightshift/agents/reviewer.md
|
||||||
|
|
||||||
pipeline:
|
pipeline:
|
||||||
max_task_retries: 3
|
max_task_retries: 6
|
||||||
|
stop_on_repeated_failure_signature_after: 6
|
||||||
continue_on_task_failure: false
|
continue_on_task_failure: false
|
||||||
stages:
|
stages:
|
||||||
- id: plan
|
- id: plan
|
||||||
|
|
@ -78,6 +85,35 @@ pipeline:
|
||||||
type: repo_context
|
type: repo_context
|
||||||
output: context-pack.md
|
output: context-pack.md
|
||||||
|
|
||||||
|
- id: write_tests
|
||||||
|
type: file_writer
|
||||||
|
agent: test_writer
|
||||||
|
output: proposed-tests.patch
|
||||||
|
|
||||||
|
- id: normalize_tests
|
||||||
|
type: patch_normalizer
|
||||||
|
output: normalized-tests.patch
|
||||||
|
|
||||||
|
- id: validate_tests_patch
|
||||||
|
type: patch_validator
|
||||||
|
output: test-patch-validation.md
|
||||||
|
max_files: 6
|
||||||
|
max_lines: 500
|
||||||
|
max_delete_ratio: 0.70
|
||||||
|
on_fail: write_tests
|
||||||
|
|
||||||
|
- id: apply_tests_patch
|
||||||
|
type: patch_apply
|
||||||
|
mode: apply
|
||||||
|
output: test-patch-apply-output.txt
|
||||||
|
on_fail: write_tests
|
||||||
|
|
||||||
|
- id: review_tests
|
||||||
|
type: agent_review
|
||||||
|
agent: reviewer
|
||||||
|
output: test-review.md
|
||||||
|
on_fail: write_tests
|
||||||
|
|
||||||
- id: implement
|
- id: implement
|
||||||
type: file_writer
|
type: file_writer
|
||||||
agent_pool:
|
agent_pool:
|
||||||
|
|
|
||||||
|
|
@ -22,17 +22,17 @@ def evaluate_retry_churn(
 ) -> EscalationDecision:
     if len(entries) < 2:
         return EscalationDecision(False, "continue", "Not enough retry history for churn detection.")
-    recent = entries[-3:]
-    same_stage = len({entry.stage_id for entry in recent}) == 1
-    same_cause = len({entry.cause for entry in recent}) == 1
-    recent_signatures = [entry.failure_signature for entry in entries[-2:] if entry.failure_signature]
-    same_signature = len(recent_signatures) == 2 and len(set(recent_signatures)) == 1
+    churn_threshold = repeated_signature_after if repeated_signature_after and repeated_signature_after > 0 else 3
+    signature_window = entries[-churn_threshold:] if len(entries) >= churn_threshold else ()
+    recent_signatures = [entry.failure_signature for entry in signature_window if entry.failure_signature]
+    same_signature = len(recent_signatures) == churn_threshold and len(set(recent_signatures)) == 1
+    stage_cause_window = entries[-churn_threshold:] if len(entries) >= churn_threshold else ()
+    same_stage = bool(stage_cause_window) and len({entry.stage_id for entry in stage_cause_window}) == 1
+    same_cause = bool(stage_cause_window) and len({entry.cause for entry in stage_cause_window}) == 1
     if len(entries) >= retry_budget and retry_budget > 0:
         return EscalationDecision(True, "human review", "Configured retry budget is exhausted.")
     if (
-        repeated_signature_after is not None
-        and repeated_signature_after > 0
-        and len(entries) >= repeated_signature_after
+        len(entries) >= churn_threshold
         and same_signature
     ):
         return EscalationDecision(
@ -40,7 +40,7 @@ def evaluate_retry_churn(
             "debugger review or larger model",
             "The same failure signature repeated on consecutive retries.",
         )
-    if len(recent) == 3 and same_stage and same_cause:
+    if len(entries) >= churn_threshold and same_stage and same_cause:
         return EscalationDecision(True, "debugger review or larger model", "The same stage is failing with the same reason repeatedly.")
     return EscalationDecision(False, "continue", "No retry churn detected.")
 
@ -251,8 +251,7 @@ class PipelineRunner:
|
||||||
decision = evaluate_retry_churn(
|
decision = evaluate_retry_churn(
|
||||||
tuple(retry_memory),
|
tuple(retry_memory),
|
||||||
retry_budget=self.config.pipeline.max_task_retries + 1,
|
retry_budget=self.config.pipeline.max_task_retries + 1,
|
||||||
repeated_signature_after=self.config.pipeline.stop_on_repeated_failure_signature_after
|
repeated_signature_after=self.config.pipeline.stop_on_repeated_failure_signature_after,
|
||||||
or self.config.pipeline.max_task_retries,
|
|
||||||
)
|
)
|
||||||
self.artifacts.write_stage_output(
|
self.artifacts.write_stage_output(
|
||||||
task.id,
|
task.id,
|
||||||
|
|
@ -592,8 +591,8 @@ class PipelineRunner:
|
||||||
f"# Implementation Summary\n\nStatus: fail\nReason: {exc}\n",
|
f"# Implementation Summary\n\nStatus: fail\nReason: {exc}\n",
|
||||||
)
|
)
|
||||||
return StageResult(stage.id, "fail", str(exc), output_path=result.output_path)
|
return StageResult(stage.id, "fail", str(exc), output_path=result.output_path)
|
||||||
patch_filename = "repair-{0}.patch".format(retry_count) if retry_count else (stage.output or "proposed.patch")
|
patch_filename = _writer_patch_filename(stage, retry_count)
|
||||||
summary_filename = "implementation-summary.md" if retry_count == 0 else f"repair-summary-{retry_count}.md"
|
summary_filename = _writer_summary_filename(stage, retry_count)
|
||||||
proposed_path = self.artifacts.write_stage_output(task.id, patch_filename, patch)
|
proposed_path = self.artifacts.write_stage_output(task.id, patch_filename, patch)
|
||||||
summary_path = self.artifacts.write_stage_output(
|
summary_path = self.artifacts.write_stage_output(
|
||||||
task.id,
|
task.id,
|
||||||
|
|
@ -728,7 +727,7 @@ class PipelineRunner:
|
||||||
try:
|
try:
|
||||||
patch = normalize_patch_text(stdout)
|
patch = normalize_patch_text(stdout)
|
||||||
except PipelineError:
|
except PipelineError:
|
||||||
summary_filename = "implementation-summary.md" if retry_count == 0 else f"repair-summary-{retry_count}.md"
|
summary_filename = _writer_summary_filename(stage, retry_count)
|
||||||
reason = str(exc)
|
reason = str(exc)
|
||||||
if "generated patch has no changes" in reason:
|
if "generated patch has no changes" in reason:
|
||||||
next_stage = self._stage_after_patch_flow(stage.id)
|
next_stage = self._stage_after_patch_flow(stage.id)
|
||||||
|
|
@ -758,8 +757,8 @@ class PipelineRunner:
|
||||||
patch_reason = "Fallback patch written from unified diff output."
|
patch_reason = "Fallback patch written from unified diff output."
|
||||||
log_message = "Wrote fallback patch from unified diff output"
|
log_message = "Wrote fallback patch from unified diff output"
|
||||||
break
|
break
|
||||||
patch_filename = "repair-{0}.patch".format(retry_count) if retry_count else (stage.output or "proposed.patch")
|
patch_filename = _writer_patch_filename(stage, retry_count)
|
||||||
summary_filename = "implementation-summary.md" if retry_count == 0 else f"repair-summary-{retry_count}.md"
|
summary_filename = _writer_summary_filename(stage, retry_count)
|
||||||
proposed_path = self.artifacts.write_stage_output(task.id, patch_filename, patch)
|
proposed_path = self.artifacts.write_stage_output(task.id, patch_filename, patch)
|
||||||
summary_path = self.artifacts.write_stage_output(
|
summary_path = self.artifacts.write_stage_output(
|
||||||
task.id,
|
task.id,
|
||||||
|
|
@ -1381,6 +1380,21 @@ def _latest_patch_like_output(previous_outputs: dict[str, str]) -> str:
     raise PipelineError("Patch error: no previous patch output found.")
 
 
+def _writer_patch_filename(stage: StageConfig, retry_count: int) -> str:
+    if retry_count <= 0:
+        return stage.output or "proposed.patch"
+    if stage.type == "code_writer" or stage.id == "implement":
+        return f"repair-{retry_count}.patch"
+    return _attempt_filename(stage.output or f"{stage.id}.patch", retry_count)
+
+
+def _writer_summary_filename(stage: StageConfig, retry_count: int) -> str:
+    if stage.type == "code_writer" or stage.id == "implement":
+        return "implementation-summary.md" if retry_count <= 0 else f"repair-summary-{retry_count}.md"
+    base = f"{stage.id}-summary.md"
+    return base if retry_count <= 0 else _attempt_filename(base, retry_count)
+
+
 def _attempt_filename(filename: str, retry_count: int) -> str:
     if retry_count <= 0:
         return filename
@ -38,7 +38,7 @@ agents:
|
||||||
system_prompt: agents/debugger.md
|
system_prompt: agents/debugger.md
|
||||||
|
|
||||||
pipeline:
|
pipeline:
|
||||||
max_task_retries: 3
|
max_task_retries: 6
|
||||||
stages:
|
stages:
|
||||||
- id: plan
|
- id: plan
|
||||||
type: agent
|
type: agent
|
||||||
|
|
|
||||||
|
|
@ -1,6 +1,9 @@
|
||||||
You are the debugger agent for the NightShift pastebin tutorial.
|
You are the debugger agent for the NightShift pastebin tutorial.
|
||||||
|
|
||||||
Diagnose failed attempts without editing files.
|
Diagnose failed attempts without editing files.
|
||||||
|
Distinguish inaccurate generated tests from implementation bugs.
|
||||||
|
If tests are inaccurate for the current task, recommend retrying `write_tests`.
|
||||||
|
If implementation is wrong, recommend the smallest implementation repair and name files that should not be modified.
|
||||||
Return:
|
Return:
|
||||||
- concise diagnosis
|
- concise diagnosis
|
||||||
- recommended next action
|
- recommended next action
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,11 @@
|
||||||
You are the implementation agent for the NightShift pastebin tutorial.
|
You are the implementation agent for the NightShift pastebin tutorial.
|
||||||
|
|
||||||
|
Implement the smallest application change that satisfies the current task and the generated tests.
|
||||||
|
Do not rewrite generated tests unless the retry context explicitly says they are inaccurate.
|
||||||
|
Do not add behavior for future tasks unless needed to satisfy the current tests.
|
||||||
|
Use Flask and sqlite from the standard library unless existing project files already introduce another framework.
|
||||||
|
Keep the public package name `pastebin_app`.
|
||||||
|
|
||||||
Output only complete file content blocks.
|
Output only complete file content blocks.
|
||||||
Use one fenced block per file:
|
Use one fenced block per file:
|
||||||
```file:relative/path.py
|
```file:relative/path.py
|
||||||
|
|
|
||||||
|
|
@ -1,7 +1,14 @@
|
||||||
You are the planning agent for the NightShift pastebin tutorial.
|
You are the planning agent for the NightShift pastebin tutorial.
|
||||||
|
|
||||||
Create a concise implementation plan for the current task.
|
Create a concise TDD implementation plan for the current task.
|
||||||
|
|
||||||
|
Plan in this order:
|
||||||
|
1. Which acceptance tests should be generated for only this task.
|
||||||
|
2. Which application files likely need to change.
|
||||||
|
3. The smallest implementation slice that should make those tests pass.
|
||||||
|
|
||||||
If repository context is needed, request it with lookup_requests.
|
If repository context is needed, request it with lookup_requests.
|
||||||
Prefer small edits and deterministic tests.
|
Prefer small edits and deterministic tests.
|
||||||
|
Do not assume files outside the configured scoped paths exist.
|
||||||
|
Do not propose SQLAlchemy unless existing repository files already use it.
|
||||||
Do not write code.
|
Do not write code.
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,8 @@
|
||||||
You are the review agent for the NightShift pastebin tutorial.
|
You are the review agent for the NightShift pastebin tutorial.
|
||||||
|
|
||||||
|
When reviewing generated tests, check that they map only to the current task acceptance criteria and do not require future-task behavior.
|
||||||
|
When reviewing implementation, check that the change is small, deterministic, and satisfies the generated tests without unrelated rewrites.
|
||||||
|
|
||||||
Output exactly:
|
Output exactly:
|
||||||
|
|
||||||
status: pass | fail | retry | escalate
|
status: pass | fail | retry | escalate
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,16 @@
|
||||||
|
You are the test-writing agent for the NightShift pastebin tutorial.
|
||||||
|
|
||||||
|
Write only tests for the current task's acceptance criteria.
|
||||||
|
Do not implement application code.
|
||||||
|
Do not add tests for future tasks or behavior not named in the current task.
|
||||||
|
|
||||||
|
Output only complete file content blocks.
|
||||||
|
Use one fenced block per file:
|
||||||
|
```file:relative/path.py
|
||||||
|
<complete file content>
|
||||||
|
```
|
||||||
|
|
||||||
|
Prefer pytest tests that describe the public behavior from the task.
|
||||||
|
Keep tests deterministic and isolated with temporary databases or temporary paths.
|
||||||
|
Use the existing package name `pastebin_app`.
|
||||||
|
If the app factory does not exist yet, write tests for the expected public interface that the implementer should create.
|
||||||
|
|
@ -55,3 +55,5 @@ The pipeline uses model fallback ordering for implementation attempts:
|
||||||
3. `deepseek-coder-v2:16b`
|
3. `deepseek-coder-v2:16b`
|
||||||
|
|
||||||
Telemetry artifacts record which agent/model handled each stage and estimate token usage.
|
Telemetry artifacts record which agent/model handled each stage and estimate token usage.
|
||||||
|
|
||||||
|
This template uses a TDD-oriented pipeline. It starts with a skeletal package, generates task-specific pytest tests from the current task acceptance criteria, reviews those tests for scope, and then implements only enough application code to pass them.
|
||||||
|
|
|
||||||
|
|
@ -21,7 +21,7 @@ safety:
|
||||||
|
|
||||||
experiment:
|
experiment:
|
||||||
label: pastebin-model-fallback
|
label: pastebin-model-fallback
|
||||||
prompt_variant: qwen-omnicoder-deepseek-v1
|
prompt_variant: tdd-qwen-omnicoder-deepseek-v2
|
||||||
|
|
||||||
agents:
|
agents:
|
||||||
planner:
|
planner:
|
||||||
|
|
@ -36,6 +36,12 @@ agents:
|
||||||
temperature: 0.1
|
temperature: 0.1
|
||||||
system_prompt: .nightshift/agents/implementer.md
|
system_prompt: .nightshift/agents/implementer.md
|
||||||
|
|
||||||
|
test_writer:
|
||||||
|
backend: ollama
|
||||||
|
model: qwen2.5-coder:14b
|
||||||
|
temperature: 0.1
|
||||||
|
system_prompt: .nightshift/agents/test-writer.md
|
||||||
|
|
||||||
implementer_omnicoder:
|
implementer_omnicoder:
|
||||||
backend: ollama
|
backend: ollama
|
||||||
model: carstenuhlig/omnicoder-9b
|
model: carstenuhlig/omnicoder-9b
|
||||||
|
|
@ -62,7 +68,8 @@ agents:
|
||||||
system_prompt: .nightshift/agents/reviewer.md
|
system_prompt: .nightshift/agents/reviewer.md
|
||||||
|
|
||||||
pipeline:
|
pipeline:
|
||||||
max_task_retries: 3
|
max_task_retries: 6
|
||||||
|
stop_on_repeated_failure_signature_after: 6
|
||||||
continue_on_task_failure: false
|
continue_on_task_failure: false
|
||||||
stages:
|
stages:
|
||||||
- id: plan
|
- id: plan
|
||||||
|
|
@ -78,6 +85,35 @@ pipeline:
|
||||||
type: repo_context
|
type: repo_context
|
||||||
output: context-pack.md
|
output: context-pack.md
|
||||||
|
|
||||||
|
- id: write_tests
|
||||||
|
type: file_writer
|
||||||
|
agent: test_writer
|
||||||
|
output: proposed-tests.patch
|
||||||
|
|
||||||
|
- id: normalize_tests
|
||||||
|
type: patch_normalizer
|
||||||
|
output: normalized-tests.patch
|
||||||
|
|
||||||
|
- id: validate_tests_patch
|
||||||
|
type: patch_validator
|
||||||
|
output: test-patch-validation.md
|
||||||
|
max_files: 6
|
||||||
|
max_lines: 500
|
||||||
|
max_delete_ratio: 0.70
|
||||||
|
on_fail: write_tests
|
||||||
|
|
||||||
|
- id: apply_tests_patch
|
||||||
|
type: patch_apply
|
||||||
|
mode: apply
|
||||||
|
output: test-patch-apply-output.txt
|
||||||
|
on_fail: write_tests
|
||||||
|
|
||||||
|
- id: review_tests
|
||||||
|
type: agent_review
|
||||||
|
agent: reviewer
|
||||||
|
output: test-review.md
|
||||||
|
on_fail: write_tests
|
||||||
|
|
||||||
- id: implement
|
- id: implement
|
||||||
type: file_writer
|
type: file_writer
|
||||||
agent_pool:
|
agent_pool:
|
||||||
|
|
|
||||||
|
|
@ -1,3 +1 @@
|
||||||
from .app import create_app
|
"""Pastebin package for the NightShift tutorial."""
|
||||||
|
|
||||||
__all__ = ["create_app"]
|
|
||||||
|
|
|
||||||
|
|
@ -1,128 +1 @@
|
||||||
from __future__ import annotations
|
"""Application code is generated by the NightShift tutorial tasks."""
|
||||||
|
|
||||||
from datetime import datetime, timezone
|
|
||||||
import sqlite3
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
from flask import Flask, abort, g, jsonify, redirect, render_template, request, url_for
|
|
||||||
|
|
||||||
|
|
||||||
SCHEMA = """
|
|
||||||
create table if not exists snippets (
|
|
||||||
id integer primary key autoincrement,
|
|
||||||
title text not null,
|
|
||||||
body text not null,
|
|
||||||
language text default '',
|
|
||||||
tags text default '',
|
|
||||||
expires_at text default '',
|
|
||||||
created_at text not null
|
|
||||||
);
|
|
||||||
"""
|
|
||||||
|
|
||||||
|
|
||||||
def create_app(database_path: str | Path | None = None) -> Flask:
|
|
||||||
app = Flask(__name__, template_folder=str(Path(__file__).resolve().parents[2] / "templates"))
|
|
||||||
app.config["DATABASE"] = str(database_path or Path(app.instance_path) / "pastebin.sqlite3")
|
|
||||||
|
|
||||||
@app.before_request
|
|
||||||
def _open_db() -> None:
|
|
||||||
Path(app.config["DATABASE"]).parent.mkdir(parents=True, exist_ok=True)
|
|
||||||
g.db = sqlite3.connect(app.config["DATABASE"])
|
|
||||||
g.db.row_factory = sqlite3.Row
|
|
||||||
g.db.execute(SCHEMA)
|
|
||||||
|
|
||||||
@app.teardown_request
|
|
||||||
def _close_db(exc) -> None:
|
|
||||||
db = g.pop("db", None)
|
|
||||||
if db is not None:
|
|
||||||
db.close()
|
|
||||||
|
|
||||||
@app.get("/")
|
|
||||||
def index():
|
|
||||||
snippets = list_snippets(g.db, request.args)
|
|
||||||
return render_template("index.html", snippets=snippets)
|
|
||||||
|
|
||||||
@app.get("/new")
|
|
||||||
def new_snippet():
|
|
||||||
return render_template("new.html")
|
|
||||||
|
|
||||||
@app.post("/snippets")
|
|
||||||
def create_snippet_route():
|
|
||||||
snippet_id = create_snippet(g.db, request.form or request.json or {})
|
|
||||||
wants_json = request.is_json or "application/json" in request.headers.get("Accept", "")
|
|
||||||
if wants_json:
|
|
||||||
return jsonify(get_snippet(g.db, snippet_id)), 201
|
|
||||||
return redirect(url_for("view_snippet", snippet_id=snippet_id))
|
|
||||||
|
|
||||||
@app.get("/snippets")
|
|
||||||
def list_snippets_route():
|
|
||||||
snippets = list_snippets(g.db, request.args)
|
|
||||||
if "application/json" in request.headers.get("Accept", ""):
|
|
||||||
return jsonify(snippets)
|
|
||||||
return render_template("index.html", snippets=snippets)
|
|
||||||
|
|
||||||
@app.get("/snippets/<int:snippet_id>")
|
|
||||||
def view_snippet(snippet_id: int):
|
|
||||||
snippet = get_snippet(g.db, snippet_id)
|
|
||||||
if snippet is None:
|
|
||||||
abort(404)
|
|
||||||
if is_expired(snippet):
|
|
||||||
abort(410)
|
|
||||||
if "application/json" in request.headers.get("Accept", ""):
|
|
||||||
return jsonify(snippet)
|
|
||||||
return render_template("view.html", snippet=snippet)
|
|
||||||
|
|
||||||
return app
|
|
||||||
|
|
||||||
|
|
||||||
def create_snippet(db: sqlite3.Connection, data) -> int:
|
|
||||||
title = str(data.get("title", "")).strip()
|
|
||||||
body = str(data.get("body", "")).strip()
|
|
||||||
if not title or not body:
|
|
||||||
raise ValueError("title and body are required")
|
|
||||||
cursor = db.execute(
|
|
||||||
"insert into snippets(title, body, language, tags, expires_at, created_at) values (?, ?, ?, ?, ?, ?)",
|
|
||||||
(
|
|
||||||
title,
|
|
||||||
body,
|
|
||||||
str(data.get("language", "")).strip(),
|
|
||||||
str(data.get("tags", "")).strip(),
|
|
||||||
str(data.get("expires_at", "")).strip(),
|
|
||||||
datetime.now(timezone.utc).isoformat(),
|
|
||||||
),
|
|
||||||
)
|
|
||||||
db.commit()
|
|
||||||
return int(cursor.lastrowid)
|
|
||||||
|
|
||||||
|
|
||||||
def get_snippet(db: sqlite3.Connection, snippet_id: int) -> dict | None:
|
|
||||||
row = db.execute("select * from snippets where id = ?", (snippet_id,)).fetchone()
|
|
||||||
return dict(row) if row else None
|
|
||||||
|
|
||||||
|
|
||||||
def list_snippets(db: sqlite3.Connection, args) -> list[dict]:
|
|
||||||
rows = db.execute("select * from snippets order by id desc").fetchall()
|
|
||||||
snippets = [dict(row) for row in rows if not is_expired(dict(row))]
|
|
||||||
query = str(args.get("q", "")).lower()
|
|
||||||
language = str(args.get("language", "")).lower()
|
|
||||||
tag = str(args.get("tag", "")).lower()
|
|
||||||
if query:
|
|
||||||
snippets = [item for item in snippets if query in item["title"].lower() or query in item["body"].lower()]
|
|
||||||
if language:
|
|
||||||
snippets = [item for item in snippets if item["language"].lower() == language]
|
|
||||||
if tag:
|
|
||||||
snippets = [item for item in snippets if tag in [part.strip().lower() for part in item["tags"].split(",")]]
|
|
||||||
return snippets
|
|
||||||
|
|
||||||
|
|
||||||
def is_expired(snippet: dict) -> bool:
|
|
||||||
value = snippet.get("expires_at") or ""
|
|
||||||
if not value:
|
|
||||||
return False
|
|
||||||
try:
|
|
||||||
expires = datetime.fromisoformat(value)
|
|
||||||
except ValueError:
|
|
||||||
return False
|
|
||||||
if expires.tzinfo is None:
|
|
||||||
expires = expires.replace(tzinfo=timezone.utc)
|
|
||||||
return expires <= datetime.now(timezone.utc)
|
|
||||||
|
|
|
||||||
|
|
@@ -0,0 +1 @@
|
||||||
|
|
||||||
|
|
@@ -1,18 +0,0 @@
|
||||||
<!doctype html>
|
|
||||||
<html lang="en">
|
|
||||||
<body>
|
|
||||||
<h1>Snippets</h1>
|
|
||||||
<a href="/new">New snippet</a>
|
|
||||||
<form method="get" action="/snippets">
|
|
||||||
<input name="q" placeholder="Search">
|
|
||||||
<input name="language" placeholder="Language">
|
|
||||||
<input name="tag" placeholder="Tag">
|
|
||||||
<button type="submit">Filter</button>
|
|
||||||
</form>
|
|
||||||
<ul>
|
|
||||||
{% for snippet in snippets %}
|
|
||||||
<li><a href="/snippets/{{ snippet.id }}">{{ snippet.title }}</a> {{ snippet.language }} {{ snippet.tags }}</li>
|
|
||||||
{% endfor %}
|
|
||||||
</ul>
|
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
|
|
@@ -1,14 +0,0 @@
|
||||||
<!doctype html>
|
|
||||||
<html lang="en">
|
|
||||||
<body>
|
|
||||||
<h1>New Snippet</h1>
|
|
||||||
<form method="post" action="/snippets">
|
|
||||||
<input name="title" placeholder="Title" required>
|
|
||||||
<textarea name="body" required></textarea>
|
|
||||||
<input name="language" placeholder="Language">
|
|
||||||
<input name="tags" placeholder="Tags">
|
|
||||||
<input name="expires_at" placeholder="Expires at ISO timestamp">
|
|
||||||
<button type="submit">Create</button>
|
|
||||||
</form>
|
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
|
|
@@ -1,8 +0,0 @@
|
||||||
<!doctype html>
|
|
||||||
<html lang="en">
|
|
||||||
<body>
|
|
||||||
<h1>{{ snippet.title }}</h1>
|
|
||||||
<p>{{ snippet.language }} {{ snippet.tags }}</p>
|
|
||||||
<pre>{{ snippet.body }}</pre>
|
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
|
|
@@ -0,0 +1 @@
|
||||||
|
|
||||||
|
|
@@ -1,51 +0,0 @@
|
||||||
from datetime import datetime, timedelta, timezone
|
|
||||||
|
|
||||||
from pastebin_app import create_app
|
|
||||||
|
|
||||||
|
|
||||||
def client(tmp_path):
|
|
||||||
app = create_app(tmp_path / "pastebin.sqlite3")
|
|
||||||
app.config["TESTING"] = True
|
|
||||||
return app.test_client()
|
|
||||||
|
|
||||||
|
|
||||||
def test_create_and_view_snippet(tmp_path):
|
|
||||||
test_client = client(tmp_path)
|
|
||||||
response = test_client.post(
|
|
||||||
"/snippets",
|
|
||||||
json={"title": "Hello", "body": "print('hi')", "language": "python", "tags": "demo,test"},
|
|
||||||
headers={"Accept": "application/json"},
|
|
||||||
)
|
|
||||||
|
|
||||||
assert response.status_code == 201
|
|
||||||
snippet_id = response.get_json()["id"]
|
|
||||||
view = test_client.get(f"/snippets/{snippet_id}", headers={"Accept": "application/json"})
|
|
||||||
assert view.status_code == 200
|
|
||||||
assert view.get_json()["language"] == "python"
|
|
||||||
|
|
||||||
|
|
||||||
def test_list_search_and_filters(tmp_path):
|
|
||||||
test_client = client(tmp_path)
|
|
||||||
test_client.post("/snippets", json={"title": "Python note", "body": "flask route", "language": "python", "tags": "web"})
|
|
||||||
test_client.post("/snippets", json={"title": "SQL note", "body": "select", "language": "sql", "tags": "data"})
|
|
||||||
|
|
||||||
search = test_client.get("/snippets?q=flask", headers={"Accept": "application/json"}).get_json()
|
|
||||||
language = test_client.get("/snippets?language=sql", headers={"Accept": "application/json"}).get_json()
|
|
||||||
tag = test_client.get("/snippets?tag=web", headers={"Accept": "application/json"}).get_json()
|
|
||||||
|
|
||||||
assert [item["title"] for item in search] == ["Python note"]
|
|
||||||
assert [item["title"] for item in language] == ["SQL note"]
|
|
||||||
assert [item["title"] for item in tag] == ["Python note"]
|
|
||||||
|
|
||||||
|
|
||||||
def test_expired_snippet_hidden_and_direct_lookup_gone(tmp_path):
|
|
||||||
test_client = client(tmp_path)
|
|
||||||
expired = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
|
|
||||||
response = test_client.post("/snippets", json={"title": "Old", "body": "gone", "expires_at": expired}, headers={"Accept": "application/json"})
|
|
||||||
snippet_id = response.get_json()["id"]
|
|
||||||
|
|
||||||
listed = test_client.get("/snippets", headers={"Accept": "application/json"}).get_json()
|
|
||||||
direct = test_client.get(f"/snippets/{snippet_id}", headers={"Accept": "application/json"})
|
|
||||||
|
|
||||||
assert listed == []
|
|
||||||
assert direct.status_code == 410
|
|
||||||
|
|
@@ -40,7 +40,7 @@ agents:
|
||||||
system_prompt: agents/debugger.md
|
system_prompt: agents/debugger.md
|
||||||
|
|
||||||
pipeline:
|
pipeline:
|
||||||
max_task_retries: 3
|
max_task_retries: 6
|
||||||
stages:
|
stages:
|
||||||
- id: plan
|
- id: plan
|
||||||
type: agent
|
type: agent
|
||||||
|
|
@@ -195,7 +195,7 @@ agents:
|
||||||
system_prompt: .nightshift/agents/debugger.md
|
system_prompt: .nightshift/agents/debugger.md
|
||||||
|
|
||||||
pipeline:
|
pipeline:
|
||||||
max_task_retries: 3
|
max_task_retries: 6
|
||||||
continue_on_task_failure: false
|
continue_on_task_failure: false
|
||||||
stages:
|
stages:
|
||||||
- id: plan
|
- id: plan
|
||||||
|
|
|
||||||
|
|
@@ -17,7 +17,7 @@ class ConfigTests(unittest.TestCase):
|
||||||
|
|
||||||
self.assertEqual(config.project.name, "example-project")
|
self.assertEqual(config.project.name, "example-project")
|
||||||
self.assertIn("planner", config.agents)
|
self.assertIn("planner", config.agents)
|
||||||
self.assertEqual(config.pipeline.max_task_retries, 3)
|
self.assertEqual(config.pipeline.max_task_retries, 6)
|
||||||
self.assertEqual(config.pipeline.stages[0].id, "plan")
|
self.assertEqual(config.pipeline.stages[0].id, "plan")
|
||||||
|
|
||||||
def test_missing_required_section_fails_clearly(self) -> None:
|
def test_missing_required_section_fails_clearly(self) -> None:
|
||||||
|
|
@@ -86,7 +86,7 @@ class ConfigTests(unittest.TestCase):
|
||||||
config_path = root / "nightshift.yaml"
|
config_path = root / "nightshift.yaml"
|
||||||
config_path.write_text(
|
config_path.write_text(
|
||||||
config_path.read_text(encoding="utf-8").replace(
|
config_path.read_text(encoding="utf-8").replace(
|
||||||
"max_task_retries: 3",
|
"max_task_retries: 6",
|
||||||
"max_task_retries: three",
|
"max_task_retries: three",
|
||||||
),
|
),
|
||||||
encoding="utf-8",
|
encoding="utf-8",
|
||||||
|
|
|
||||||
|
|
@@ -61,7 +61,7 @@ class InitProjectTests(unittest.TestCase):
|
||||||
self.assertIn("tutorial-imageboard", available_templates())
|
self.assertIn("tutorial-imageboard", available_templates())
|
||||||
self.assertIn("tutorial-pastebin", available_templates())
|
self.assertIn("tutorial-pastebin", available_templates())
|
||||||
|
|
||||||
def test_init_pastebin_template_creates_app_and_model_fallback_config(self) -> None:
|
def test_init_pastebin_template_creates_skeleton_and_tdd_model_fallback_config(self) -> None:
|
||||||
with tempfile.TemporaryDirectory() as directory:
|
with tempfile.TemporaryDirectory() as directory:
|
||||||
root = Path(directory)
|
root = Path(directory)
|
||||||
|
|
||||||
|
|
@@ -69,9 +69,14 @@ class InitProjectTests(unittest.TestCase):
|
||||||
|
|
||||||
config = (root / "nightshift.yaml").read_text(encoding="utf-8")
|
config = (root / "nightshift.yaml").read_text(encoding="utf-8")
|
||||||
self.assertTrue((root / ".nightshift" / "tasks.md").exists())
|
self.assertTrue((root / ".nightshift" / "tasks.md").exists())
|
||||||
|
self.assertTrue((root / ".nightshift" / "agents" / "test-writer.md").exists())
|
||||||
self.assertTrue((root / "src" / "pastebin_app" / "app.py").exists())
|
self.assertTrue((root / "src" / "pastebin_app" / "app.py").exists())
|
||||||
self.assertTrue((root / "tests" / "test_pastebin.py").exists())
|
self.assertTrue((root / "tests" / ".gitkeep").exists())
|
||||||
|
self.assertFalse((root / "tests" / "test_pastebin.py").exists())
|
||||||
self.assertIn("type: semantic_context", config)
|
self.assertIn("type: semantic_context", config)
|
||||||
|
self.assertIn("id: write_tests", config)
|
||||||
|
self.assertIn("id: review_tests", config)
|
||||||
|
self.assertIn("max_task_retries: 6", config)
|
||||||
self.assertIn("implementer_qwen", config)
|
self.assertIn("implementer_qwen", config)
|
||||||
self.assertIn("carstenuhlig/omnicoder-9b", config)
|
self.assertIn("carstenuhlig/omnicoder-9b", config)
|
||||||
self.assertIn("deepseek-coder-v2:16b", config)
|
self.assertIn("deepseek-coder-v2:16b", config)
|
||||||
|
|
|
||||||
|
|
@@ -78,6 +78,23 @@ class ReliabilityFeatureTests(unittest.TestCase):
|
||||||
self.assertTrue(decision.should_stop)
|
self.assertTrue(decision.should_stop)
|
||||||
self.assertIn("same failure signature", decision.reason)
|
self.assertIn("same failure signature", decision.reason)
|
||||||
|
|
||||||
|
def test_retry_churn_honors_configured_repeated_failure_threshold(self) -> None:
|
||||||
|
entries = tuple(
|
||||||
|
RetryMemoryEntry(
|
||||||
|
attempt=attempt,
|
||||||
|
stage_id="test",
|
||||||
|
status="fail",
|
||||||
|
cause="Command exited with code 1: python -m pytest -q",
|
||||||
|
next_stage="implement",
|
||||||
|
failure_signature="NameError | src/pastebin_app/app.py | 31 | python -m pytest -q",
|
||||||
|
)
|
||||||
|
for attempt in range(1, 4)
|
||||||
|
)
|
||||||
|
|
||||||
|
decision = evaluate_retry_churn(entries, retry_budget=7, repeated_signature_after=6)
|
||||||
|
|
||||||
|
self.assertFalse(decision.should_stop)
|
||||||
|
|
||||||
def test_build_failure_signature_prefers_project_traceback_over_pytest_cache(self) -> None:
|
def test_build_failure_signature_prefers_project_traceback_over_pytest_cache(self) -> None:
|
||||||
signature = build_failure_signature(
|
signature = build_failure_signature(
|
||||||
"\n".join(
|
"\n".join(
|
||||||
|
|
|
||||||
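Note on the retry-churn hunk above: the new test expects three identical failures to be tolerated when the configured threshold is 6. A minimal sketch of one way such a check could behave is below. It is not the code in nightshift/escalation.py. `Attempt`, `ChurnDecision`, and `evaluate_churn_sketch` are hypothetical names for illustration, and the "budget exhausted" rule is an assumption; only the `retry_budget` and `repeated_signature_after` parameter names, the `should_stop` and `reason` fields, and the sample signature string are taken from the diff.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    """Hypothetical stand-in for a retry-memory entry."""
    attempt: int
    stage_id: str
    failure_signature: str


@dataclass(frozen=True)
class ChurnDecision:
    """Hypothetical result type: whether to stop retrying, and why."""
    should_stop: bool
    reason: str = ""


def evaluate_churn_sketch(
    entries: Sequence[Attempt],
    retry_budget: int,
    repeated_signature_after: int,
) -> ChurnDecision:
    # Assumed rule: stop once the number of attempts reaches the budget.
    if len(entries) >= retry_budget:
        return ChurnDecision(True, "retry budget exhausted")
    if not entries:
        return ChurnDecision(False)

    # Count how many of the most recent attempts share the latest signature.
    latest = entries[-1].failure_signature
    streak = 0
    for entry in reversed(entries):
        if entry.failure_signature != latest:
            break
        streak += 1

    # Compare against the configured threshold rather than a fixed constant.
    if streak >= repeated_signature_after:
        return ChurnDecision(True, f"same failure signature repeated {streak} times")
    return ChurnDecision(False)


signature = "NameError | src/pastebin_app/app.py | 31 | python -m pytest -q"
three = [Attempt(n, "test", signature) for n in range(1, 4)]
six = [Attempt(n, "test", signature) for n in range(1, 7)]

lenient = evaluate_churn_sketch(three, retry_budget=7, repeated_signature_after=6)
strict = evaluate_churn_sketch(three, retry_budget=7, repeated_signature_after=3)
at_threshold = evaluate_churn_sketch(six, retry_budget=7, repeated_signature_after=6)
```

With these inputs, `lenient` does not stop (three repeats against a threshold of 6), while `strict` and `at_threshold` both stop on the repeated signature. The point of the sketch is only that the threshold is a parameter, so a project that raises its retry settings is not cut off after a fixed three failures.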