Changed the pastebin tutorial so it now starts skeletal: no prebuilt Flask behavior, no pre-generated task tests, and .gitkeep placeholders under templates/ and tests/. The new pipeline in nightshift/project_templates/tutorial-pastebin/nightshift.yaml:1 now runs:
plan -> semantic_context -> context -> write_tests -> review_tests -> implement -> pytest -> review
Added nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md:1, tightened the planner/implementer/reviewer/debugger prompts, mirrored the pipeline docs/example, and raised default retries to 6 for the basic starter plus pastebin.
I also fixed the retry policy issue in nightshift/escalation.py:17 and nightshift/pipeline.py:251: configured repeated-failure thresholds are now respected instead of hard-stopping early after three same-stage/same-cause failures. Non-implementation file_writer stages now get stage-specific retry artifacts, so test generation does not collide with implementation repair artifacts.
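As a rough illustration of the threshold fix, here is a minimal sketch of a churn check that takes its limit from configuration. The names FailureRecord and should_stop_for_churn are hypothetical and are not taken from nightshift/escalation.py; the only assumption is that each failed attempt records a stage and a cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    """One failed attempt. Field names are illustrative, not NightShift's."""
    stage: str
    cause: str


def should_stop_for_churn(history: list, threshold: int) -> bool:
    """Return True once the last `threshold` failures share a stage and cause.

    `threshold` stands in for the configured
    stop_on_repeated_failure_signature_after value; nothing is hard-coded to 3.
    """
    if threshold <= 0 or len(history) < threshold:
        return False
    recent = history[-threshold:]
    return len({(r.stage, r.cause) for r in recent}) == 1


three_failures = [FailureRecord("pytest", "ImportError")] * 3
print(should_stop_for_churn(three_failures, threshold=6))      # False
print(should_stop_for_churn(three_failures * 2, threshold=6))  # True
```

With a configured value of 6, three identical failures no longer end the run; six do.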
NightShift Integration Failure Analysis
Immediate Causes
I would separate the failures into four buckets:
- The pastebin template is not truly incremental. tests/test_pastebin.py already tests listing/filtering and expiration, even though TASK-001 only asks for create/view. The stock app also already has a fairly complete create_app implementation. So the task is not "build feature 1"; it is "modify an already-complete app without breaking future-task behavior."
- The retry stop policy is harsher than the config implies. Even with stop_on_repeated_failure_signature_after: 6, nightshift/escalation.py unconditionally stops after the last 3 entries have the same stage and cause. That explains the "same stage same reason" stop before the configured repeated-signature threshold.
- The model got bad or insufficient context early. In the run artifacts, the planner asked for app/models.py and app/routes.py, both outside the actual scoped repo. That pushed it toward a hallucinated Flask/SQLAlchemy architecture. Later repairs added tests/test_snippets.py importing nonexistent app, then tried to repair by deleting large amounts of code, which patch validation correctly rejected.
- The template and manual deletion created contradictory state. In the latest project, src/pastebin_app/__init__.py imports create_app, but src/pastebin_app/app.py no longer defines it. tests/test_pastebin.py is now empty, while generated tests/test_snippets.py expects a different app shape. That is exactly the kind of broken intermediate state a local model will churn on unless the orchestrator gives it a very explicit recovery path.
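One way to catch the third bucket early is to check planner-requested paths against the real repo before they reach the context stage. The sketch below is only an illustration: partition_requested_paths is a hypothetical helper, not an existing NightShift function, and it assumes the planner emits repo-relative paths.

```python
import tempfile
from pathlib import Path


def partition_requested_paths(repo_root: Path, requested: list) -> tuple:
    """Split planner-requested paths into (found in repo, missing or out of scope)."""
    root = repo_root.resolve()
    found, rejected = [], []
    for rel in requested:
        candidate = (root / rel).resolve()
        # Accept only real files that stay inside the repo root.
        if candidate.is_file() and candidate.is_relative_to(root):
            found.append(rel)
        else:
            rejected.append(rel)
    return found, rejected


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "src" / "pastebin_app").mkdir(parents=True)
    (repo / "src" / "pastebin_app" / "app.py").write_text("")
    found, rejected = partition_requested_paths(
        repo, ["src/pastebin_app/app.py", "app/models.py", "app/routes.py"]
    )
    print(found)     # ['src/pastebin_app/app.py']
    print(rejected)  # ['app/models.py', 'app/routes.py']
```

A non-empty rejected list could be fed back to the planner as an explicit "these files do not exist" note instead of letting it invent an architecture around them.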
On Pre-Generated Code
I agree with your instinct: for this tutorial, pre-generated app code is hurting more than helping.
A better template would include:
- pyproject.toml
- package directories and empty __init__.py
- minimal templates if the task needs HTML later
- no complete app logic
- no future-task tests active during TASK-001
- a small tests/test_task001.py for only create/view
Then TASK-002 adds list/filter tests, TASK-003 adds expiration tests, etc. The AI should build forward, not preserve a hidden completed app.
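A minimal sketch of how per-task tests could be selected, assuming one test file per task. Only tests/test_task001.py is named above; the other file names and the pytest_command helper are hypothetical.

```python
import sys

# Ordered task -> test files. Only tests/test_task001.py comes from the text;
# the TASK-002 and TASK-003 file names are assumptions for illustration.
TASK_TESTS = {
    "TASK-001": ["tests/test_task001.py"],
    "TASK-002": ["tests/test_task002.py"],
    "TASK-003": ["tests/test_task003.py"],
}


def pytest_command(task_id: str, full_suite: bool = False) -> list:
    """Build a pytest command for one task, or for every task up to and including it."""
    task_ids = list(TASK_TESTS)
    if task_id not in TASK_TESTS:
        raise KeyError(f"unknown task: {task_id}")
    if full_suite:
        selected = task_ids[: task_ids.index(task_id) + 1]
    else:
        selected = [task_id]
    files = [path for tid in selected for path in TASK_TESTS[tid]]
    return [sys.executable, "-m", "pytest", "-q", *files]


print(pytest_command("TASK-002")[3:])
# ['-q', 'tests/test_task002.py']
print(pytest_command("TASK-002", full_suite=True)[3:])
# ['-q', 'tests/test_task001.py', 'tests/test_task002.py']
```

During a task only that task's file runs; after success, the cumulative suite confirms earlier tasks still pass.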
Why Claude/Codex Feel Different
Production coding agents usually have an inner loop:
- inspect files
- edit narrowly
- run targeted tests
- read exact failure
- inspect more files
- edit again
- rerun
NightShift currently has a coarser loop: generate one patch, normalize, apply, run tests, summarize, retry. That is auditable, but it means each retry is another sampled patch rather than an interactive repair session. Swapping models does not fix bad task shape, bad context, or contradictory repo state.
Best Options
Option A: fix the current design conservatively.
- Remove pre-generated pastebin app logic.
- Split tests by task.
- Run only task-relevant tests during the task, then full suite after success.
- Move deterministic repo context before planning, or at least always include file tree plus full contents of likely target files.
- Make churn stopping obey config; do not hard-stop after 3 same-stage failures unless configured.
- Improve retry signatures to ignore pytest cache warnings and prefer project traceback lines.
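The last item in Option A could look roughly like the sketch below. It assumes project code lives under src/ or tests/ and that failure output resembles pytest's short traceback format; failure_signature and the two constants are hypothetical names, not NightShift's current implementation.

```python
import re

# Assumed layout: project code lives under src/ or tests/.
PROJECT_FRAME = re.compile(r"^(?:src|tests)/\S+\.py:\d+")
# Lines mentioning these are treated as noise rather than as the failure cause.
NOISE_MARKERS = ("PytestCacheWarning", ".pytest_cache")


def failure_signature(output: str) -> str:
    """Reduce test output to a short signature built from project-relevant lines."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    lines = [ln for ln in lines if not any(m in ln for m in NOISE_MARKERS)]
    # Prefer traceback frames that point into project files.
    frames = [ln for ln in lines if PROJECT_FRAME.match(ln)]
    # pytest prefixes assertion and exception lines with "E".
    errors = [re.sub(r"^E\s+", "", ln) for ln in lines if ln.startswith("E ")]
    chosen = frames[-1:] + errors[:1]
    if not chosen:
        chosen = lines[-1:]  # fall back to the last remaining line
    return " | ".join(chosen)


noisy = """\
/usr/lib/python3/site-packages/_pytest/cacheprovider.py:475: PytestCacheWarning: could not create cache path
tests/test_snippets.py:1: in <module>
    from app import create_app
E   ModuleNotFoundError: No module named 'app'
"""
print(failure_signature(noisy))
# tests/test_snippets.py:1: in <module> | ModuleNotFoundError: No module named 'app'
```

Two runs that differ only in cache warnings then produce the same signature, so the repeated-failure counter tracks the real cause.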
Option B: add a real repair micro-loop.
For command/test failures, run a bounded repair loop before consuming another global retry:
failure -> classify -> inspect exact files -> produce small patch -> run targeted test -> repeat 2-4 times
That would make NightShift behave more like Codex/Claude while preserving artifacts.
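A bounded version of that loop might be sketched as follows. The three callables are placeholders for whatever NightShift would supply; the classify and inspect steps are folded into propose_patch here, and none of these names exist in the current codebase.

```python
from typing import Callable, Optional


def repair_micro_loop(
    run_test: Callable[[], Optional[str]],
    propose_patch: Callable[[str], str],
    apply_patch: Callable[[str], None],
    max_rounds: int = 3,
) -> tuple:
    """Attempt up to max_rounds small repairs; return (passed, failure log).

    run_test returns None on success or a failure description otherwise.
    """
    failures = []
    for _ in range(max_rounds):
        failure = run_test()
        if failure is None:
            return True, failures
        failures.append(failure)
        apply_patch(propose_patch(failure))
    # One last targeted run to see whether the final patch worked.
    final = run_test()
    if final is not None:
        failures.append(final)
    return final is None, failures


# Toy stand-ins: each "patch" fixes exactly one failing assertion.
state = {"bugs": 2}


def run_test() -> Optional[str]:
    return None if state["bugs"] == 0 else f"{state['bugs']} assertions failing"


def propose_patch(failure: str) -> str:
    return f"small patch for: {failure}"


def apply_patch(patch: str) -> None:
    state["bugs"] -= 1


print(repair_micro_loop(run_test, propose_patch, apply_patch))
# (True, ['2 assertions failing', '1 assertions failing'])
```

Only when the loop returns False would the orchestrator consume another global retry, and the failure log can be stored as an artifact either way.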
Option C: delegate hard repairs to production agent backends.
Add a codex/claude-code backend stage for implementation/repair. NightShift still owns task selection, safety, artifacts, tests, and reports, but lets a stronger tool run the inner edit/test loop.
My recommendation: do A first, then B. The template/task mismatch is the largest avoidable failure source, and the unconditional churn stop is a real policy bug. Once those are fixed, the remaining failures will be much more informative.