nightshift/docs/codex/20260520-203827.md
K. Hodges 3bb5bd4157 Fixes based on tests, do tdd
Changed the pastebin tutorial so it now starts skeletal: no prebuilt Flask behavior, no pre-generated task tests, and .gitkeep placeholders under templates/ and tests/. The new pipeline in nightshift/project_templates/tutorial-pastebin/nightshift.yaml:1 now runs:

  plan -> semantic_context -> context -> write_tests -> review_tests -> implement -> pytest -> review
  Added nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md:1, tightened the planner/implementer/reviewer/debugger prompts, mirrored the pipeline docs/example,
  and raised default retries to 6 for the basic starter plus pastebin.

  I also fixed the retry policy issue in nightshift/escalation.py:17 and nightshift/pipeline.py:251: configured repeated-failure thresholds are now respected instead of hard-stopping
  early after three same-stage/same-cause failures. Non-implementation file_writer stages now get stage-specific retry artifacts so test generation does not collide with implementation
  repair artifacts.
2026-05-20 21:51:40 -07:00


# NightShift Integration Failure Analysis
## Immediate Causes
I would separate the failures into four buckets:
1. The pastebin template is not truly incremental.
`tests/test_pastebin.py` already tests listing/filtering and expiration, even though `TASK-001` only asks for create/view. The stock app also already has a fairly complete `create_app` implementation. So the task is not "build feature 1"; it is "modify an already-complete app without breaking future-task behavior."
2. The retry stop policy is harsher than the config implies.
Even with `stop_on_repeated_failure_signature_after: 6`, `nightshift/escalation.py` unconditionally stops after the last 3 entries have the same stage and cause. That explains the "same stage same reason" stop before the configured repeated-signature threshold.
3. The model got bad or insufficient context early.
In the run artifacts, the planner asked for `app/models.py` and `app/routes.py`, both outside the actual scoped repo. That pushed it toward a hallucinated Flask/SQLAlchemy architecture. Later repairs added `tests/test_snippets.py`, which imports a nonexistent `app` module, and then tried to recover by deleting large amounts of code, which patch validation correctly rejected.
4. The template and manual deletion created contradictory state.
In the latest project, `src/pastebin_app/__init__.py` imports `create_app`, but `src/pastebin_app/app.py` no longer defines it. `tests/test_pastebin.py` is now empty, while generated `tests/test_snippets.py` expects a different app shape. That is exactly the kind of broken intermediate state a local model will churn on unless the orchestrator gives it a very explicit recovery path.
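To make bucket 2 concrete, here is a minimal sketch of a stop check that honors the configured threshold instead of a fixed window of 3. The `Failure` record and the `should_stop` helper are hypothetical names for illustration; they are not taken from `nightshift/escalation.py`, and the real failure records may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """One failed attempt. Field names are illustrative assumptions."""
    stage: str
    cause: str


def should_stop(history: list[Failure], threshold: int) -> bool:
    """Stop only when the last `threshold` failures share stage and cause.

    `threshold` stands in for `stop_on_repeated_failure_signature_after`.
    A non-positive threshold disables the check.
    """
    if threshold <= 0 or len(history) < threshold:
        return False
    tail = history[-threshold:]
    return all(failure == tail[0] for failure in tail)


same = Failure(stage="implement", cause="pytest_failed")
print(should_stop([same] * 3, threshold=6))  # False: below the configured limit
print(should_stop([same] * 6, threshold=6))  # True: limit reached
```

With this shape, three identical failures no longer end the run when the config asks for six.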
## On Pre-Generated Code
I agree with your instinct: for this tutorial, pre-generated app code is hurting more than helping.
A better template would include:
- `pyproject.toml`
- package directories and empty `__init__.py`
- minimal templates if the task needs HTML later
- no complete app logic
- no future-task tests active during `TASK-001`
- a small `tests/test_task001.py` for only create/view
Then `TASK-002` adds list/filter tests, `TASK-003` adds expiration tests, etc. The AI should build forward, not preserve a hidden completed app.
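One way to wire per-task tests into the runner is sketched below. It assumes the `tests/test_task001.py` naming suggested above extends to every task ID, and the helper names `task_test_file` and `pytest_command` are hypothetical, not existing NightShift functions.

```python
from pathlib import Path


def task_test_file(task_id: str, tests_dir: Path) -> Path:
    """Map a task ID such as 'TASK-001' to tests/test_task001.py.

    The file-naming convention is an assumption for this sketch.
    """
    suffix = task_id.lower().replace("-", "")
    return tests_dir / f"test_{suffix}.py"


def pytest_command(task_id: str, tests_dir: Path, full_suite: bool = False) -> list[str]:
    """Build the pytest argv: one task file during the task, all tests after success."""
    if full_suite:
        return ["pytest", tests_dir.as_posix()]
    return ["pytest", task_test_file(task_id, tests_dir).as_posix()]


tests = Path("tests")
print(pytest_command("TASK-001", tests))                   # ['pytest', 'tests/test_task001.py']
print(pytest_command("TASK-001", tests, full_suite=True))  # ['pytest', 'tests']
```

Passing a file path to `pytest` limits collection to that file, so tests for later tasks cannot fail the current one.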
## Why Claude/Codex Feel Different
Production coding agents usually have an inner loop:
- inspect files
- edit narrowly
- run targeted tests
- read exact failure
- inspect more files
- edit again
- rerun
NightShift currently has a coarser loop: generate one patch, normalize, apply, run tests, summarize, retry. That is auditable, but it means each retry is another sampled patch rather than an interactive repair session. Swapping models does not fix bad task shape, bad context, or contradictory repo state.
## Best Options
Option A: fix the current design conservatively.
- Remove pre-generated pastebin app logic.
- Split tests by task.
- Run only task-relevant tests during the task, then full suite after success.
- Move deterministic repo context before planning, or at least always include file tree plus full contents of likely target files.
- Make churn stopping obey config; do not hard-stop after 3 same-stage failures unless configured.
- Improve retry signatures to ignore pytest cache warnings and prefer project traceback lines.
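The last bullet could look roughly like the sketch below: drop pytest cache warnings, then build the signature from lines that point into the project if any exist. The function name `failure_signature`, the `src/` project marker, and the 12-character hash length are assumptions for illustration, not NightShift's actual signature logic.

```python
import hashlib


def failure_signature(output: str, project_marker: str = "src/") -> str:
    """Return a short, stable signature for a test failure.

    Cache warnings are dropped so they cannot change the signature, and
    lines mentioning `project_marker` are preferred over framework noise.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Ignore pytest cache noise (PytestCacheWarning, .pytest_cache paths).
    lines = [
        line for line in lines
        if "PytestCacheWarning" not in line and ".pytest_cache" not in line
    ]
    # Prefer lines that reference project files; fall back to everything else.
    project_lines = [line for line in lines if project_marker in line]
    basis = project_lines or lines
    return hashlib.sha256("\n".join(basis).encode("utf-8")).hexdigest()[:12]


clean = "src/pastebin_app/app.py:12: NameError: name 'create_app' is not defined"
noisy = "PytestCacheWarning: could not create cache path /tmp/.pytest_cache\n" + clean
print(failure_signature(clean) == failure_signature(noisy))  # True
```

Two runs that fail for the same project-level reason then share a signature even when one of them also emitted a cache warning.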
Option B: add a real repair micro-loop.
For command/test failures, run a bounded repair loop before consuming another global retry:
```text
failure -> classify -> inspect exact files -> produce small patch -> run targeted test -> repeat 2-4 times
```
That would make NightShift behave more like Codex/Claude while preserving artifacts.
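A bounded version of that loop might be structured as follows. This is a sketch under stated assumptions: the three callables are hypothetical hooks, and the classify and inspect steps are folded into `propose_patch` to keep the example short.

```python
from typing import Callable, Optional


def repair_loop(
    run_test: Callable[[], Optional[str]],  # returns failure text, or None on pass
    propose_patch: Callable[[str], str],    # classify + inspect + draft a small patch
    apply_patch: Callable[[str], None],
    max_rounds: int = 4,
) -> bool:
    """Try up to `max_rounds` small repairs; return True once the test passes."""
    failure = run_test()
    for _ in range(max_rounds):
        if failure is None:
            return True
        apply_patch(propose_patch(failure))
        failure = run_test()
    return failure is None


# Toy stand-ins: each applied patch fixes one of two simulated bugs.
state = {"bugs": 2}
passed = repair_loop(
    run_test=lambda: None if state["bugs"] == 0 else f"{state['bugs']} failing",
    propose_patch=lambda failure: "fix one bug",
    apply_patch=lambda patch: state.update(bugs=state["bugs"] - 1),
)
print(passed)  # True
```

Only when this inner loop returns `False` would the orchestrator consume another global retry, and each round can still write its own artifact.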
Option C: delegate hard repairs to production agent backends.
Add a `codex`/`claude-code` backend stage for implementation/repair. NightShift still owns task selection, safety, artifacts, tests, and reports, but lets a stronger tool run the inner edit/test loop.
My recommendation: do A first, then B. The template/task mismatch is the largest avoidable failure source, and the unconditional churn stop is a real policy bug. Once those are fixed, the remaining failures will be much more informative.