nightshift/docs/codex/20260520-203827.md

# NightShift Integration Failure Analysis

## Immediate Causes

I would separate the failures into four buckets:

1. The pastebin template is not truly incremental.
   `tests/test_pastebin.py` already tests listing/filtering and expiration, even though `TASK-001` only asks for create/view. The stock app also already has a fairly complete `create_app` implementation. So the task is not "build feature 1"; it is "modify an already-complete app without breaking future-task behavior."

2. The retry stop policy is harsher than the config implies.
   Even with `stop_on_repeated_failure_signature_after: 6`, `nightshift/escalation.py` stops unconditionally once the last 3 entries share the same stage and cause. That explains why the run stopped with "same stage same reason" before reaching the configured repeated-signature threshold.

3. The model got bad or insufficient context early.
   In the run artifacts, the planner asked for `app/models.py` and `app/routes.py`, both outside the actual scoped repo. That pushed it toward a hallucinated Flask/SQLAlchemy architecture. Later repairs added `tests/test_snippets.py` importing nonexistent `app`, then tried to repair by deleting large amounts of code, which patch validation correctly rejected.

4. The template and manual deletion created contradictory state.
   In the latest project, `src/pastebin_app/__init__.py` imports `create_app`, but `src/pastebin_app/app.py` no longer defines it. `tests/test_pastebin.py` is now empty, while generated `tests/test_snippets.py` expects a different app shape. That is exactly the kind of broken intermediate state a local model will churn on unless the orchestrator gives it a very explicit recovery path.
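
To make bucket 2 concrete, here is a minimal sketch of a stop check that defers entirely to the configured threshold. It is not the code in `nightshift/escalation.py`; the `Attempt` type, its fields, and both function names are hypothetical, and the threshold is assumed to come from `stop_on_repeated_failure_signature_after`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One failed attempt. Field names are illustrative, not NightShift's."""
    stage: str
    cause: str


def trailing_repeat_count(history: list[Attempt]) -> int:
    """Count how many of the most recent attempts share the last signature."""
    if not history:
        return 0
    last = history[-1]
    count = 0
    for attempt in reversed(history):
        if attempt != last:
            break
        count += 1
    return count


def should_stop(history: list[Attempt], threshold: int) -> bool:
    """Stop only when the configured threshold is reached, never earlier."""
    return threshold > 0 and trailing_repeat_count(history) >= threshold
```

With a threshold of 6, three identical trailing failures would not stop the run under this sketch; the hard-coded 3-entry rule is what produces the earlier stop.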

## On Pre-Generated Code

I agree with your instinct: for this tutorial, pre-generated app code is hurting more than helping.

A better template would include:

- `pyproject.toml`
- package directories and empty `__init__.py`
- minimal templates if the task needs HTML later
- no complete app logic
- no future-task tests active during `TASK-001`
- a small `tests/test_task001.py` for only create/view

Then `TASK-002` adds list/filter tests, `TASK-003` adds expiration tests, etc. The AI should build forward, not preserve a hidden completed app.
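
One way the orchestrator could keep future-task tests inactive is to select test files by task number. The sketch below assumes a hypothetical naming scheme in which `tests/test_taskNNN.py` belongs to `TASK-NNN`; only `tests/test_task001.py` is named above, and the `select_tests` helper and its `include_earlier` flag are illustrative.

```python
import re

# Hypothetical naming scheme: tests/test_task001.py belongs to TASK-001.
TEST_FILE = re.compile(r"test_task(\d+)\.py$")


def select_tests(task_id: str, available: list[str], include_earlier: bool = False) -> list[str]:
    """Pick the test files that should be active while working on task_id."""
    current = int(task_id.rsplit("-", 1)[1])
    selected = []
    for name in sorted(available):
        match = TEST_FILE.search(name)
        if match is None:
            continue  # not a per-task test file, so never active during a task
        number = int(match.group(1))
        if number == current or (include_earlier and number < current):
            selected.append(name)
    return selected
```

The returned paths can be passed to `pytest` as positional arguments. Whether earlier tasks' tests should also run during a task is a design choice; leaving them out keeps failures focused, while including them catches regressions sooner.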

## Why Claude/Codex Feel Different

Production coding agents usually have an inner loop:

- inspect files
- edit narrowly
- run targeted tests
- read exact failure
- inspect more files
- edit again
- rerun

NightShift currently has a coarser loop: generate one patch, normalize, apply, run tests, summarize, retry. That is auditable, but it means each retry is another sampled patch rather than an interactive repair session. Swapping models does not fix bad task shape, bad context, or contradictory repo state.

## Best Options

Option A: fix the current design conservatively.

- Remove pre-generated pastebin app logic.
- Split tests by task.
- Run only task-relevant tests during the task, then full suite after success.
- Move the deterministic repo-context step ahead of planning, or at least always include the file tree plus the full contents of likely target files.
- Make churn stopping obey config; do not hard-stop after 3 same-stage failures unless configured.
- Improve retry signatures to ignore pytest cache warnings and prefer project traceback lines.
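
The last bullet could look roughly like the sketch below. It assumes test output contains standard Python traceback frames (for example from `pytest --tb=native`) and that project code lives under `src/` or `tests/`; the `failure_signature` function and its parameters are hypothetical, not NightShift's existing API.

```python
import re

# Lines treated as noise for retry signatures, e.g. pytest cache warnings.
NOISE = re.compile(r"PytestCacheWarning|\.pytest_cache")
# A standard Python traceback frame: File "path", line N, in func
FRAME = re.compile(r'^\s*File "(?P<path>[^"]+)", line \d+, in (?P<func>\S+)')
# The final exception line, e.g. "ImportError: cannot import name ..."
ERROR = re.compile(r"^[\w.]*(Error|Exception)\b")


def failure_signature(output: str, project_dirs: tuple[str, ...] = ("src/", "tests/")) -> str:
    """Build a retry signature that prefers project frames over noise."""
    lines = [ln for ln in output.splitlines() if ln.strip() and not NOISE.search(ln)]

    # Keep the innermost frame whose path contains a project directory.
    project_frame = ""
    for ln in lines:
        m = FRAME.match(ln)
        if m and any(f"/{d}" in f"/{m.group('path')}" for d in project_dirs):
            # The line number is dropped so unrelated edits do not change the signature.
            project_frame = f"{m.group('path')}:{m.group('func')}"

    # Use the last line that looks like an exception summary.
    error = next((ln.strip() for ln in reversed(lines) if ERROR.match(ln.strip())), "")

    parts = [p for p in (project_frame, error) if p]
    if parts:
        return " | ".join(parts)
    return lines[-1].strip() if lines else ""
```

Two runs that differ only in a cache warning or in a shifted line number would then map to the same signature, so the repeat counter reflects the actual failure rather than incidental output.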

Option B: add a real repair micro-loop.

For command/test failures, run a bounded repair loop before consuming another global retry:

```text
failure -> classify -> inspect exact files -> produce small patch -> run targeted test -> repeat 2-4 times
```

That would make NightShift behave more like Codex/Claude while preserving artifacts.
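
A minimal sketch of such a bounded loop follows. The two callables stand in for NightShift's targeted test runner and its classify/inspect/patch step; their names and signatures, and the `RepairResult` type, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RepairResult:
    fixed: bool
    rounds_used: int
    failures: list[str] = field(default_factory=list)  # kept as an audit trail


def repair_micro_loop(
    run_test: Callable[[], Optional[str]],
    propose_and_apply_patch: Callable[[str], None],
    max_rounds: int = 3,
) -> RepairResult:
    """Run up to max_rounds patch/test rounds for a single failure.

    run_test returns None on success, or the exact failure text otherwise.
    propose_and_apply_patch receives that failure text and edits the repo.
    """
    failures: list[str] = []
    rounds = 0
    failure = run_test()
    while failure is not None and rounds < max_rounds:
        failures.append(failure)
        propose_and_apply_patch(failure)
        rounds += 1
        failure = run_test()
    if failure is not None:
        failures.append(failure)  # record the failure that ended the loop
    return RepairResult(fixed=failure is None, rounds_used=rounds, failures=failures)
```

Only when `fixed` is false would the orchestrator consume another global retry, and the recorded `failures` list can be written out alongside the existing run artifacts.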

Option C: delegate hard repairs to production agent backends.

Add a `codex`/`claude-code` backend stage for implementation/repair. NightShift still owns task selection, safety, artifacts, tests, and reports, but lets a stronger tool run the inner edit/test loop.

My recommendation: do A first, then B. The template/task mismatch is the largest avoidable failure source, and the unconditional churn stop is a real policy bug. Once those are fixed, the remaining failures will be much more informative.