mirror of https://github.com/khodges42/nightShift.git synced 2026-06-14 10:08:37 +00:00

K. Hodges 3bb5bd4157 Fixes based on tests, do tdd

Changed the pastebin tutorial so it now starts skeletal: no prebuilt Flask behavior, no pre-generated task tests, and .gitkeep placeholders under templates/ and tests/. The new pipeline in nightshift/project_templates/tutorial-pastebin/nightshift.yaml:1 now runs:

  plan -> semantic_context -> context -> write_tests -> review_tests -> implement -> pytest -> review
  Added nightshift/project_templates/tutorial-pastebin/.nightshift/agents/test-writer.md:1, tightened the planner/implementer/reviewer/debugger prompts, mirrored the pipeline docs/example,
  and raised default retries to 6 for the basic starter plus pastebin.

  I also fixed the retry policy issue in nightshift/escalation.py:17 and nightshift/pipeline.py:251: configured repeated-failure thresholds are now respected instead of hard-stopping
  early after three same-stage/same-cause failures. Non-implementation file_writer stages now get stage-specific retry artifacts so test generation does not collide with implementation
  repair artifacts.

2026-05-20 21:51:40 -07:00

4.0 KiB


NightShift Integration Failure Analysis

Immediate Causes

I would separate the failures into four buckets:

The pastebin template is not truly incremental. tests/test_pastebin.py already tests listing/filtering and expiration, even though TASK-001 only asks for create/view. The stock app also already has a fairly complete create_app implementation. So the task is not "build feature 1"; it is "modify an already-complete app without breaking future-task behavior."
The retry stop policy is harsher than the config implies. Even with stop_on_repeated_failure_signature_after: 6, nightshift/escalation.py stops unconditionally once the last 3 entries share the same stage and cause. That explains the "same stage same reason" stop before the configured repeated-signature threshold was reached.
The model got bad or insufficient context early. In the run artifacts, the planner asked for app/models.py and app/routes.py, neither of which exists in the scoped repo. That pushed it toward a hallucinated Flask/SQLAlchemy architecture. Later repairs added tests/test_snippets.py, which imports a nonexistent app module, and then tried to repair by deleting large amounts of code, which patch validation correctly rejected.
The template and manual deletion created contradictory state. In the latest project, src/pastebin_app/__init__.py imports create_app, but src/pastebin_app/app.py no longer defines it. tests/test_pastebin.py is now empty, while generated tests/test_snippets.py expects a different app shape. That is exactly the kind of broken intermediate state a local model will churn on unless the orchestrator gives it a very explicit recovery path.
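
The stop-policy mismatch in the second bucket can be illustrated with a minimal sketch. The names `FailureEntry`, `should_stop`, and `threshold` are hypothetical and are not taken from nightshift/escalation.py; only the config key stop_on_repeated_failure_signature_after comes from the analysis above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEntry:
    """Hypothetical record of one failed attempt."""
    stage: str
    cause: str


def should_stop(history: list[FailureEntry], threshold: int) -> bool:
    """Stop only when the last `threshold` failures share stage and cause.

    `threshold` stands in for stop_on_repeated_failure_signature_after.
    Passing a fixed 3 reproduces the early stop described above.
    """
    if threshold <= 0 or len(history) < threshold:
        return False
    tail = history[-threshold:]
    return len({(entry.stage, entry.cause) for entry in tail}) == 1


# Three identical failures in a row.
history = [FailureEntry("implement", "ImportError")] * 3
stops_with_fixed_three = should_stop(history, threshold=3)
stops_with_configured_six = should_stop(history, threshold=6)
```

With three identical failures, the fixed threshold of 3 stops the run while the configured threshold of 6 lets it continue, which is the behavior the config implies.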

On Pre-Generated Code

I agree with your instinct: for this tutorial, pre-generated app code is hurting more than helping.

A better template would include:

pyproject.toml
package directories and empty __init__.py
minimal templates if the task needs HTML later
no complete app logic
no future-task tests active during TASK-001
a small tests/test_task001.py for only create/view

Then TASK-002 adds list/filter tests, TASK-003 adds expiration tests, etc. The AI should build forward, not preserve a hidden completed app.
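
One way to express that build-forward test layout is a small lookup from task to test file. This is a sketch under assumptions: apart from tests/test_task001.py, the file names and the `pytest_args` helper are hypothetical and do not exist in NightShift.

```python
# Hypothetical mapping from task ID to the test file that task introduces.
TASK_TESTS = {
    "TASK-001": "tests/test_task001.py",  # create/view
    "TASK-002": "tests/test_task002.py",  # list/filter (assumed file name)
    "TASK-003": "tests/test_task003.py",  # expiration (assumed file name)
}


def pytest_args(task_id: str, *, cumulative: bool = False) -> list[str]:
    """Return pytest arguments for a task.

    By default only the task's own test file is selected. With
    cumulative=True, every file up to and including the task is selected,
    so earlier behavior is re-checked once the task passes.
    """
    if task_id not in TASK_TESTS:
        raise KeyError(f"unknown task: {task_id}")
    if not cumulative:
        return ["-q", TASK_TESTS[task_id]]
    ordered = sorted(TASK_TESTS)  # zero-padded IDs sort in task order
    included = ordered[: ordered.index(task_id) + 1]
    return ["-q", *(TASK_TESTS[t] for t in included)]
```

The returned list can be passed to `pytest.main` or appended to a `python -m pytest` command line; tests for later tasks never run before the task that introduces them.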

Why Claude/Codex Feel Different

Production coding agents usually have an inner loop:

inspect files
edit narrowly
run targeted tests
read exact failure
inspect more files
edit again
rerun

NightShift currently has a coarser loop: generate one patch, normalize, apply, run tests, summarize, retry. That is auditable, but it means each retry is another sampled patch rather than an interactive repair session. Swapping models does not fix bad task shape, bad context, or contradictory repo state.

Best Options

Option A: fix the current design conservatively.

Remove pre-generated pastebin app logic.
Split tests by task.
Run only task-relevant tests during the task, then full suite after success.
Move deterministic repo context before planning, or at least always include file tree plus full contents of likely target files.
Make churn stopping obey config; do not hard-stop after 3 same-stage failures unless configured.
Improve retry signatures to ignore pytest cache warnings and prefer project traceback lines.
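
The last item, retry signatures that ignore cache warnings and prefer project traceback lines, could look roughly like the sketch below. The function name, the `project_roots` default, and the hashing scheme are assumptions for illustration; NightShift's actual signature logic is not shown in this analysis.

```python
import hashlib
import re

# Lines about pytest's cache directory say nothing about the real failure.
NOISE = re.compile(r"PytestCacheWarning|\.pytest_cache")
# pytest's default traceback style reports frames as "path.py:LINE: ...".
FRAME = re.compile(r"^(?P<path>[^\s:]+\.py):\d+:")


def failure_signature(
    output: str, project_roots: tuple[str, ...] = ("src/", "tests/")
) -> str:
    """Hash the parts of test output that identify the failure.

    Cache warnings are dropped, frames inside the project are preferred,
    and line numbers are ignored so unrelated edits do not change the hash.
    """
    lines = [
        ln.strip()
        for ln in output.splitlines()
        if ln.strip() and not NOISE.search(ln)
    ]
    project_frames = [
        m.group("path")
        for ln in lines
        if (m := FRAME.match(ln)) and m.group("path").startswith(project_roots)
    ]
    error_lines = [ln for ln in lines if ln.startswith("E ")]
    if project_frames:
        basis = project_frames + error_lines[:1]
    else:
        # No project frame found: fall back to the first error line,
        # or the last remaining line of output.
        basis = error_lines[:1] or lines[-1:]
    return hashlib.sha1("\n".join(basis).encode()).hexdigest()[:12]
```

Two runs that differ only by a cache warning then produce the same signature, so the warning cannot make a repeated failure look new or an unrelated failure look repeated.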

Option B: add a real repair micro-loop.

For command/test failures, run a bounded repair loop before consuming another global retry:

failure -> classify -> inspect exact files -> produce small patch -> run targeted test -> repeat 2-4 times

That would make NightShift behave more like Codex/Claude while preserving artifacts.
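
A bounded version of that loop can be sketched as follows. Every name here (`CheckResult`, `repair_micro_loop`, and the three callables) is a placeholder for whatever the orchestrator would supply; none of it is existing NightShift code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Outcome of one targeted test run."""
    passed: bool
    output: str


def repair_micro_loop(
    run_test: Callable[[], CheckResult],
    propose_patch: Callable[[str], Optional[str]],
    apply_patch: Callable[[str], None],
    max_rounds: int = 3,
) -> tuple[bool, list[str]]:
    """Run failure -> patch -> targeted test for at most max_rounds rounds.

    Returns (success, artifacts). artifacts records every applied patch so
    the inner loop stays auditable like the outer pipeline.
    """
    artifacts: list[str] = []
    result = run_test()
    for _ in range(max_rounds):
        if result.passed:
            return True, artifacts
        patch = propose_patch(result.output)
        if patch is None:  # nothing usable was proposed; give up early
            break
        apply_patch(patch)
        artifacts.append(patch)
        result = run_test()
    return result.passed, artifacts
```

Only when this loop returns `False` would the pipeline consume another global retry, which keeps the retry budget for failures that a few small patches cannot fix.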

Option C: delegate hard repairs to production agent backends.

Add a codex/claude-code backend stage for implementation/repair. NightShift still owns task selection, safety, artifacts, tests, and reports, but lets a stronger tool run the inner edit/test loop.

My recommendation: do A first, then B. The template/task mismatch is the largest avoidable failure source, and the unconditional churn stop is a real policy bug. Once those are fixed, the remaining failures will be much more informative.
