Interject/nightshift

Fork 0

mirror of https://github.com/khodges42/nightShift.git synced 2026-06-14 10:08:37 +00:00

K. Hodges 6e03430a33 Recover usable drafts from failed runs

2026-05-22 02:55:26 -07:00

5.5 KiB

Raw Blame History

Ideas TODO

This file tracks open ideas only. Completed items should be removed after they land.

Priority scale:

P0: do next; directly improves current feedback loop
P1: important after the current loop is usable
P2: useful, but only after basics are stable
P3: defer or maybe reject

P1: Add Test Governance For Generated Tests

Use this only for generated-test mode. Do not put generated tests back into the default DeadDrop fixed-test pipeline yet.

The previous failures proved test-writing agents will:

edit app code
import nonexistent modules
require undeclared dependencies
inspect implementation internals
write tests for future behavior

Governance should be deterministic first, model-reviewed second.

Deterministic checks:

test-writing stage may only touch tests/
tests compile
tests import only allowed public interfaces
tests do not import undeclared dependencies
tests do not define Flask routes or app implementation
test names match current task id or current artifact
no future-task keywords unless accepted by current task acceptance criteria

Then optional model reviewer checks acceptance-criteria alignment.

P0: Preserve Good Drafts During Repair

When a generated file block contains useful allowed content plus disallowed or invalid extra content, avoid redrafting from scratch.

Possible behavior:

keep the allowed candidate file artifact
strip disallowed file blocks only when configured as safe for that stage
continue with validation for the allowed content
or ask the model for a minimal correction that preserves the accepted candidate

For writing workflows, preserving a good scene is more valuable than forcing a full retry.

P0: Remove Runtime Overrides For Custom Ollama Models

If a model is a tuned local Ollama model such as nightshift-writer or nightshift-base, prefer the Modelfile parameters unless the stage has a specific reason to override them.

Candidate config cleanup:

remove temperature
remove num_ctx
remove num_predict
remove stop if present

This avoids NightShift accidentally overriding tuned custom-model behavior.

P1: Improve `what-happened` For Model Runs

The report should identify usable intermediate work, not only final failure state.

Examples:

model produced a valid scene candidate
validation rejected extra state files
recover candidate from candidate-files/<stage>/index.md
retry output was invalid or too short
next recommended action

This should make failed creative-writing runs reviewable without manually reading every artifact.

P1: Add Stage-Specific Task Views

The same task may say both "write scene" and "update state", but those responsibilities belong to different stages.

Stage prompts should receive a filtered task view:

drafter sees only scene-writing criteria
state updater sees only durable state update criteria
reviewers see criteria relevant to their review role

This reduces prompt contradiction and makes deterministic stage rules easier for models to follow.

P1: Preserve Intra-Attempt Rerun Artifacts

When NightShift re-runs an agent inside the same stage attempt, do not overwrite the previous artifact.

Examples:

draft_scene-agent-output.md
draft_scene-agent-output-invalid-rerun-1.md
draft_scene-agent-output-1.md

This keeps the initial useful output visible even when strict rerun output is worse.

P1: Add A Writing-Mode Validator

Add deterministic checks for prose workflows:

scene file exists at requested path
scene word count is within configured range
drafter did not touch state files
state updater did not touch chapter prose
no TODOs, author notes, or bracket placeholders
optional checks for repeated headings or accidental prompt leakage

This should run before model review stages.

P2: Add A Test Analyzer Agent For TDD

Defer until generated tests are stable.

Possible pipeline:

write_tests -> validate_tests -> analyze_tests -> implement

Analyzer output should be concrete:

Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.

Do not modify:
- tests/test_task001.py

This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all DeadDrop tasks.

P2/P3: Add A Test Planner

Maybe, but defer.

This overlaps with:

planner
test analyzer
test governance

Too many planning-ish stages can make the pipeline bloated and contradictory.

If implemented later, keep it focused:

test_planner -> write_tests -> test_governance -> implement

For now, fold this idea into the future test governance/analyzer work.

P2: Add Run Comparison

Useful once comparing 14B vs 30B:

nightshift compare-runs --latest 5

Show:

model
task
retries
failure stage
final reason
runtime
token estimate

This should come after integ-test and integ-report.

P2: Add A Separate Multiagent/Fallback DeadDrop Experiment

Keep the default DeadDrop template boring and deterministic:

planner -> semantic_context -> context -> implement -> validate -> test -> review

If fallback is useful, put it in a separate experiment template, for example:

tutorial-deaddrop-multiagent

or:

examples/templates/multiagent-fallback

Reason:

fallback makes artifacts harder to reason about
model variability is bad while debugging pipeline behavior
the default template should remain the reliability harness

5.5 KiB Raw Blame History