nightshift/docs/ideas.md

7.7 KiB

Ideas TODO

This file tracks open ideas only. Completed items should be removed after they land.

Priority scale:

  • P0: do next; directly improves current feedback loop
  • P1: important after the current loop is usable
  • P2: useful, but only after basics are stable
  • P3: defer or maybe reject

P1: Add Test Governance For Generated Tests

Use this only for generated-test mode. Do not put generated tests back into the default DeadDrop fixed-test pipeline yet.

The previous failures proved test-writing agents will:

  • edit app code
  • import nonexistent modules
  • require undeclared dependencies
  • inspect implementation internals
  • write tests for future behavior

Governance should be deterministic first, model-reviewed second.

Deterministic checks:

  • test-writing stage may only touch tests/
  • tests compile
  • tests import only allowed public interfaces
  • tests do not import undeclared dependencies
  • tests do not define Flask routes or app implementation
  • test names match current task id or current artifact
  • no future-task keywords unless accepted by current task acceptance criteria

Then optional model reviewer checks acceptance-criteria alignment.

P0: Preserve Good Drafts During Repair

When a generated file block contains useful allowed content plus disallowed or invalid extra content, avoid redrafting from scratch.

Possible behavior:

  • keep the allowed candidate file artifact
  • strip disallowed file blocks only when configured as safe for that stage
  • continue with validation for the allowed content
  • or ask the model for a minimal correction that preserves the accepted candidate

For writing workflows, preserving a good scene is more valuable than forcing a full retry.

P0: Remove Runtime Overrides For Custom Ollama Models

If a model is a tuned local Ollama model such as nightshift-writer or nightshift-base, prefer the Modelfile parameters unless the stage has a specific reason to override them.

Candidate config cleanup:

  • remove temperature
  • remove num_ctx
  • remove num_predict
  • remove stop if present

This avoids NightShift accidentally overriding tuned custom-model behavior.

P1: Improve what-happened For Model Runs

The report should identify usable intermediate work, not only final failure state.

Examples:

  • model produced a valid scene candidate
  • validation rejected extra state files
  • recover candidate from candidate-files/<stage>/index.md
  • retry output was invalid or too short
  • next recommended action

This should make failed creative-writing runs reviewable without manually reading every artifact.

P1: Add Stage-Specific Task Views

The same task may say both "write scene" and "update state", but those responsibilities belong to different stages.

Stage prompts should receive a filtered task view:

  • drafter sees only scene-writing criteria
  • state updater sees only durable state update criteria
  • reviewers see criteria relevant to their review role

This reduces prompt contradiction and makes deterministic stage rules easier for models to follow.

P1: Preserve Intra-Attempt Rerun Artifacts

When NightShift re-runs an agent inside the same stage attempt, do not overwrite the previous artifact.

Examples:

  • draft_scene-agent-output.md
  • draft_scene-agent-output-invalid-rerun-1.md
  • draft_scene-agent-output-1.md

This keeps the initial useful output visible even when strict rerun output is worse.

P1: Classify Writing Review Failures For Repair Routing

The tutorial novel now has a short-term editor stage for review failures, but review failures should eventually be classified before routing.

Candidate classes:

  • local_edit: pronoun drift, small continuity issue, missing beat, light style correction
  • redraft: wrong premise, broken scene structure, impossible chronology, severe acceptance mismatch
  • escalate: ambiguous canon conflict or user preference needed

Route local_edit to the editor, redraft to the drafter, and escalate to a clear user-facing failure. Keep original draft and edited draft artifacts side by side for comparison.

P1: Separate Expressive Drafting From Corrective Editing

If writing quality feels lower after adding stricter correctness constraints, test a two-pass creative workflow:

free_draft -> corrective_edit -> review

The drafter would receive fewer mechanical constraints and focus on voice, scene energy, imagery, and character presence. A later editor pass would enforce:

  • output format
  • canonical pronouns
  • continuity details
  • acceptance criteria
  • minor cleanup

This may recover stronger prose while keeping correctness guarantees. Treat it as an experiment, not a default, because weaker local models may use the looser draft prompt as permission to ignore task boundaries.

P1: Add A Writing-Mode Validator

Add deterministic checks for prose workflows:

  • scene file exists at requested path
  • scene word count is within configured range
  • drafter did not touch state files
  • state updater did not touch chapter prose
  • no TODOs, author notes, or bracket placeholders
  • optional checks for repeated headings or accidental prompt leakage

This should run before model review stages.

P1: Use Structured State Events For Writing Workflows

Replace model-written full state-file rewrites with compact structured state events, then let NightShift deterministically merge them into durable files such as:

  • story/plot-state.md
  • story/characters.md
  • story/timeline.md
  • story/unresolved-threads.md

Candidate state updater output:

events:
  - file: story/plot-state.md
    section: Completed Scenes
    add:
      - SCENE-001 complete; Saint and Miette introduced.
  - file: story/unresolved-threads.md
    section: Open Threads
    add:
      - Saint depends emotionally on Miette and needs compute tokens to keep her present.

NightShift would validate allowed files/sections, reject unknown targets, and apply append/update operations deterministically. This avoids asking a writing model to rewrite entire durable state files after every scene.

P2: Add A Test Analyzer Agent For TDD

Defer until generated tests are stable.

Possible pipeline:

write_tests -> validate_tests -> analyze_tests -> implement

Analyzer output should be concrete:

Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.

Do not modify:
- tests/test_task001.py

This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all DeadDrop tasks.

P2/P3: Add A Test Planner

Maybe, but defer.

This overlaps with:

  • planner
  • test analyzer
  • test governance

Too many planning-ish stages can make the pipeline bloated and contradictory.

If implemented later, keep it focused:

test_planner -> write_tests -> test_governance -> implement

For now, fold this idea into the future test governance/analyzer work.

P2: Add Run Comparison

Useful once comparing 14B vs 30B:

nightshift compare-runs --latest 5

Show:

  • model
  • task
  • retries
  • failure stage
  • final reason
  • runtime
  • token estimate

This should come after integ-test and integ-report.

P2: Add A Separate Multiagent/Fallback DeadDrop Experiment

Keep the default DeadDrop template boring and deterministic:

planner -> semantic_context -> context -> implement -> validate -> test -> review

If fallback is useful, put it in a separate experiment template, for example:

tutorial-deaddrop-multiagent

or:

examples/templates/multiagent-fallback

Reason:

  • fallback makes artifacts harder to reason about
  • model variability is bad while debugging pipeline behavior
  • the default template should remain the reliability harness