Checked out commit from rsarv3006 which is super interesting, grabbed some inspiration from it and mentioned it in the ideas file.
Ideas TODO
This file tracks open ideas only. Completed items should be removed after they land.
Priority scale:
- P0: do next; directly improves current feedback loop
- P1: important after the current loop is usable
- P2: useful, but only after basics are stable
- P3: defer or maybe reject
P1: Add Test Governance For Generated Tests
Use this only for generated-test mode. Do not put generated tests back into the default DeadDrop fixed-test pipeline yet.
Previous failures showed that test-writing agents will:
- edit app code
- import nonexistent modules
- require undeclared dependencies
- inspect implementation internals
- write tests for future behavior
Governance should be deterministic first, model-reviewed second.
Deterministic checks:
- test-writing stage may only touch tests/
- tests compile
- tests import only allowed public interfaces
- tests do not import undeclared dependencies
- tests do not define Flask routes or app implementation
- test names match current task id or current artifact
- no future-task keywords unless accepted by current task acceptance criteria
Then an optional model reviewer checks acceptance-criteria alignment.
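A minimal sketch of two of the deterministic checks above (path scope and allowed imports), using only the standard library. The function names and the `ALLOWED_IMPORT_ROOTS` allow-list are hypothetical, not existing NightShift APIs, and touched paths are assumed to be normalized and repo-relative.

```python
import ast
from pathlib import PurePosixPath

# Hypothetical allow-list; a real one would come from project config.
ALLOWED_IMPORT_ROOTS = {"pytest", "app"}


def paths_outside_tests(touched_paths):
    """Return every touched path whose first component is not tests/.

    Assumes paths are already normalized and relative to the repo root.
    """
    return [p for p in touched_paths if PurePosixPath(p).parts[:1] != ("tests",)]


def disallowed_imports(test_source, allowed_roots=ALLOWED_IMPORT_ROOTS):
    """Return sorted top-level module names that the test imports but may not.

    ast.parse raises SyntaxError when the source does not compile, so the
    "tests compile" check comes for free.
    """
    tree = ast.parse(test_source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return sorted(found - set(allowed_roots))
```

Both functions return a list of violations, so an empty list means the check passed and a non-empty list can be written straight into a failure report.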
P0: Preserve Good Drafts During Repair
When a generated file block contains useful allowed content plus disallowed or invalid extra content, avoid redrafting from scratch.
Possible behavior:
- keep the allowed candidate file artifact
- strip disallowed file blocks only when configured as safe for that stage
- continue with validation for the allowed content
- or ask the model for a minimal correction that preserves the accepted candidate
For writing workflows, preserving a good scene is more valuable than forcing a full retry.
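One way to sketch the "keep allowed, strip disallowed" behavior is a pure function that partitions the generated file blocks. The function name, the `{path: content}` shape, and the `chapters/*.md` pattern in the test data are assumptions for illustration only.

```python
from fnmatch import fnmatchcase


def partition_file_blocks(blocks, allowed_patterns):
    """Split {path: content} into (kept, stripped) using glob-style patterns.

    fnmatchcase is used so matching is case-sensitive on every platform.
    """
    kept, stripped = {}, {}
    for path, content in blocks.items():
        is_allowed = any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
        (kept if is_allowed else stripped)[path] = content
    return kept, stripped
```

Validation can then continue on `kept`, while `stripped` is recorded in the run artifacts so nothing disappears silently.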
P0: Remove Runtime Overrides For Custom Ollama Models
If a model is a tuned local Ollama model such as nightshift-writer or nightshift-base, prefer the Modelfile parameters unless the stage has a specific reason to override them.
Candidate config cleanup:
- remove temperature
- remove num_ctx
- remove num_predict
- remove stop if present
This avoids NightShift accidentally overriding tuned custom-model behavior.
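The cleanup could also be enforced at load time. The sketch below assumes stage options arrive as a plain dict and that the set of custom model names is known; `strip_runtime_overrides` is a hypothetical helper, not an existing function.

```python
# Option names listed in the candidate cleanup above.
OVERRIDE_KEYS = ("temperature", "num_ctx", "num_predict", "stop")


def strip_runtime_overrides(stage_options, model_name, custom_models):
    """Return a copy of stage_options without overrides for tuned models.

    Options for models outside custom_models are returned unchanged.
    """
    if model_name not in custom_models:
        return dict(stage_options)
    return {k: v for k, v in stage_options.items() if k not in OVERRIDE_KEYS}
```

A stage with a specific reason to override a parameter would need an explicit opt-out, which this sketch does not model.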
P1: Improve what-happened For Model Runs
The report should identify usable intermediate work, not only final failure state.
Examples:
- model produced a valid scene candidate
- validation rejected extra state files
- recover candidate from candidate-files/<stage>/index.md
- retry output was invalid or too short
- next recommended action
This should make failed creative-writing runs reviewable without manually reading every artifact.
P1: Add Stage-Specific Task Views
The same task may say both "write scene" and "update state", but those responsibilities belong to different stages.
Stage prompts should receive a filtered task view:
- drafter sees only scene-writing criteria
- state updater sees only durable state update criteria
- reviewers see criteria relevant to their review role
This reduces prompt contradiction and makes deterministic stage rules easier for models to follow.
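A filtered task view can be sketched as below, assuming each acceptance criterion is tagged with the stages that should see it. The task shape (`id`, `criteria`, `text`, `stages`) and the stage names are hypothetical.

```python
def task_view(task, stage):
    """Return a reduced task holding only the criteria tagged for this stage."""
    criteria = [c["text"] for c in task["criteria"] if stage in c["stages"]]
    return {"id": task["id"], "criteria": criteria}
```

With this shape, the drafter prompt never contains the state-update criteria, so there is nothing for it to contradict.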
P1: Preserve Intra-Attempt Rerun Artifacts
When NightShift re-runs an agent inside the same stage attempt, do not overwrite the previous artifact.
Examples:
- draft_scene-agent-output.md
- draft_scene-agent-output-invalid-rerun-1.md
- draft_scene-agent-output-1.md
This keeps the initial useful output visible even when strict rerun output is worse.
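A minimal sketch of non-overwriting artifact naming, assuming a simple numeric suffix before the extension. `next_artifact_name` is a hypothetical helper; the actual naming scheme may differ.

```python
import os


def next_artifact_name(base_name, existing_names):
    """Return base_name if unused, else the first free base-1, base-2, ... name."""
    if base_name not in existing_names:
        return base_name
    stem, ext = os.path.splitext(base_name)
    n = 1
    while f"{stem}-{n}{ext}" in existing_names:
        n += 1
    return f"{stem}-{n}{ext}"
```

Taking the existing names as an argument keeps the function free of filesystem access, so it is easy to test.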
P1: Classify Writing Review Failures For Repair Routing
The tutorial novel now has a short-term editor stage for review failures, but review failures should eventually be classified before routing.
Candidate classes:
- local_edit: pronoun drift, small continuity issue, missing beat, light style correction
- redraft: wrong premise, broken scene structure, impossible chronology, severe acceptance mismatch
- escalate: ambiguous canon conflict or user preference needed
Route local_edit to the editor, redraft to the drafter, and escalate to a clear user-facing failure. Keep original draft and edited draft artifacts side by side for comparison.
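The routing itself can stay a lookup table. In the sketch below the stage names on the right-hand side are placeholders, and sending unknown classes to escalation is an assumption rather than a decided behavior.

```python
# Failure class -> follow-up stage. Stage names are hypothetical.
ROUTES = {
    "local_edit": "editor",
    "redraft": "drafter",
    "escalate": "user_failure",
}


def route_review_failure(failure_class):
    """Return the follow-up stage; unknown classes escalate instead of guessing."""
    return ROUTES.get(failure_class, ROUTES["escalate"])
```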
P1: Separate Expressive Drafting From Corrective Editing
If writing quality feels lower after adding stricter correctness constraints, test a two-pass creative workflow:
free_draft -> corrective_edit -> review
The drafter would receive fewer mechanical constraints and focus on voice, scene energy, imagery, and character presence. A later editor pass would enforce:
- output format
- canonical pronouns
- continuity details
- acceptance criteria
- minor cleanup
This may recover stronger prose while keeping correctness guarantees. Treat it as an experiment, not a default, because weaker local models may use the looser draft prompt as permission to ignore task boundaries.
P1: Add A Writing-Mode Validator
Add deterministic checks for prose workflows:
- scene file exists at requested path
- scene word count is within configured range
- drafter did not touch state files
- state updater did not touch chapter prose
- no TODOs, author notes, or bracket placeholders
- optional checks for repeated headings or accidental prompt leakage
This should run before model review stages.
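A few of these checks can be sketched as a single function that returns a list of problems. The function name, the messages, and the placeholder regex are assumptions; the bracket check skips Markdown links but would still flag other legitimate bracketed text.

```python
import re

# Matches [like this] unless it is immediately followed by "(" (a Markdown link).
BRACKET_PLACEHOLDER = re.compile(r"\[[^\]\n]+\](?!\()")


def validate_scene(text, min_words, max_words):
    """Return a list of problems found in the scene text; empty means it passed."""
    problems = []
    word_count = len(text.split())
    if not min_words <= word_count <= max_words:
        problems.append(f"word count {word_count} outside {min_words}-{max_words}")
    if re.search(r"\bTODO\b", text):
        problems.append("contains TODO")
    if BRACKET_PLACEHOLDER.search(text):
        problems.append("contains bracket placeholder")
    return problems
```

File-scope checks (drafter touching state files, state updater touching prose) need the list of touched paths rather than the text, so they would live in a separate function.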
P1: Use Structured State Events For Writing Workflows
Replace model-written full state-file rewrites with compact structured state events, then let NightShift deterministically merge them into durable files such as:
- story/plot-state.md
- story/characters.md
- story/timeline.md
- story/unresolved-threads.md
Candidate state updater output:
events:
  - file: story/plot-state.md
    section: Completed Scenes
    add:
      - SCENE-001 complete; Saint and Miette introduced.
  - file: story/unresolved-threads.md
    section: Open Threads
    add:
      - Saint depends emotionally on Miette and needs compute tokens to keep her present.
NightShift would validate allowed files/sections, reject unknown targets, and apply append/update operations deterministically. This avoids asking a writing model to rewrite entire durable state files after every scene.
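The deterministic merge could look roughly like the sketch below. It assumes the events are already parsed into dicts, that sections in the state files are `## ` headings, and that additions are rendered as bullet lines; none of these details are confirmed by the current design, and `apply_events` is a hypothetical name. Only the append (`add`) operation is covered.

```python
def apply_events(files, events, allowed):
    """Apply add-events to in-memory Markdown files and return the merged copy.

    files:   {path: markdown text}
    events:  list of {"file": ..., "section": ..., "add": [items]}
    allowed: {path: set of section titles that events may target}
    """
    merged = dict(files)
    for event in events:
        path, section = event["file"], event["section"]
        if section not in allowed.get(path, set()):
            raise ValueError(f"unknown target: {path} / {section}")
        lines = merged[path].splitlines()
        heading = f"## {section}"
        if heading not in lines:
            raise ValueError(f"missing heading in {path}: {heading}")
        # The section ends at the next heading or at end of file.
        start = lines.index(heading) + 1
        end = next(
            (i for i in range(start, len(lines)) if lines[i].startswith("## ")),
            len(lines),
        )
        # Step back over trailing blank lines so new bullets join existing ones.
        while end > start and not lines[end - 1].strip():
            end -= 1
        # Skip items already present, which makes re-applying events harmless.
        existing = set(lines[start:end])
        new_items = [f"- {item}" for item in event["add"] if f"- {item}" not in existing]
        lines[end:end] = new_items
        merged[path] = "\n".join(lines) + "\n"
    return merged
```

Because duplicates are skipped, replaying the same events after a retry leaves the state files unchanged.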
P2: Add A Test Analyzer Agent For TDD
Defer until generated tests are stable.
Possible pipeline:
write_tests -> validate_tests -> analyze_tests -> implement
Analyzer output should be concrete:
Implementation requirements:
- create_app(database_path) must return a Flask app.
- POST /snippets must return 201 and JSON id.
- GET /snippets/<id> must return persisted fields.
Do not modify:
- tests/test_task001.py
This may help smaller models, but it is another model output that can be wrong. Add it only after the fixed-test pipeline works through all DeadDrop tasks.
P2/P3: Add A Test Planner
Maybe, but defer.
This overlaps with:
- planner
- test analyzer
- test governance
Too many planning-ish stages can make the pipeline bloated and contradictory.
If implemented later, keep it focused:
test_planner -> write_tests -> test_governance -> implement
For now, fold this idea into the future test governance/analyzer work.
P2: Add Run Comparison
Useful once 14B and 30B model runs are being compared:
nightshift compare-runs --latest 5
Show:
- model
- task
- retries
- failure stage
- final reason
- runtime
- token estimate
This should come after integ-test and integ-report.
P2: Add A Separate Multiagent/Fallback DeadDrop Experiment
Keep the default DeadDrop template boring and deterministic:
planner -> semantic_context -> context -> implement -> validate -> test -> review
If fallback is useful, put it in a separate experiment template, for example:
tutorial-deaddrop-multiagent
or:
examples/templates/multiagent-fallback
Reason:
- fallback makes artifacts harder to reason about
- model variability is bad while debugging pipeline behavior
- the default template should remain the reliability harness
P2: Adopt Useful Fork Ideas From rsarv3006
Source: 649eef6554
Author: GitHub user rsarv3006
Useful ideas to consider porting:
- Add on_status stage routing so review stages can route pass, retry, fail, and escalate to different follow-up stages.
- Add configurable repo lookup exclusions, for example safety.skip_repo_parts, so projects can hide generated or irrelevant directories from planner/reviewer context tools.
- Add configurable agent timeout, for example pipeline.agent_timeout_seconds, so long local-model runs can be tuned per project.
- Add docs and focused tests around status-based routing behavior.