# NightShift

## Auditable Local-First AI Coding Pipelines

Version: v0.1 Draft
Author: K455
Status: Design Proposal

---

# 1. Executive Summary

NightShift is a local-first AI pipeline runner designed to execute long-running coding workflows against a constrained project workspace.

The system is intended to run overnight or unattended for extended periods while remaining:

* Cheap
* Correct
* Auditable
* Safe
* Reviewable

NightShift is not designed to be a fully autonomous "AI software engineer."

Instead, it is a deterministic orchestration system that allows fallible AI agents to operate within constrained, test-driven, auditable workflows.

The core philosophy is:

> Treat LLMs like unreliable distributed systems.

Agents are bounded by:

* Scoped repository access
* Structured stage contracts
* Explicit retry behavior
* Tests and static checks
* Review stages
* Context compaction
* Artifact logging

The intended workflow is:

1. User provides:

   * Repository
   * Task list
   * Pipeline configuration
   * Agent definitions

2. NightShift:

   * Selects the next task
   * Generates a plan
   * Reviews the plan
   * Implements changes
   * Runs tests/static analysis
   * Reviews results
   * Retries if necessary
   * Produces an overnight report

The result is a reviewable repository state and a full audit trail of AI behavior.

---

# 2. Goals

## 2.1 Primary Goals

### Local-first execution

The system should work primarily with local models and local execution environments.

Examples:

* Ollama
* Local transformers
* Local agent runtimes
* Claude Code
* Codex CLI

### Long-running unattended workflows

NightShift should support:

* Overnight execution
* Large task chains
* Multi-stage workflows
* Automated retries
* Context handoff between stages

### Auditability

Every important action should be recorded.
Users should be able to inspect:

* Prompts
* Plans
* Reviews
* Command outputs
* Diffs
* Test results
* Retry reasoning
* Final summaries

### Cheapness-first execution

The orchestration layer should assume:

* Cheap local models handle most work
* Expensive models are escalation layers
* Context size matters
* Token usage matters
* Retry cost matters

### Safe repository boundaries

The system should:

* Restrict file access
* Restrict shell commands
* Avoid destructive operations
* Minimize repository damage

---

## 2.2 Non-Goals (v1)

The following are intentionally out of scope for v1:

* Fully autonomous software development
* Parallel distributed execution
* Automatic deployment
* Cloud-native orchestration
* Dynamic self-modifying pipelines
* Autonomous internet access
* Agent swarms
* Arbitrary Python execution hooks
* Automatic git pushes
* Full DAG orchestration

---

# 3. Design Philosophy

NightShift is built around several core principles.

## 3.1 Deterministic orchestration

Agents are nondeterministic.

The orchestration system should not be.

Pipeline behavior should be:

* Predictable
* Reproducible
* Configurable
* Explicit

---

## 3.2 Structured state transitions

NightShift uses a state-machine workflow model.

A task moves through defined stages:

```text
Task Queue -> Plan -> Plan Review -> Implement -> Test -> Static Check -> Review -> Retry / Complete
```

Each stage produces:

```yaml
status: pass | fail | retry | escalate
reason: string
next_stage: optional
context_update: optional
```

This allows the pipeline runner to remain deterministic even while agents are probabilistic.

---

## 3.3 Context compaction

Agents should not inherit unlimited history.
Instead:

* Project-level context is persistent and compact
* Task-level context is scoped
* Retry context is summarized
* Stage context is minimized

This reduces:

* Token costs
* Context poisoning
* Hallucination drift
* Recursive confusion

---

## 3.4 Reviewability over autonomy

NightShift is optimized to produce:

* Reviewable code
* Reviewable reports
* Reviewable reasoning

The primary output is:

> A useful morning review state.

Not:

> Fully autonomous shipping.

---

# 4. Architecture Overview

## 4.1 High-Level Components

```text
+-------------------+
|    Task Parser    |
+-------------------+
          |
          v
+-------------------+
|  Pipeline Runner  |
+-------------------+
          |
          v
+-------------------+
|  Stage Executor   |
+-------------------+
     |         |
     |         +----------------+
     |                          |
     v                          v
+-----------+          +----------------+
| Agent API |          | Command Runner |
+-----------+          +----------------+
     |                          |
     v                          v
+-----------+          +----------------+
| LLM Model |          | Test/Lint/etc  |
+-----------+          +----------------+
```

---

## 4.2 Core Components

### Task Parser

Responsible for:

* Reading markdown task files
* Parsing acceptance criteria
* Tracking completion state
* Determining dependencies

---

### Pipeline Runner

Responsible for:

* Stage orchestration
* Retry logic
* State transitions
* Artifact management
* Context propagation

---

### Stage Executor

Responsible for:

* Executing stage definitions
* Calling agents
* Running commands
* Collecting outputs

---

### Agent Layer

Responsible for:

* Prompt construction
* Model backend integration
* Structured output parsing
* Context injection

---

### Command Runner

Responsible for:

* Executing tests
* Static analysis
* Formatting
* Shell command restrictions
* Sandboxing

---

# 5. Workflow Model

## 5.1 State Machine Model

NightShift uses a configurable state-machine workflow.
This was selected over:

* DAG orchestration
* Arbitrary scripting

because:

* v1 executes one task at a time
* Retry loops are first-class
* Auditability is easier
* Deterministic transitions are simpler

---

## 5.2 Default Pipeline

```text
PLAN
  ↓
REVIEW_PLAN
  ↓
IMPLEMENT
  ↓
TEST
  ↓
STATIC_ANALYSIS
  ↓
REVIEW
  ↓
DECISION
```

Decision outcomes:

* COMPLETE
* RETRY_IMPLEMENTATION
* RETRY_PLANNING
* FAIL

---

## 5.3 Configurable Pipelines

Pipelines are defined declaratively.

Users may:

* Swap stage orders
* Add/remove stages
* Define retry behavior
* Use different models
* A/B test prompts
* Experiment with reasoning structures

---

# 6. Configuration System

## 6.1 Configuration Format

NightShift uses YAML configuration files.

Reasons:

* Human-readable
* Good nested structure support
* Easier workflow representation than TOML
* Safer than arbitrary Python execution

---

## 6.2 Example Configuration

```yaml
project:
  name: my-project
  root: .
  task_file: tasks.md
  artifact_dir: .nightshift

safety:
  require_clean_worktree: true
  scoped_paths:
    - src/
    - tests/
  forbidden_commands:
    - rm -rf
    - git push
  allowed_commands:
    - cargo test
    - cargo fmt
    - cargo clippy

agents:
  planner:
    backend: ollama
    model: qwen2.5-coder:14b
    system_prompt: agents/planner.md
  implementer:
    backend: claude-code
    model: sonnet
    system_prompt: agents/implementer.md
  reviewer:
    backend: ollama
    model: deepseek-r1:32b
    system_prompt: agents/reviewer.md

pipeline:
  max_task_retries: 3
  stages:
    - id: plan
      type: agent
      agent: planner
    - id: review_plan
      type: review
      agent: reviewer
      on_fail: plan
    - id: implement
      type: agent
      agent: implementer
    - id: test
      type: command
      commands:
        - cargo test
    - id: static
      type: command
      commands:
        - cargo fmt --check
        - cargo clippy -- -D warnings
    - id: review
      type: review
      agent: reviewer
      on_fail: implement
```

---

# 7. Task System

## 7.1 Task Format

Tasks are defined in markdown.
Example:

```markdown
- [ ] TASK-001: Add retry support to pipeline runner

  Acceptance Criteria:
  - Retries configurable per stage
  - Retry summaries persisted
  - Retry count visible in final report
```

---

## 7.2 Task Lifecycle

Each task:

1. Is parsed
2. Is assigned a workspace
3. Receives planning
4. Receives implementation
5. Is validated
6. Is reviewed
7. Produces artifacts
8. Is marked complete or failed

---

## 7.3 Task Dependencies

Future versions may support:

```text
TASK-003 depends on TASK-001
```

However:

* Tasks should remain independently testable when possible
* Pipelines should maintain a buildable repository state

---

# 8. Agent Model

## 8.1 Agent Roles

Agents are specialized.

Example roles:

* planner
* implementer
* reviewer
* summarizer
* test-writer

---

## 8.2 Agent Definitions

Agents are configurable.

Each agent defines:

* Backend
* Model
* System prompt
* Constraints
* Output schema

---

## 8.3 Multi-Backend Support

NightShift should support:

* Ollama
* Claude Code
* Codex CLI
* Future local runners

This allows:

* Cheap local planning
* Expensive selective escalation
* Hybrid pipelines

---

## 8.4 Structured Outputs

Agents should emit machine-readable results.

Example:

```yaml
status: pass
summary: |
  Tests succeeded.
issues:
  - None
next_stage: review
```

---

# 9. Context System

## 9.1 Context Layers

NightShift uses layered context.

### Project Context

Long-lived information:

* Architecture
* Coding standards
* Constraints
* Previous summaries

---

### Task Context

Task-specific information:

* Acceptance criteria
* Relevant files
* Prior retries
* Implementation notes

---

### Retry Context

Compact summaries of:

* Previous failures
* Previous reviews
* Previous test errors

---

## 9.2 Context Compaction

Every stage should summarize output.

This prevents:

* Infinite context growth
* Token explosion
* Recursive hallucination
* Low-signal history accumulation

---

# 10. Safety Model

## 10.1 Repository Scope Restrictions

NightShift should restrict:

* Accessible directories
* Writable paths
* Executable commands

---

## 10.2 Command Restrictions

Commands are allowlisted.

Potentially dangerous commands are forbidden.

Examples:

```text
Forbidden:
- rm -rf
- git push
- curl | bash
```

---

## 10.3 Clean Worktree Requirement

v1 may optionally require:

```text
git status == clean
```

before execution.

This simplifies:

* Auditability
* Recovery
* Diff inspection

---

# 11. Testing and Validation

## 11.1 Validation Pipeline

Validation occurs in multiple stages:

```text
Tests
  ↓
Static Analysis
  ↓
Review Agent
  ↓
Decision
```

---

## 11.2 Global Test Suite

Tests are global.

Rationale:

* New changes must not break old functionality
* Pipeline should maintain cumulative stability

---

## 11.3 Generated Tests

Agents may generate tests for features.

Generated tests become part of the persistent suite.

---

# 12. Artifact System

## 12.1 Artifact Goals

Artifacts provide:

* Auditability
* Replayability
* Debugging
* Historical inspection
* Prompt experimentation

---

## 12.2 Example Layout

```text
.nightshift/
  project-context.md
  runs/
    2026-05-16-overnight/
      run-summary.md
      config.snapshot.yaml
      tasks/
        TASK-001/
          task.md
          plan.md
          plan-review.md
          implementation-log.md
          test-output.txt
          static-output.txt
          review.md
          final-notes.md
          diff.patch
          context-out.md
```

---

# 13. Overnight Report

At completion NightShift generates:

* Completed tasks
* Failed tasks
* Retry counts
* Files modified
* Test results
* Reviewer summaries
* Remaining issues
* Suggested follow-up work

The goal is:

> Wake up to a review package.

---

# 14. Future Directions

Potential future features:

* Parallel task execution
* DAG workflows
* Distributed workers
* Sandboxed containers
* Git branch isolation
* Agent tournaments
* Constraint language experimentation
* Prompt A/B testing
* Semantic memory systems
* Multi-repo orchestration
* Web dashboard
* Cost telemetry
* Human approval gates

---

# 15. Risks

## 15.1 Context poisoning

Mitigation:

* Context compaction
* Retry summarization
* Structured stage boundaries

---

## 15.2 Agent loops

Mitigation:

* Explicit retry counts
* Deterministic transitions
* Timeout handling

---

## 15.3 Repository damage

Mitigation:

* Scoped directories
* Command restrictions
* Validation stages

---

## 15.4 Cost explosion

Mitigation:

* Local-first execution
* Context minimization
* Escalation-only expensive models

---

# 16. Implemented Baseline

The MVP and post-MVP phases through phase 22 are implemented.

NightShift currently provides:

* `nightshift init` for starter project generation
* `nightshift validate` for config, prompt, task, dependency, path, and command validation
* `nightshift status` for read-only project inspection
* `nightshift run` for the next runnable incomplete task
* `nightshift run --task TASK-ID` for a specific task
* `nightshift run --all` for sequential multi-task execution
* `nightshift web` for a read-only artifact dashboard
* Markdown task parsing with descriptions, acceptance criteria, completion state, and dependency bullets
* Dependency validation for missing references and simple cycles
* Dependency-aware task selection and task blocking
* Declarative YAML pipeline execution
* Command, agent, agent-review, review, and summarize stage handling
* Retry redirection with a configured task retry limit
* Command-backed agents
* Ollama-backed local model agents
* Prompt bundle construction with project, task, retry, and previous-stage context
* Prompt snapshots and run metadata for experiment comparison
* Optional experiment labels and prompt variant metadata
* Command allowlists and forbidden-fragment checks
* Optional shell-free command execution
* Per-stage command timeouts
* Project-root-restricted command working directories
* Environment variable allowlists for command stages
* Scoped path and artifact path safety checks
* Optional clean-worktree enforcement
* Pre-run and post-run git status artifacts
* Per-task `diff.patch` artifacts
* Task completion mutation for successful runs
* Per-run and per-task markdown/text artifacts
* Project, task, retry, and context-out files
* Final task notes, stage summaries, task completion artifacts, and run summaries
* Documentation for config, artifact review, troubleshooting, and quickstart workflows
* A complete fake-agent quickstart Lisp example under `examples/quickstart-lisp/`

The system remains sequential and local-first. It is designed to produce reviewable artifacts and repository state, not to deploy, push, or autonomously ship changes.

---

# 17. Current Product Shape

The implemented product is now a practical local runner rather than only a single-task MVP.

## 17.1 CLI Workflow

Common workflow:

```text
nightshift init
nightshift validate
nightshift status
nightshift run
nightshift run --task TASK-001
nightshift run --all
nightshift web
```

The CLI can validate a project, select runnable tasks, enforce dependencies, run one or more tasks, and report artifact locations.

## 17.2 Artifact Workflow

Artifacts are still the primary audit surface.

Current run artifacts include:

```text
.nightshift/
  project-context.md
  runs/
    /
      run-summary.md
      config.snapshot.yaml
      run-metadata.md
      prompts/
        .md
      tasks/
        TASK-001/
          task.md
          context.md
          plan.md
          implementation-log.md
          test-output.txt
          review.md
          stage-results.md
          context-out.md
          task-completion.md
          git-status-before.txt
          git-status-after.txt
          diff.patch
          final-notes.md
```

Exact task artifact names depend on configured stage `output` values.

## 17.3 Dashboard Workflow

The web dashboard is read-only and artifact-driven.
It currently:

* Lists runs from `.nightshift/runs/`
* Shows run summaries
* Links to text and markdown artifacts
* Safely rejects artifact path traversal
* Auto-refreshes

It does not:

* Start or stop runs
* Mutate config or tasks
* Provide approval gates
* Stream live process output
* Authenticate users

## 17.4 Known Limitations

Current limitations:

* Execution is sequential; there is no parallel task runner.
* The web dashboard is read-only and artifact-oriented.
* Live run progress is limited to basic CLI prints and artifact inspection.
* Flask is optional; `nightshift web` requires it to be installed.
* Ollama support depends on the user's local Ollama installation and model availability.
* Git artifacts can be unavailable or degraded in non-git repositories or repositories blocked by Git safe-directory rules.
* Task mutation is intentionally minimal and only flips matching checklist lines.
* Command configuration is safer than the MVP but is still string-first for compatibility.
* There is no branch isolation, resumable run state machine, approval workflow, or deployment integration.

---

# 18. Next Major Update Plan

The next major update should improve operational visibility while preserving the current artifact-first model.

## Phase 23: Improved Logging and Live Visibility

NightShift should make active runs easier to observe from both the CLI and the web dashboard.

Implementation tasks:

* [ ] Add a small logging module with structured operational events.
* [ ] Stream human-readable progress to the CLI during `run` and `run --all`.
* [ ] Include run id, task id, stage id, agent/backend, command index, retry count, status, duration, and artifact path where available.
* [ ] Write a per-run log file such as `.nightshift/runs//run.log`.
* [ ] Optionally write or rotate an aggregate `.nightshift/nightshift.log` for cross-run troubleshooting.
* [ ] Keep logs operational; do not duplicate full prompts, full model responses, or full command output that already lives in artifacts.
* [ ] Redact or avoid secrets from logged environment/config values.
* [ ] Add dashboard support for viewing the latest log tail.
* [ ] Cap the dashboard log view to the last 100 lines by default.
* [ ] Keep the full per-run log file available as an artifact unless a later size cap is configured.
* [ ] Auto-refresh the dashboard log view with the existing dashboard refresh model.
* [ ] Add tests for log writing, CLI progress hooks, dashboard log rendering, missing log files, and the 100-line cap.

Acceptance Criteria:

* A user running NightShift from a terminal can tell which task and stage are active.
* Long Ollama or command stages show enough lifecycle information that the process does not appear hung.
* The latest run log is visible from `nightshift web`.
* The web client displays at most the last 100 log lines by default.
* Logs point users to detailed artifacts instead of replacing them.
* Missing or partial log files do not crash the dashboard.

Notes:

* This phase should not add process control, websockets, authentication, or write actions to the web client.
* If future live streaming is needed, the first version can still use file tailing plus refresh before introducing websockets.
* Operational logs should complement artifacts: artifacts remain the source of detailed prompts, responses, command output, diffs, and summaries.

## Phase 24: Per-Agent Model Parameters

- [ ] Add `temperature` to agent config.
- [ ] Pass temperature to Ollama/OpenAI-compatible backends.
- [ ] Default safely if omitted.
- [ ] Add config validation tests.

## Phase 25: Repo Lookup Tools MVP

- [ ] Add tool interface for repo operations.
- [ ] Implement scoped `list_files`.
- [ ] Implement scoped `read_file`.
- [ ] Implement scoped `grep`.
- [ ] Enforce existing path safety rules.
- [ ] Log tool calls as artifacts.
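
One way the scoped lookup tools from Phase 25 might look is sketched below in Python. The names `list_files`, `read_file`, and `grep` come from the checklist; `resolve_scoped`, `ScopeError`, the `max_chars` cap, and all signatures are assumptions made for illustration, not NightShift's actual tool interface. Logging tool calls as artifacts is left out of the sketch.

```python
from __future__ import annotations

from pathlib import Path


class ScopeError(ValueError):
    """Raised when a requested path resolves outside the scoped paths (hypothetical name)."""


def resolve_scoped(root: Path, scoped_paths: list[str], requested: str) -> Path:
    """Resolve `requested` against `root` and reject anything outside the scopes."""
    root = root.resolve()
    candidate = (root / requested).resolve()  # collapses ".." and follows symlinks
    for scope in scoped_paths:
        allowed = (root / scope).resolve()
        if candidate == allowed or allowed in candidate.parents:
            return candidate
    raise ScopeError(f"{requested!r} is outside the scoped paths")


def list_files(root: Path, scoped_paths: list[str]) -> list[str]:
    """Return project-relative paths of all files under the scoped directories."""
    root = root.resolve()
    found: list[str] = []
    for scope in scoped_paths:
        base = (root / scope).resolve()
        if base.is_dir():
            found.extend(
                p.relative_to(root).as_posix() for p in base.rglob("*") if p.is_file()
            )
    return sorted(found)


def read_file(root: Path, scoped_paths: list[str], requested: str,
              max_chars: int = 20_000) -> str:
    """Read a scoped file, truncated to `max_chars` to keep context small."""
    path = resolve_scoped(root, scoped_paths, requested)
    return path.read_text(encoding="utf-8", errors="replace")[:max_chars]


def grep(root: Path, scoped_paths: list[str], needle: str) -> list[tuple[str, int, str]]:
    """Plain substring search; returns (path, line number, stripped line) tuples."""
    hits: list[tuple[str, int, str]] = []
    for rel in list_files(root, scoped_paths):
        try:
            text = read_file(root, scoped_paths, rel)
        except ScopeError:
            continue  # e.g. a symlink inside the scope that points outside it
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((rel, lineno, line.strip()))
    return hits
```

Resolving the path before the scope check is what rejects `src/../secret.txt` and absolute paths; a string-prefix comparison on the unresolved path would not.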
## Phase 26: Planner Code-Discovery Support

- [ ] Teach planner prompt to request needed code context.
- [ ] Add structured planner output for lookup requests.
- [ ] Execute requested lookup tools.
- [ ] Save `files-inspected.md`.
- [ ] Re-run planner with retrieved context.

## Phase 27: Context Pack Builder

- [ ] Add `repo_context` stage.
- [ ] Generate `context-pack.md`.
- [ ] Include task, acceptance criteria, relevant files, snippets, and constraints.
- [ ] Add line-numbered excerpts.
- [ ] Add context-size caps.

## Phase 28: Project Context Chart MVP

- [ ] Generate `.nightshift/project-context-chart.md`.
- [ ] Include files, responsibilities, functions/classes, entry points, tests.
- [ ] Use simple regex/parser MVP.
- [ ] Update chart during planning.
- [ ] Store anchors/line numbers/search terms.

## Phase 29: Code Writer Stage

- [ ] Add `code_writer` stage type.
- [ ] Feed it task + context pack.
- [ ] Require unified diff output.
- [ ] Save `proposed.patch`.
- [ ] Save `implementation-summary.md`.

## Phase 30: Patch Normalization

- [ ] Add `patch_normalizer` stage.
- [ ] Support low-temperature formatter model.
- [ ] Convert messy model output to valid unified diff.
- [ ] Reject missing/ambiguous edits.
- [ ] Save `normalized.patch`.

## Phase 31: Patch Validation

- [ ] Parse unified diffs.
- [ ] Reject malformed patches.
- [ ] Enforce scoped paths.
- [ ] Reject path traversal.
- [ ] Enforce max files/max lines changed.
- [ ] Reject forbidden files.

## Phase 32: Patch Apply / Dry Run

- [ ] Add `patch_apply` stage.
- [ ] Support `mode: dry_run`.
- [ ] Support `mode: apply`.
- [ ] Save `applied.patch`.
- [ ] Preserve pre/post git status.
- [ ] Fail cleanly on apply errors.

## Phase 33: Test Feedback Repair Loop

- [ ] Feed test/static failure output back into implementer.
- [ ] Add bounded repair attempts.
- [ ] Save each repair patch.
- [ ] Save repair summaries.
- [ ] Stop after max retry count.
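
The bounded repair loop from Phase 33 could be sketched roughly as follows, in Python. Everything named here (`repair_loop`, `RepairResult`, `tail`, and the two injected callables) is hypothetical and only illustrates the control flow: run checks, feed a trimmed failure output to the implementer, record a summary, and stop at a fixed attempt limit. Saving repair patches and summaries as artifacts is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RepairResult:
    """Outcome of a bounded repair loop (hypothetical structure)."""
    passed: bool
    attempts: int
    summaries: list[str] = field(default_factory=list)


def tail(text: str, max_lines: int = 40) -> str:
    """Keep only the last `max_lines` lines so retry context stays compact."""
    return "\n".join(text.splitlines()[-max_lines:])


def repair_loop(
    run_checks: Callable[[], tuple[bool, str]],
    request_repair: Callable[[str, int], str],
    max_attempts: int = 3,
) -> RepairResult:
    """Run checks; on failure, ask for a repair and re-check, up to `max_attempts`.

    `run_checks` returns (passed, output) for the test/static stages.
    `request_repair` receives the trimmed failure output and the attempt
    number, applies a fix, and returns a short summary of what it changed.
    """
    summaries: list[str] = []
    attempts = 0
    passed, output = run_checks()
    while not passed and attempts < max_attempts:
        attempts += 1
        summaries.append(request_repair(tail(output), attempts))
        passed, output = run_checks()
    return RepairResult(passed=passed, attempts=attempts, summaries=summaries)
```

Injecting the two callables keeps the loop itself deterministic and easy to test with a fake agent, even though the repair step is not.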
## Phase 34: End-to-End Coding Quickstart

- [ ] Update quickstart to modify real code.
- [ ] Include fake-agent test fixture.
- [ ] Demonstrate lookup → context pack → patch → apply → test.
- [ ] Document dry-run vs apply mode.

---

# Appendix A: Design Decisions and Rationale

## A.1 Local-first architecture

Decision:

* Prefer local models and local execution

Reasoning:

* Cheapness-first design
* Better experimentation
* Better privacy
* Reduced vendor dependency
* Better overnight scalability

---

## A.2 State machine over DAG

Decision:

* Use configurable state-machine workflows

Reasoning:

* One-task-at-a-time execution
* Retry loops are primary workflow behavior
* Easier auditing
* Easier debugging
* Simpler MVP

---

## A.3 YAML configuration

Decision:

* Use declarative YAML config

Reasoning:

* Human-readable
* Easier nested workflow representation
* Safer than arbitrary Python
* Better portability

---

## A.4 Cheapness-first model routing

Decision:

* Use expensive models selectively

Reasoning:

* Overnight pipelines can become token-expensive
* Local models are sufficient for many stages
* Review stages benefit more from premium models

---

## A.5 Strict repository scoping

Decision:

* Limit writable paths and executable commands

Reasoning:

* Prevent accidental damage
* Maintain trust in unattended execution
* Improve auditability

---

## A.6 Reviewable output over autonomy

Decision:

* Produce review packages rather than autonomous shipping

Reasoning:

* Human review remains critical
* Improves safety
* Improves correctness
* Keeps architecture grounded and practical

---

## A.7 Layered context model

Decision:

* Separate project, task, and retry context

Reasoning:

* Reduces token usage
* Prevents context explosion
* Improves signal quality
* Prevents recursive drift

---

## A.8 Artifact-heavy architecture

Decision:

* Persist plans, logs, reviews, outputs, and summaries

Reasoning:

* Debugging
* Prompt experimentation
* A/B testing
* Replayability
* Portfolio visibility

---
## A.9 No parallelism in v1

Decision:

* Execute one task at a time

Reasoning:

* Simpler correctness model
* Easier debugging
* Easier repository safety
* Easier context management

---

## A.10 Declarative pipelines first

Decision:

* No arbitrary Python hooks in v1

Reasoning:

* Safer execution
* Easier reproducibility
* Easier auditing
* Easier portability

---

# Closing Statement

NightShift is intended to explore a practical middle ground between:

* Fully manual software engineering
* Reckless autonomous agent systems

The system assumes that AI agents are useful but unreliable.

NightShift therefore treats agents as bounded workers inside deterministic, auditable, test-driven workflows.

The primary output is not blind autonomy.

The primary output is trustworthy leverage.