# NightShift
## Auditable Local-First AI Coding Pipelines
* Version: v0.1 Draft
* Author: K455
* Status: Design Proposal

---
# 1. Executive Summary
NightShift is a local-first AI pipeline runner designed to execute long-running coding workflows against a constrained project workspace.
The system is intended to run overnight or unattended for extended periods while remaining:
* Cheap
* Correct
* Auditable
* Safe
* Reviewable
NightShift is not designed to be a fully autonomous "AI software engineer."
Instead, it is a deterministic orchestration system that allows fallible AI agents to operate within constrained, test-driven, auditable workflows.
The core philosophy is:
> Treat LLMs like unreliable distributed systems.

Agents are bounded by:
* Scoped repository access
* Structured stage contracts
* Explicit retry behavior
* Tests and static checks
* Review stages
* Context compaction
* Artifact logging
The intended workflow is:
1. User provides:
   * Repository
   * Task list
   * Pipeline configuration
   * Agent definitions
2. NightShift:
   * Selects the next task
   * Generates a plan
   * Reviews the plan
   * Implements changes
   * Runs tests/static analysis
   * Reviews results
   * Retries if necessary
   * Produces an overnight report

The result is a reviewable repository state and a full audit trail of AI behavior.

---
# 2. Goals
## 2.1 Primary Goals
### Local-first execution
The system should work primarily with local models and local execution environments.
Examples:
* Ollama
* Local transformers
* Local agent runtimes
* Claude Code
* Codex CLI
### Long-running unattended workflows
NightShift should support:
* Overnight execution
* Large task chains
* Multi-stage workflows
* Automated retries
* Context handoff between stages
### Auditability
Every important action should be recorded.
Users should be able to inspect:
* Prompts
* Plans
* Reviews
* Command outputs
* Diffs
* Test results
* Retry reasoning
* Final summaries
### Cheapness-first execution
The orchestration layer should assume:
* Cheap local models handle most work
* Expensive models are escalation layers
* Context size matters
* Token usage matters
* Retry cost matters
### Safe repository boundaries
The system should:
* Restrict file access
* Restrict shell commands
* Avoid destructive operations
* Minimize repository damage
---
## 2.2 Non-Goals (v1)
The following are intentionally out of scope for v1:
* Fully autonomous software development
* Parallel distributed execution
* Automatic deployment
* Cloud-native orchestration
* Dynamic self-modifying pipelines
* Autonomous internet access
* Agent swarms
* Arbitrary Python execution hooks
* Automatic git pushes
* Full DAG orchestration
---
# 3. Design Philosophy
NightShift is built around several core principles.
## 3.1 Deterministic orchestration
Agents are nondeterministic.
The orchestration system should not be.
Pipeline behavior should be:
* Predictable
* Reproducible
* Configurable
* Explicit
---
## 3.2 Structured state transitions
NightShift uses a state-machine workflow model.
A task moves through defined stages:
```text
Task Queue
-> Plan
-> Plan Review
-> Implement
-> Test
-> Static Check
-> Review
-> Retry / Complete
```
Each stage produces:
```yaml
status: pass | fail | retry | escalate
reason: string
next_stage: optional
context_update: optional
```
This allows the pipeline runner to remain deterministic even while agents are probabilistic.
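
As a minimal sketch of that contract, assuming the agent reply has already been decoded into a mapping: the names `StageResult` and `parse_stage_result` are hypothetical and only illustrate how a malformed reply can be rejected before it influences a transition.

```python
from dataclasses import dataclass
from typing import Optional

VALID_STATUSES = {"pass", "fail", "retry", "escalate"}


@dataclass(frozen=True)
class StageResult:
    """One stage's outcome, mirroring the four fields listed above."""
    status: str
    reason: str
    next_stage: Optional[str] = None
    context_update: Optional[str] = None


def parse_stage_result(raw: dict) -> StageResult:
    """Validate a decoded mapping and turn it into a StageResult.

    Anything malformed raises ValueError, so the runner can treat a bad
    agent reply as an ordinary stage failure instead of guessing.
    """
    status = raw.get("status")
    if not isinstance(status, str) or status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    reason = raw.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return StageResult(
        status=status,
        reason=reason,
        next_stage=raw.get("next_stage"),
        context_update=raw.get("context_update"),
    )
```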
---
## 3.3 Context compaction
Agents should not inherit unlimited history.
Instead:
* Project-level context is persistent and compact
* Task-level context is scoped
* Retry context is summarized
* Stage context is minimized
This reduces:
* Token costs
* Context poisoning
* Hallucination drift
* Recursive confusion
---
## 3.4 Reviewability over autonomy
NightShift is optimized to produce:
* Reviewable code
* Reviewable reports
* Reviewable reasoning
The primary output is:

> A useful morning review state.

Not:

> Fully autonomous shipping.
---
# 4. Architecture Overview
## 4.1 High-Level Components
```text
+-------------------+
|    Task Parser    |
+-------------------+
          |
          v
+-------------------+
|  Pipeline Runner  |
+-------------------+
          |
          v
+-------------------+
|  Stage Executor   |
+-------------------+
     |         |
     |         +----------------+
     |                          |
     v                          v
+-----------+          +----------------+
| Agent API |          | Command Runner |
+-----------+          +----------------+
     |                          |
     v                          v
+-----------+          +----------------+
| LLM Model |          | Test/Lint/etc  |
+-----------+          +----------------+
```
---
## 4.2 Core Components
### Task Parser
Responsible for:
* Reading markdown task files
* Parsing acceptance criteria
* Tracking completion state
* Determining dependencies
---
### Pipeline Runner
Responsible for:
* Stage orchestration
* Retry logic
* State transitions
* Artifact management
* Context propagation
---
### Stage Executor
Responsible for:
* Executing stage definitions
* Calling agents
* Running commands
* Collecting outputs
---
### Agent Layer
Responsible for:
* Prompt construction
* Model backend integration
* Structured output parsing
* Context injection
---
### Command Runner
Responsible for:
* Executing tests
* Static analysis
* Formatting
* Shell command restrictions
* Sandboxing
---
# 5. Workflow Model
## 5.1 State Machine Model
NightShift uses a configurable state-machine workflow.
This was selected over:
* DAG orchestration
* Arbitrary scripting
because:
* v1 executes one task at a time
* Retry loops are first-class
* Auditability is easier
* Deterministic transitions are simpler
---
## 5.2 Default Pipeline
```text
PLAN
REVIEW_PLAN
IMPLEMENT
TEST
STATIC_ANALYSIS
REVIEW
DECISION
```
Decision outcomes:
* COMPLETE
* RETRY_IMPLEMENTATION
* RETRY_PLANNING
* FAIL
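
One way the DECISION step could map stage results onto these four outcomes is sketched here. The three-way `review_verdict` value and the rule that the retry budget is checked before any retry is granted are assumptions made for illustration, not behavior stated by this document.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"
    RETRY_IMPLEMENTATION = "retry_implementation"
    RETRY_PLANNING = "retry_planning"
    FAIL = "fail"


def decide(checks_passed: bool, review_verdict: str,
           retries_used: int, max_task_retries: int) -> Outcome:
    """Map stage results to one of the four decision outcomes.

    review_verdict is a hypothetical value: "approve", "rework_code",
    or "rework_plan".
    """
    if checks_passed and review_verdict == "approve":
        return Outcome.COMPLETE
    # Every remaining path needs another attempt, so check the budget first.
    if retries_used >= max_task_retries:
        return Outcome.FAIL
    if review_verdict == "rework_plan":
        return Outcome.RETRY_PLANNING
    return Outcome.RETRY_IMPLEMENTATION
```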
---
## 5.3 Configurable Pipelines
Pipelines are defined declaratively.
Users may:
* Swap stage orders
* Add/remove stages
* Define retry behavior
* Use different models
* A/B test prompts
* Experiment with reasoning structures
---
# 6. Configuration System
## 6.1 Configuration Format
NightShift uses YAML configuration files.
Reasons:
* Human-readable
* Good nested structure support
* Easier workflow representation than TOML
* Safer than arbitrary Python execution
---
## 6.2 Example Configuration
```yaml
project:
  name: my-project
  root: .
  task_file: tasks.md
  artifact_dir: .nightshift

safety:
  require_clean_worktree: true
  scoped_paths:
    - src/
    - tests/
  forbidden_commands:
    - rm -rf
    - git push
  allowed_commands:
    - cargo test
    - cargo fmt
    - cargo clippy

agents:
  planner:
    backend: ollama
    model: qwen2.5-coder:14b
    system_prompt: agents/planner.md
  implementer:
    backend: claude-code
    model: sonnet
    system_prompt: agents/implementer.md
  reviewer:
    backend: ollama
    model: deepseek-r1:32b
    system_prompt: agents/reviewer.md

pipeline:
  max_task_retries: 3
  stages:
    - id: plan
      type: agent
      agent: planner
    - id: review_plan
      type: review
      agent: reviewer
      on_fail: plan
    - id: implement
      type: agent
      agent: implementer
    - id: test
      type: command
      commands:
        - cargo test
    - id: static
      type: command
      commands:
        - cargo fmt --check
        - cargo clippy -- -D warnings
    - id: review
      type: review
      agent: reviewer
      on_fail: implement
```
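
A configuration like the one above contains cross-references (`agent`, `on_fail`) that can be checked before any model is called. The sketch below assumes the YAML has already been loaded into a plain dictionary; `validate_pipeline` is a hypothetical helper, and the checks it performs are illustrative rather than the full set `nightshift validate` applies.

```python
def validate_pipeline(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    agents = config.get("agents", {})
    stages = config.get("pipeline", {}).get("stages", [])

    # Collect stage ids first so on_fail may point forwards or backwards.
    seen: set = set()
    for stage in stages:
        sid = stage.get("id")
        if sid in seen:
            problems.append(f"duplicate stage id: {sid}")
        seen.add(sid)

    for stage in stages:
        sid, stype = stage.get("id"), stage.get("type")
        if stype in ("agent", "review") and stage.get("agent") not in agents:
            problems.append(f"{sid}: unknown agent {stage.get('agent')!r}")
        if stype == "command" and not stage.get("commands"):
            problems.append(f"{sid}: command stage has no commands")
        target = stage.get("on_fail")
        if target is not None and target not in seen:
            problems.append(f"{sid}: on_fail points at unknown stage {target!r}")
    return problems
```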
---
# 7. Task System
## 7.1 Task Format
Tasks are defined in markdown.
Example:
```markdown
- [ ] TASK-001: Add retry support to pipeline runner
  Acceptance Criteria:
  - Retries configurable per stage
  - Retry summaries persisted
  - Retry count visible in final report
```
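
A task file in this shape can be read with a small line-oriented parser. The following is a sketch under simplifying assumptions: the `Task` class and `parse_tasks` function are hypothetical, and every bullet below a task line is treated as an acceptance criterion (dependency bullets are not distinguished).

```python
import re
from dataclasses import dataclass, field

# Matches lines such as "- [ ] TASK-001: Add retry support".
TASK_LINE = re.compile(r"^- \[(?P<mark>[ xX])\] (?P<id>TASK-\d+): (?P<title>.+)$")


@dataclass
class Task:
    task_id: str
    title: str
    done: bool
    criteria: list[str] = field(default_factory=list)


def parse_tasks(markdown: str) -> list[Task]:
    """Collect checklist tasks and the bullets that follow each one."""
    tasks: list[Task] = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        match = TASK_LINE.match(line)
        if match:
            tasks.append(Task(match["id"], match["title"], match["mark"] in "xX"))
        elif line.startswith("- ") and tasks:
            # Any other bullet is attached to the most recent task.
            tasks[-1].criteria.append(line[2:].strip())
    return tasks
```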
---
## 7.2 Task Lifecycle
Each task:
1. Is parsed
2. Is assigned a workspace
3. Receives planning
4. Receives implementation
5. Is validated
6. Is reviewed
7. Produces artifacts
8. Is marked complete or failed
---
## 7.3 Task Dependencies
Future versions may support:
```text
TASK-003 depends on TASK-001
```
However:
* Tasks should remain independently testable when possible
* Pipelines should maintain a buildable repository state
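
Dependency-aware selection can be sketched with two small functions, assuming dependencies are held as a mapping from task id to the ids it depends on. Both function names are hypothetical; the cycle check uses a standard three-state depth-first search.

```python
from typing import Optional


def next_runnable(deps: dict[str, list[str]], done: set[str]) -> Optional[str]:
    """Return the first incomplete task whose dependencies are all done."""
    for task_id, needs in deps.items():
        if task_id not in done and all(n in done for n in needs):
            return task_id
    return None


def has_cycle(deps: dict[str, list[str]]) -> bool:
    """Detect a dependency cycle with a three-state depth-first search."""
    UNVISITED, IN_PROGRESS, FINISHED = 0, 1, 2
    state = {task_id: UNVISITED for task_id in deps}

    def visit(task_id: str) -> bool:
        state[task_id] = IN_PROGRESS
        for dep in deps.get(task_id, []):
            # Ids missing from deps are skipped here; a separate check
            # would report them as missing references.
            dep_state = state.get(dep, FINISHED)
            if dep_state == IN_PROGRESS:
                return True
            if dep_state == UNVISITED and visit(dep):
                return True
        state[task_id] = FINISHED
        return False

    return any(state[t] == UNVISITED and visit(t) for t in deps)
```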
---
# 8. Agent Model
## 8.1 Agent Roles
Agents are specialized.
Example roles:
* planner
* implementer
* reviewer
* summarizer
* test-writer
---
## 8.2 Agent Definitions
Agents are configurable.
Each agent defines:
* Backend
* Model
* System prompt
* Constraints
* Output schema
---
## 8.3 Multi-Backend Support
NightShift should support:
* Ollama
* Claude Code
* Codex CLI
* Future local runners
This allows:
* Cheap local planning
* Expensive selective escalation
* Hybrid pipelines
---
## 8.4 Structured Outputs
Agents should emit machine-readable results.
Example:
```yaml
status: pass
summary: |
  Tests succeeded.
issues:
  - None
next_stage: review
```
---
# 9. Context System
## 9.1 Context Layers
NightShift uses layered context.
### Project Context
Long-lived information:
* Architecture
* Coding standards
* Constraints
* Previous summaries
---
### Task Context
Task-specific information:
* Acceptance criteria
* Relevant files
* Prior retries
* Implementation notes
---
### Retry Context
Compact summaries of:
* Previous failures
* Previous reviews
* Previous test errors
---
## 9.2 Context Compaction
Every stage should summarize its output before that output is handed to later stages.
This prevents:
* Infinite context growth
* Token explosion
* Recursive hallucination
* Low-signal history accumulation
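
The cheapest form of compaction is plain truncation rather than model-written summarization. The sketch below uses that simpler stand-in: it keeps the head and tail of long output and records how much was dropped. `compact_output` and its default cap are hypothetical.

```python
def compact_output(text: str, max_lines: int = 40) -> str:
    """Keep the first and last lines of long output, marking the gap."""
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = max_lines // 2
    tail = max_lines - head
    omitted = len(lines) - max_lines
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[-tail:])
```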
---
# 10. Safety Model
## 10.1 Repository Scope Restrictions
NightShift should restrict:
* Accessible directories
* Writable paths
* Executable commands
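
A path restriction of this kind can be sketched by resolving the candidate path and testing whether it still sits under an allowed directory. `is_within_scope` is a hypothetical helper; it assumes `scoped_paths` entries are relative to the project root, as in the example configuration, and it needs Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


def is_within_scope(candidate: str, project_root: str, scoped_paths: list[str]) -> bool:
    """True if candidate resolves inside one of the scoped directories."""
    root = Path(project_root).resolve()
    # Joining an absolute candidate discards root, so it is judged as given.
    target = (root / candidate).resolve()
    for scope in scoped_paths:
        allowed = (root / scope).resolve()
        if target.is_relative_to(allowed):
            return True
    return False
```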
---
## 10.2 Command Restrictions
Commands are allowlisted.
Potentially dangerous commands are forbidden.
Examples:
```text
Forbidden:
- rm -rf
- git push
- curl | bash
```
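
A string-based check along these lines could look like the sketch below. `is_command_allowed` is hypothetical, and two of its rules are assumptions: forbidden fragments take precedence over the allowlist, and any shell metacharacter is rejected so an allowed prefix cannot carry a second command.

```python
import shlex

# Characters that let a shell chain, redirect, or substitute commands.
SHELL_META = set(";|&<>`$\n")


def is_command_allowed(command: str, allowed: list[str], forbidden: list[str]) -> bool:
    """Accept a command only if it starts with an allowlisted token prefix."""
    if any(fragment in command for fragment in forbidden):
        return False
    if SHELL_META & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # for example, an unclosed quote
        return False
    for entry in allowed:
        prefix = shlex.split(entry)
        # Compare whole tokens so "cargo testx" does not match "cargo test".
        if prefix and tokens[:len(prefix)] == prefix:
            return True
    return False
```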
---
## 10.3 Clean Worktree Requirement
Before execution, v1 may optionally require a clean git worktree, meaning the following command prints nothing:
```text
git status --porcelain
```
This simplifies:
* Auditability
* Recovery
* Diff inspection
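
A clean-worktree gate can be sketched around `git status --porcelain`, which prints one line per changed or untracked path. The function names are hypothetical, and treating a failed or missing `git` as "not clean" is an assumption chosen to fail safe.

```python
import subprocess


def porcelain_is_clean(porcelain_output: str) -> bool:
    """Porcelain output has one line per change, so empty means clean."""
    return porcelain_output.strip() == ""


def worktree_is_clean(repo_root: str) -> bool:
    """Run git in repo_root; any git failure counts as not clean."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_root, capture_output=True, text=True, check=False,
        )
    except OSError:  # git missing or repo_root not usable
        return False
    return result.returncode == 0 and porcelain_is_clean(result.stdout)
```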
---
# 11. Testing and Validation
## 11.1 Validation Pipeline
Validation occurs in multiple stages:
```text
Tests
Static Analysis
Review Agent
Decision
```
---
## 11.2 Global Test Suite
Tests are global: each task is validated against the whole test suite, not only the tests written for that task.
Rationale:
* New changes must not break old functionality
* Pipeline should maintain cumulative stability
---
## 11.3 Generated Tests
Agents may generate tests for features.
Generated tests become part of the persistent suite.

---
# 12. Artifact System
## 12.1 Artifact Goals
Artifacts provide:
* Auditability
* Replayability
* Debugging
* Historical inspection
* Prompt experimentation
---
## 12.2 Example Layout
```text
.nightshift/
  project-context.md
  runs/
    2026-05-16-overnight/
      run-summary.md
      config.snapshot.yaml
      tasks/
        TASK-001/
          task.md
          plan.md
          plan-review.md
          implementation-log.md
          test-output.txt
          static-output.txt
          review.md
          final-notes.md
          diff.patch
          context-out.md
```
---
# 13. Overnight Report
At completion NightShift generates:
* Completed tasks
* Failed tasks
* Retry counts
* Files modified
* Test results
* Reviewer summaries
* Remaining issues
* Suggested follow-up work
The goal is:
> Wake up to a review package.
---
# 14. Future Directions
Potential future features:
* Parallel task execution
* DAG workflows
* Distributed workers
* Sandboxed containers
* Git branch isolation
* Agent tournaments
* Constraint language experimentation
* Prompt A/B testing
* Semantic memory systems
* Multi-repo orchestration
* Web dashboard
* Cost telemetry
* Human approval gates
---
# 15. Risks
## 15.1 Context poisoning
Mitigation:
* Context compaction
* Retry summarization
* Structured stage boundaries
---
## 15.2 Agent loops
Mitigation:
* Explicit retry counts
* Deterministic transitions
* Timeout handling
---
## 15.3 Repository damage
Mitigation:
* Scoped directories
* Command restrictions
* Validation stages
---
## 15.4 Cost explosion
Mitigation:
* Local-first execution
* Context minimization
* Escalation-only expensive models
---
# 16. Implemented Baseline
The MVP and post-MVP phases through phase 22 are implemented.
NightShift currently provides:
* `nightshift init` for starter project generation
* `nightshift validate` for config, prompt, task, dependency, path, and command validation
* `nightshift status` for read-only project inspection
* `nightshift run` for the next runnable incomplete task
* `nightshift run --task TASK-ID` for a specific task
* `nightshift run --all` for sequential multi-task execution
* `nightshift web` for a read-only artifact dashboard
* Markdown task parsing with descriptions, acceptance criteria, completion state, and dependency bullets
* Dependency validation for missing references and simple cycles
* Dependency-aware task selection and task blocking
* Declarative YAML pipeline execution
* Command, agent, agent-review, review, and summarize stage handling
* Retry redirection with a configured task retry limit
* Command-backed agents
* Ollama-backed local model agents
* Prompt bundle construction with project, task, retry, and previous-stage context
* Prompt snapshots and run metadata for experiment comparison
* Optional experiment labels and prompt variant metadata
* Command allowlists and forbidden-fragment checks
* Optional shell-free command execution
* Per-stage command timeouts
* Project-root-restricted command working directories
* Environment variable allowlists for command stages
* Scoped path and artifact path safety checks
* Optional clean-worktree enforcement
* Pre-run and post-run git status artifacts
* Per-task `diff.patch` artifacts
* Task completion mutation for successful runs
* Per-run and per-task markdown/text artifacts
* Project, task, retry, and context-out files
* Final task notes, stage summaries, task completion artifacts, and run summaries
* Documentation for config, artifact review, troubleshooting, and quickstart workflows
* A complete fake-agent quickstart Lisp example under `examples/quickstart-lisp/`
The system remains sequential and local-first. It is designed to produce reviewable artifacts and repository state, not to deploy, push, or autonomously ship changes.

---
# 17. Current Product Shape
The implemented product is now a practical local runner rather than only a single-task MVP.
## 17.1 CLI Workflow
Common workflow:
```text
nightshift init
nightshift validate
nightshift status
nightshift run
nightshift run --task TASK-001
nightshift run --all
nightshift web
```
The CLI can validate a project, select runnable tasks, enforce dependencies, run one or more tasks, and report artifact locations.
## 17.2 Artifact Workflow
Artifacts are still the primary audit surface.
Current run artifacts include:
```text
.nightshift/
  project-context.md
  runs/
    <run-id>/
      run-summary.md
      config.snapshot.yaml
      run-metadata.md
      prompts/
        <agent-id>.md
      tasks/
        TASK-001/
          task.md
          context.md
          plan.md
          implementation-log.md
          test-output.txt
          review.md
          stage-results.md
          context-out.md
          task-completion.md
          git-status-before.txt
          git-status-after.txt
          diff.patch
          final-notes.md
```
Exact task artifact names depend on configured stage `output` values.
## 17.3 Dashboard Workflow
The web dashboard is read-only and artifact-driven.
It currently:
* Lists runs from `.nightshift/runs/`
* Shows run summaries
* Links to text and markdown artifacts
* Safely rejects artifact path traversal
* Auto-refreshes
It does not:
* Start or stop runs
* Mutate config or tasks
* Provide approval gates
* Stream live process output
* Authenticate users
## 17.4 Known Limitations
Current limitations:
* Execution is sequential; there is no parallel task runner.
* The web dashboard is read-only and artifact-oriented.
* Live run progress is limited to basic CLI prints and artifact inspection.
* Flask is optional; `nightshift web` requires it to be installed.
* Ollama support depends on the user's local Ollama installation and model availability.
* Git artifacts can be unavailable or degraded in non-git repositories or repositories blocked by Git safe-directory rules.
* Task mutation is intentionally minimal and only flips matching checklist lines.
* Command configuration is safer than the MVP but is still string-first for compatibility.
* There is no branch isolation, resumable run state machine, approval workflow, or deployment integration.
---
# 18. Next Major Update Plan
The next major update should improve operational visibility while preserving the current artifact-first model.
Phase work is tracked in this design document by updating the relevant phase checklist and adding concise implementation notes only when a decision needs durable context. The old `docs/devlog/` phase files have been retired.
## Phase 23: Improved Logging and Live Visibility
NightShift should make active runs easier to observe from both the CLI and the web dashboard.
Implementation tasks:
* [x] Add a small logging module with structured operational events.
* [x] Stream human-readable progress to the CLI during `run` and `run --all`.
* [x] Include run id, task id, stage id, agent/backend, command index, retry count, status, duration, and artifact path where available.
* [x] Write a per-run log file such as `.nightshift/runs/<run-id>/run.log`.
* [x] Optionally write or rotate an aggregate `.nightshift/nightshift.log` for cross-run troubleshooting.
* [x] Keep logs operational; do not duplicate full prompts, full model responses, or full command output that already lives in artifacts.
* [x] Redact or avoid secrets from logged environment/config values.
* [x] Add dashboard support for viewing the latest log tail.
* [x] Cap the dashboard log view to the last 100 lines by default.
* [x] Keep the full per-run log file available as an artifact unless a later size cap is configured.
* [x] Auto-refresh the dashboard log view with the existing dashboard refresh model.
* [x] Add tests for log writing, CLI progress hooks, dashboard log rendering, missing log files, and the 100-line cap.
Acceptance Criteria:
* A user running NightShift from a terminal can tell which task and stage are active.
* Long Ollama or command stages show enough lifecycle information that the process does not appear hung.
* The latest run log is visible from `nightshift web`.
* The web client displays at most the last 100 log lines by default.
* Logs point users to detailed artifacts instead of replacing them.
* Missing or partial log files do not crash the dashboard.
Notes:
* This phase should not add process control, websockets, authentication, or write actions to the web client.
* If future live streaming is needed, the first version can still use file tailing plus refresh before introducing websockets.
* Operational logs should complement artifacts: artifacts remain the source of detailed prompts, responses, command output, diffs, and summaries.
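
The 100-line cap and the requirement that missing logs must not crash the dashboard can both be met with a bounded tail read. This is a sketch, not the dashboard's code: `tail_lines` and `tail_log` are hypothetical names, and returning an empty list for an unreadable file is an assumption.

```python
from collections import deque
from typing import Iterable


def tail_lines(lines: Iterable[str], max_lines: int = 100) -> list[str]:
    """Keep only the last max_lines items, holding at most that many in memory."""
    return [line.rstrip("\n") for line in deque(lines, maxlen=max_lines)]


def tail_log(path: str, max_lines: int = 100) -> list[str]:
    """Return the log tail, or an empty list if the file cannot be read."""
    try:
        # errors="replace" keeps a partially written log from raising.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            return tail_lines(handle, max_lines)
    except OSError:
        return []
```

Because `deque(..., maxlen=n)` discards older items as it consumes the file, memory use stays bounded even for a large `run.log`.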
## Phase 24: Per-Agent Model Parameters
- [x] Add `temperature` to agent config.
- [x] Pass temperature to Ollama/OpenAI-compatible backends.
- [x] Default safely if omitted.
- [x] Add config validation tests.
## Phase 25: Repo Lookup Tools MVP
- [x] Add tool interface for repo operations.
- [x] Implement scoped `list_files`.
- [x] Implement scoped `read_file`.
- [x] Implement scoped `grep`.
- [x] Enforce existing path safety rules.
- [x] Log tool calls as artifacts.
## Phase 26: Planner Code-Discovery Support
- [x] Teach planner prompt to request needed code context.
- [x] Add structured planner output for lookup requests.
- [x] Execute requested lookup tools.
- [x] Save `files-inspected.md`.
- [x] Re-run planner with retrieved context.
## Phase 27: Context Pack Builder
- [x] Add `repo_context` stage.
- [x] Generate `context-pack.md`.
- [x] Include task, acceptance criteria, relevant files, snippets, and constraints.
- [x] Add line-numbered excerpts.
- [x] Add context-size caps.
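
Line-numbered excerpts with a size cap can be sketched as below. `numbered_excerpt`, its character-based cap, and the truncation marker text are hypothetical choices for illustration.

```python
def numbered_excerpt(source: str, start: int, end: int, max_chars: int = 2000) -> str:
    """Render lines start..end (1-indexed, inclusive) with line numbers.

    Stops early once adding another line would exceed max_chars.
    """
    lines = source.splitlines()
    out: list[str] = []
    used = 0
    for number in range(max(start, 1), min(end, len(lines)) + 1):
        rendered = f"{number:>5}: {lines[number - 1]}"
        cost = len(rendered) + 1  # +1 for the newline that joins lines
        if used + cost > max_chars:
            out.append("[excerpt truncated at size cap]")
            break
        out.append(rendered)
        used += cost
    return "\n".join(out)
```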
## Phase 28: Project Context Chart MVP
- [x] Generate `.nightshift/project-context-chart.md`.
- [x] Include files, responsibilities, functions/classes, entry points, tests.
- [x] Use simple regex/parser MVP.
- [x] Update chart during planning.
- [x] Store anchors/line numbers/search terms.
## Phase 29: Code Writer Stage
- [x] Add `code_writer` stage type.
- [x] Feed it task + context pack.
- [x] Require unified diff output.
- [x] Save `proposed.patch`.
- [x] Save `implementation-summary.md`.
## Phase 30: Patch Normalization
- [x] Add `patch_normalizer` stage.
- [x] Support low-temperature formatter model.
- [x] Convert messy model output to valid unified diff.
- [x] Reject missing/ambiguous edits.
- [x] Save `normalized.patch`.
## Phase 31: Patch Validation
- [x] Parse unified diffs.
- [x] Reject malformed patches.
- [x] Enforce scoped paths.
- [x] Reject path traversal.
- [x] Enforce max files/max lines changed.
- [x] Reject forbidden files.
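
The checks in this phase can be sketched with a small unified-diff scanner. This is an illustration under stated assumptions, not the project's validator: `validate_patch`, its default limits, and its messages are hypothetical, and only `---`/`+++` headers and `@@` hunk headers are interpreted.

```python
import re
from pathlib import PurePosixPath

HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def _header_path(line: str) -> str:
    """Extract the path from a '--- ' or '+++ ' header, dropping a/ or b/."""
    path = line[4:].split("\t")[0].strip()
    return path[2:] if path[:2] in ("a/", "b/") else path


def validate_patch(patch: str, scoped_paths: list[str],
                   max_files: int = 10, max_changed_lines: int = 400) -> list[str]:
    """Return a list of problems; an empty list means the patch passed."""
    problems: list[str] = []
    files: list[str] = []
    old_path = ""
    changed = old_left = new_left = 0
    for line in patch.splitlines():
        if old_left > 0 or new_left > 0:  # inside a hunk body
            if line.startswith("+"):
                new_left -= 1
                changed += 1
            elif line.startswith("-"):
                old_left -= 1
                changed += 1
            elif not line.startswith("\\"):  # context line
                old_left -= 1
                new_left -= 1
            continue
        hunk = HUNK.match(line)
        if hunk:
            if not files:
                problems.append("hunk appears before any file header")
            # An omitted count in a hunk header means 1.
            old_left = int(hunk.group(1) or 1)
            new_left = int(hunk.group(2) or 1)
        elif line.startswith("--- "):
            old_path = _header_path(line)
        elif line.startswith("+++ "):
            new_path = _header_path(line)
            # Deletions name /dev/null as the new file; use the old path.
            files.append(old_path if new_path == "/dev/null" else new_path)
    if old_left > 0 or new_left > 0:
        problems.append("patch ends inside a hunk")
    if not files:
        problems.append("no file headers found")
    if len(files) > max_files:
        problems.append(f"too many files: {len(files)} > {max_files}")
    if changed > max_changed_lines:
        problems.append(f"too many changed lines: {changed} > {max_changed_lines}")
    for path in files:
        if path.startswith("/") or ".." in PurePosixPath(path).parts:
            problems.append(f"path escapes the project: {path}")
        elif not any(path.startswith(s.rstrip("/") + "/") for s in scoped_paths):
            problems.append(f"path outside scoped_paths: {path}")
    return problems
```

Tracking the remaining line counts from each hunk header keeps removed lines such as `-- comment` (which appear as `--- comment` in a diff) from being mistaken for file headers.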
## Phase 32: Patch Apply / Dry Run
- [ ] Add `patch_apply` stage.
- [ ] Support `mode: dry_run`.
- [ ] Support `mode: apply`.
- [ ] Save `applied.patch`.
- [ ] Preserve pre/post git status.
- [ ] Fail cleanly on apply errors.
## Phase 33: Test Feedback Repair Loop
- [ ] Feed test/static failure output back into implementer.
- [ ] Add bounded repair attempts.
- [ ] Save each repair patch.
- [ ] Save repair summaries.
- [ ] Stop after max retry count.
## Phase 34: End-to-End Coding Quickstart
- [ ] Update quickstart to modify real code.
- [ ] Include fake-agent test fixture.
- [ ] Demonstrate lookup → context pack → patch → apply → test.
- [ ] Document dry-run vs apply mode.
---
# Appendix A: Design Decisions and Rationale
## A.1 Local-first architecture
Decision:
* Prefer local models and local execution
Reasoning:
* Cheapness-first design
* Better experimentation
* Better privacy
* Reduced vendor dependency
* Better overnight scalability
---
## A.2 State machine over DAG
Decision:
* Use configurable state-machine workflows
Reasoning:
* One-task-at-a-time execution
* Retry loops are primary workflow behavior
* Easier auditing
* Easier debugging
* Simpler MVP
---
## A.3 YAML configuration
Decision:
* Use declarative YAML config
Reasoning:
* Human-readable
* Easier nested workflow representation
* Safer than arbitrary Python
* Better portability
---
## A.4 Cheapness-first model routing
Decision:
* Use expensive models selectively
Reasoning:
* Overnight pipelines can become token-expensive
* Local models are sufficient for many stages
* Review stages benefit more from premium models
---
## A.5 Strict repository scoping
Decision:
* Limit writable paths and executable commands
Reasoning:
* Prevent accidental damage
* Maintain trust in unattended execution
* Improve auditability
---
## A.6 Reviewable output over autonomy
Decision:
* Produce review packages rather than autonomous shipping
Reasoning:
* Human review remains critical
* Improves safety
* Improves correctness
* Keeps architecture grounded and practical
---
## A.7 Layered context model
Decision:
* Separate project, task, and retry context
Reasoning:
* Reduces token usage
* Prevents context explosion
* Improves signal quality
* Prevents recursive drift
---
## A.8 Artifact-heavy architecture
Decision:
* Persist plans, logs, reviews, outputs, and summaries
Reasoning:
* Debugging
* Prompt experimentation
* A/B testing
* Replayability
* Portfolio visibility
---
## A.9 No parallelism in v1
Decision:
* Execute one task at a time
Reasoning:
* Simpler correctness model
* Easier debugging
* Easier repository safety
* Easier context management
---
## A.10 Declarative pipelines first
Decision:
* No arbitrary Python hooks in v1
Reasoning:
* Safer execution
* Easier reproducibility
* Easier auditing
* Easier portability
---
# Closing Statement
NightShift is intended to explore a practical middle ground between:
* Fully manual software engineering
* Reckless autonomous agent systems
The system assumes that AI agents are useful but unreliable.
NightShift therefore treats agents as bounded workers inside deterministic, auditable, test-driven workflows.
The primary output is not blind autonomy.
The primary output is trustworthy leverage.