NightShift
Auditable Local-First AI Coding Pipelines
Version: v0.1 Draft
Author: K455
Status: Design Proposal
1. Executive Summary
NightShift is a local-first AI pipeline runner designed to execute long-running coding workflows against a constrained project workspace.
The system is intended to run overnight or unattended for extended periods while remaining:
- Cheap
- Correct
- Auditable
- Safe
- Reviewable
NightShift is not designed to be a fully autonomous "AI software engineer." Instead, it is a deterministic orchestration system that allows fallible AI agents to operate within constrained, test-driven, auditable workflows.
The core philosophy is:
Treat LLMs like unreliable distributed systems.
Agents are bounded by:
- Scoped repository access
- Structured stage contracts
- Explicit retry behavior
- Tests and static checks
- Review stages
- Context compaction
- Artifact logging
The intended workflow is:
- User provides:
  - Repository
  - Task list
  - Pipeline configuration
  - Agent definitions
- NightShift:
  - Selects the next task
  - Generates a plan
  - Reviews the plan
  - Implements changes
  - Runs tests/static analysis
  - Reviews results
  - Retries if necessary
  - Produces an overnight report
The result is a reviewable repository state and a full audit trail of AI behavior.
2. Goals
2.1 Primary Goals
Local-first execution
The system should work primarily with local models and local execution environments.
Examples:
- Ollama
- Local transformers
- Local agent runtimes
- Claude Code (locally run CLI backed by a hosted model)
- Codex CLI (locally run CLI backed by a hosted model)
Long-running unattended workflows
NightShift should support:
- Overnight execution
- Large task chains
- Multi-stage workflows
- Automated retries
- Context handoff between stages
Auditability
Every important action should be recorded.
Users should be able to inspect:
- Prompts
- Plans
- Reviews
- Command outputs
- Diffs
- Test results
- Retry reasoning
- Final summaries
Cheapness-first execution
The orchestration layer should assume:
- Cheap local models handle most work
- Expensive models are escalation layers
- Context size matters
- Token usage matters
- Retry cost matters
Safe repository boundaries
The system should:
- Restrict file access
- Restrict shell commands
- Avoid destructive operations
- Minimize repository damage
2.2 Non-Goals (v1)
The following are intentionally out of scope for v1:
- Fully autonomous software development
- Parallel distributed execution
- Automatic deployment
- Cloud-native orchestration
- Dynamic self-modifying pipelines
- Autonomous internet access
- Agent swarms
- Arbitrary Python execution hooks
- Automatic git pushes
- Full DAG orchestration
3. Design Philosophy
NightShift is built around several core principles.
3.1 Deterministic orchestration
Agents are nondeterministic.
The orchestration system should not be.
Pipeline behavior should be:
- Predictable
- Reproducible
- Configurable
- Explicit
3.2 Structured state transitions
NightShift uses a state-machine workflow model.
A task moves through defined stages:
Task Queue
-> Plan
-> Plan Review
-> Implement
-> Test
-> Static Check
-> Review
-> Retry / Complete
Each stage produces:
status: pass | fail | retry | escalate
reason: string
next_stage: optional
context_update: optional
This allows the pipeline runner to remain deterministic even while agents are probabilistic.
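The stage contract above can be sketched as a small validated record. This is a minimal illustration, not the NightShift implementation: the names `StageResult` and `parse_stage_result` are hypothetical, and the rule that `reason` must be non-empty is an assumption added here.

```python
from dataclasses import dataclass
from typing import Optional

# Status values taken from the contract in section 3.2.
VALID_STATUSES = {"pass", "fail", "retry", "escalate"}


@dataclass(frozen=True)
class StageResult:
    """Result record emitted by every stage (hypothetical name)."""
    status: str
    reason: str
    next_stage: Optional[str] = None
    context_update: Optional[str] = None


def parse_stage_result(raw: dict) -> StageResult:
    """Validate an untrusted mapping and convert it into a StageResult."""
    status = raw.get("status")
    if not isinstance(status, str) or status not in VALID_STATUSES:
        raise ValueError(f"invalid status: {status!r}")
    reason = raw.get("reason")
    # Assumption: an empty reason is rejected so the audit trail stays useful.
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return StageResult(
        status=status,
        reason=reason,
        next_stage=raw.get("next_stage"),
        context_update=raw.get("context_update"),
    )
```

Rejecting malformed results at this boundary is one way to keep the runner's transitions deterministic: the runner only ever sees one of four known statuses.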
3.3 Context compaction
Agents should not inherit unlimited history.
Instead:
- Project-level context is persistent and compact
- Task-level context is scoped
- Retry context is summarized
- Stage context is minimized
This reduces:
- Token costs
- Context poisoning
- Hallucination drift
- Recursive confusion
3.4 Reviewability over autonomy
NightShift is optimized to produce:
- Reviewable code
- Reviewable reports
- Reviewable reasoning
The primary output is:
A useful morning review state.
Not:
Fully autonomous shipping.
4. Architecture Overview
4.1 High-Level Components
+-------------------+
|    Task Parser    |
+-------------------+
          |
          v
+-------------------+
|  Pipeline Runner  |
+-------------------+
          |
          v
+-------------------+
|  Stage Executor   |
+-------------------+
      |       |
      |       +----------------+
      |                        |
      v                        v
+-----------+         +----------------+
| Agent API |         | Command Runner |
+-----------+         +----------------+
      |                        |
      v                        v
+-----------+         +----------------+
| LLM Model |         | Test/Lint/etc  |
+-----------+         +----------------+
4.2 Core Components
Task Parser
Responsible for:
- Reading markdown task files
- Parsing acceptance criteria
- Tracking completion state
- Determining dependencies
Pipeline Runner
Responsible for:
- Stage orchestration
- Retry logic
- State transitions
- Artifact management
- Context propagation
Stage Executor
Responsible for:
- Executing stage definitions
- Calling agents
- Running commands
- Collecting outputs
Agent Layer
Responsible for:
- Prompt construction
- Model backend integration
- Structured output parsing
- Context injection
Command Runner
Responsible for:
- Executing tests
- Static analysis
- Formatting
- Shell command restrictions
- Sandboxing
5. Workflow Model
5.1 State Machine Model
NightShift uses a configurable state-machine workflow.
This was selected over:
- DAG orchestration
- Arbitrary scripting
because:
- v1 executes one task at a time
- Retry loops are first-class
- Auditability is easier
- Deterministic transitions are simpler
5.2 Default Pipeline
PLAN
↓
REVIEW_PLAN
↓
IMPLEMENT
↓
TEST
↓
STATIC_ANALYSIS
↓
REVIEW
↓
DECISION
Decision outcomes:
- COMPLETE
- RETRY_IMPLEMENTATION
- RETRY_PLANNING
- FAIL
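A rough sketch of how a deterministic runner might walk the default pipeline follows. The stage names come from section 5.2. The `ON_FAIL` mapping, the `handlers` callables (which return True on pass), and the single shared retry counter are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

STAGES = ["PLAN", "REVIEW_PLAN", "IMPLEMENT", "TEST", "STATIC_ANALYSIS", "REVIEW"]

# Assumed retry targets: a rejected plan goes back to planning
# (RETRY_PLANNING); later failures go back to implementation
# (RETRY_IMPLEMENTATION). Stages without an entry fail the task.
ON_FAIL = {
    "REVIEW_PLAN": "PLAN",
    "TEST": "IMPLEMENT",
    "STATIC_ANALYSIS": "IMPLEMENT",
    "REVIEW": "IMPLEMENT",
}


def run_task(
    handlers: Dict[str, Callable[[], bool]], max_retries: int = 3
) -> Tuple[str, List[str]]:
    """Walk the stages in order; return (outcome, trace of visited stages)."""
    trace: List[str] = []
    retries = 0
    index = 0
    while index < len(STAGES):
        stage = STAGES[index]
        trace.append(stage)
        if handlers[stage]():
            index += 1  # stage passed, move to the next one
            continue
        target = ON_FAIL.get(stage)
        if target is None or retries >= max_retries:
            return "FAIL", trace
        retries += 1
        index = STAGES.index(target)  # jump back to the retry target
    return "COMPLETE", trace
```

Because the loop depends only on the handlers' pass/fail answers and a bounded counter, the same sequence of answers always yields the same trace, which is what makes the run auditable.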
5.3 Configurable Pipelines
Pipelines are defined declaratively.
Users may:
- Reorder stages
- Add/remove stages
- Define retry behavior
- Use different models
- A/B test prompts
- Experiment with reasoning structures
6. Configuration System
6.1 Configuration Format
NightShift uses YAML configuration files.
Reasons:
- Human-readable
- Good nested structure support
- Easier workflow representation than TOML
- Safer than arbitrary Python execution
6.2 Example Configuration
project:
  name: my-project
  root: .
  task_file: tasks.md
  artifact_dir: .nightshift

safety:
  require_clean_worktree: true
  scoped_paths:
    - src/
    - tests/
  forbidden_commands:
    - rm -rf
    - git push
  allowed_commands:
    - cargo test
    - cargo fmt
    - cargo clippy

agents:
  planner:
    backend: ollama
    model: qwen2.5-coder:14b
    system_prompt: agents/planner.md
  implementer:
    backend: claude-code
    model: sonnet
    system_prompt: agents/implementer.md
  reviewer:
    backend: ollama
    model: deepseek-r1:32b
    system_prompt: agents/reviewer.md

pipeline:
  max_task_retries: 3
  stages:
    - id: plan
      type: agent
      agent: planner
    - id: review_plan
      type: review
      agent: reviewer
      on_fail: plan
    - id: implement
      type: agent
      agent: implementer
    - id: test
      type: command
      commands:
        - cargo test
    - id: static
      type: command
      commands:
        - cargo fmt --check
        - cargo clippy -- -D warnings
    - id: review
      type: review
      agent: reviewer
      on_fail: implement
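Before a long unattended run, it may be worth checking that a configuration like the one above is internally consistent. The sketch below assumes the YAML has already been loaded into a plain dictionary. The function name `validate_pipeline` and the specific checks are illustrative assumptions, not a defined NightShift API.

```python
from typing import Dict, List


def validate_pipeline(config: Dict) -> List[str]:
    """Return human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    agents = config.get("agents", {})
    stages = config.get("pipeline", {}).get("stages", [])
    stage_ids = [stage.get("id") for stage in stages]

    if len(set(stage_ids)) != len(stage_ids):
        problems.append("stage ids must be unique")

    for stage in stages:
        sid = stage.get("id")
        kind = stage.get("type")
        if kind in ("agent", "review"):
            # Agent-backed stages must name an agent defined under `agents`.
            if stage.get("agent") not in agents:
                problems.append(f"stage {sid}: unknown agent {stage.get('agent')!r}")
        elif kind == "command":
            if not stage.get("commands"):
                problems.append(f"stage {sid}: no commands listed")
        else:
            problems.append(f"stage {sid}: unknown type {kind!r}")
        # on_fail must point at a stage that exists in this pipeline.
        target = stage.get("on_fail")
        if target is not None and target not in stage_ids:
            problems.append(f"stage {sid}: on_fail points at unknown stage {target!r}")
    return problems
```

Failing fast on a dangling `on_fail` target or a misspelled agent name is cheaper than discovering it several hours into an overnight run.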
7. Task System
7.1 Task Format
Tasks are defined in markdown.
Example:
- [ ] TASK-001: Add retry support to pipeline runner
  Acceptance Criteria:
  - Retries configurable per stage
  - Retry summaries persisted
  - Retry count visible in final report
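A task file in this shape could be parsed with a few regular expressions. The sketch below is only one possible reading of the format: the `Task` record, the `parse_tasks` name, and the assumption that `[x]` marks a completed task are all illustrative.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Matches lines such as "- [ ] TASK-001: Some title".
TASK_RE = re.compile(r"^\s*- \[(?P<done>[ xX])\] (?P<id>TASK-\d+): (?P<title>.+)$")
BULLET_RE = re.compile(r"^\s*- (?P<text>.+)$")


@dataclass
class Task:
    id: str
    title: str
    done: bool
    criteria: List[str] = field(default_factory=list)


def parse_tasks(markdown: str) -> List[Task]:
    """Collect tasks and the acceptance criteria listed under each one."""
    tasks: List[Task] = []
    in_criteria = False
    for line in markdown.splitlines():
        header = TASK_RE.match(line)
        if header:
            tasks.append(Task(header["id"], header["title"].strip(), header["done"] != " "))
            in_criteria = False
        elif line.strip().lower() == "acceptance criteria:":
            in_criteria = bool(tasks)  # ignore criteria that precede any task
        elif in_criteria and (bullet := BULLET_RE.match(line)):
            tasks[-1].criteria.append(bullet["text"].strip())
    return tasks
```

Checking the task header pattern before the generic bullet pattern keeps a new task line from being swallowed as a criterion of the previous task.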
7.2 Task Lifecycle
Each task:
- Is parsed
- Is assigned a workspace
- Receives planning
- Receives implementation
- Is validated
- Is reviewed
- Produces artifacts
- Is marked complete or failed
7.3 Task Dependencies
Future versions may support:
TASK-003 depends on TASK-001
However:
- Tasks should remain independently testable when possible
- Pipelines should maintain a buildable repository state
8. Agent Model
8.1 Agent Roles
Agents are specialized.
Example roles:
- planner
- implementer
- reviewer
- summarizer
- test-writer
8.2 Agent Definitions
Agents are configurable.
Each agent defines:
- Backend
- Model
- System prompt
- Constraints
- Output schema
8.3 Multi-Backend Support
NightShift should support:
- Ollama
- Claude Code
- Codex CLI
- Future local runners
This allows:
- Cheap local planning
- Expensive selective escalation
- Hybrid pipelines
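One way to picture selective escalation is an ordered list of backends, cheapest first, tried in turn. In this sketch the `Backend` pair, the `run_with_escalation` name, and the convention that a backend returns `None` when it cannot produce an acceptable result are all assumptions. Real backends would wrap Ollama, Claude Code, or Codex CLI calls.

```python
from typing import Callable, List, Optional, Tuple

# A backend is (name, callable). The callable returns the model output, or
# None when it could not produce an acceptable result (assumed convention).
Backend = Tuple[str, Callable[[str], Optional[str]]]


def run_with_escalation(prompt: str, backends: List[Backend]) -> Tuple[str, str, List[str]]:
    """Try backends cheapest first; return (backend used, output, attempts)."""
    attempted: List[str] = []
    for name, call in backends:
        attempted.append(name)
        output = call(prompt)
        if output is not None:
            return name, output, attempted
    raise RuntimeError(f"all backends failed: {attempted}")
```

Returning the list of attempted backends alongside the output lets the run record which escalations actually happened, which feeds both the audit trail and any later cost accounting.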
8.4 Structured Outputs
Agents should emit machine-readable results.
Example:
status: pass
summary: |
  Tests succeeded.
issues:
  - None
next_stage: review
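Agents will sometimes emit output that does not parse. A possible way to handle this is to convert any malformed reply into a `retry` result rather than crashing the run. The sketch below assumes the agent is asked for JSON instead of the YAML-like form shown above, purely so the example needs only the standard library. The function name and the fallback policy are hypothetical.

```python
import json
from typing import Dict

# Status values from the stage contract in section 3.2.
ALLOWED_STATUSES = ("pass", "fail", "retry", "escalate")


def parse_agent_output(text: str) -> Dict:
    """Parse an agent reply; malformed replies become a retry result."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return {"status": "retry", "summary": f"unparseable output: {exc.msg}", "issues": []}
    if not isinstance(data, dict) or data.get("status") not in ALLOWED_STATUSES:
        return {"status": "retry", "summary": "missing or invalid status", "issues": []}
    # Fill optional fields so downstream code can rely on their presence.
    data.setdefault("summary", "")
    data.setdefault("issues", [])
    return data
```

Mapping parse failures onto an ordinary status keeps them inside the normal retry budget instead of creating a separate error path.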
9. Context System
9.1 Context Layers
NightShift uses layered context.
Project Context
Long-lived information:
- Architecture
- Coding standards
- Constraints
- Previous summaries
Task Context
Task-specific information:
- Acceptance criteria
- Relevant files
- Prior retries
- Implementation notes
Retry Context
Compact summaries of:
- Previous failures
- Previous reviews
- Previous test errors
9.2 Context Compaction
Every stage should summarize its output before handing it to the next stage.
This prevents:
- Infinite context growth
- Token explosion
- Recursive hallucination
- Low-signal history accumulation
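The layered model can be sketched as a function that assembles project, task, and retry context under a fixed size budget. Everything here is an assumption for illustration: the character-based budget, the 50/25/25 split, keeping only the latest retries, and the use of truncation as a stand-in for a real summarization step.

```python
from typing import List


def compact(text: str, max_chars: int) -> str:
    """Truncate text to max_chars, marking the cut. Stand-in for a summarizer."""
    if len(text) <= max_chars:
        return text
    marker = " [truncated]"
    return text[: max(0, max_chars - len(marker))] + marker


def build_context(
    project: str,
    task: str,
    retries: List[str],
    budget: int = 4000,
    keep_retries: int = 2,
) -> str:
    """Assemble layered context: project, task, then only the latest retries."""
    recent = retries[-keep_retries:] if keep_retries > 0 else []
    sections = [
        ("Project Context", compact(project, budget // 2)),
        ("Task Context", compact(task, budget // 4)),
        ("Retry Context", compact("\n".join(recent), budget // 4)),
    ]
    # Empty layers are skipped so they cost no tokens at all.
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections if body)
```

Giving each layer its own share of the budget means a noisy retry history cannot crowd out the project-level constraints.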
10. Safety Model
10.1 Repository Scope Restrictions
NightShift should restrict:
- Accessible directories
- Writable paths
- Executable commands
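A path-scope check along these lines could back the `scoped_paths` setting from the example configuration. The helper name `is_path_allowed` is hypothetical. The sketch resolves paths before comparing them so that `..` segments and symlinks cannot escape the allowed directories.

```python
from pathlib import Path
from typing import Iterable


def is_path_allowed(candidate: str, root: Path, scoped_paths: Iterable[str]) -> bool:
    """True when candidate resolves to a location inside one of the scoped paths."""
    root = root.resolve()
    # Joining with an absolute candidate yields the candidate itself, so
    # absolute paths outside the root are rejected by the checks below.
    target = (root / candidate).resolve()
    for scope in scoped_paths:
        scope_dir = (root / scope).resolve()
        if target == scope_dir or scope_dir in target.parents:
            return True
    return False
```

Comparing resolved paths rather than raw strings avoids prefix tricks such as `src/../Cargo.toml` or a sibling directory named `src-old/`.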
10.2 Command Restrictions
Commands are allowlisted.
Potentially dangerous commands are forbidden.
Examples:
Forbidden:
- rm -rf
- git push
- curl | bash
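An allowlist check might look like the following sketch. It assumes prefix matching on tokens, so that an allowlisted `cargo fmt` also permits `cargo fmt --check`. That rule, the blanket rejection of shell metacharacters, and the name `is_command_allowed` are assumptions, not behavior defined by this document.

```python
import shlex
from typing import List

# Characters that enable pipes, chaining, redirection, or substitution.
SHELL_METACHARACTERS = set("|;&><`$")


def is_command_allowed(command: str, allowed: List[str], forbidden: List[str]) -> bool:
    """Allow a command only if its leading tokens match an allowlist entry."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # for example, unbalanced quotes

    def starts_with(prefix: str) -> bool:
        parts = shlex.split(prefix)
        return bool(parts) and tokens[: len(parts)] == parts

    # Forbidden entries win even if an allowlist entry would also match.
    if any(starts_with(entry) for entry in forbidden):
        return False
    return any(starts_with(entry) for entry in allowed)
```

Under this approach, `curl | bash` is rejected twice over: it contains a pipe, and `curl` is not on the allowlist. A command that passes the check should then be executed as an argument list without a shell, so the tokens that were checked are the ones that run.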
10.3 Clean Worktree Requirement
v1 may require a clean worktree before execution, meaning that git status reports no pending changes.
The require_clean_worktree setting controls this check.
This simplifies:
- Auditability
- Recovery
- Diff inspection
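The clean-worktree check can be sketched with `git status --porcelain`, which prints one line per changed or untracked path and nothing when the worktree is clean. The function names are hypothetical, and the sketch assumes `git` is on the PATH and that `repo` is inside a Git repository.

```python
import subprocess
from pathlib import Path


def porcelain_is_clean(porcelain_output: str) -> bool:
    """Empty `git status --porcelain` output means no pending changes."""
    return porcelain_output.strip() == ""


def worktree_is_clean(repo: Path) -> bool:
    """Run git in repo and report whether the worktree has no pending changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails, e.g. not a repository
    )
    return porcelain_is_clean(result.stdout)
```

Splitting the parsing from the subprocess call keeps the decision logic testable without a real repository.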
11. Testing and Validation
11.1 Validation Pipeline
Validation occurs in multiple stages:
Tests
↓
Static Analysis
↓
Review Agent
↓
Decision
11.2 Global Test Suite
Tests are global: every task is validated against the full test suite, not only the tests written for that task.
Rationale:
- New changes must not break old functionality
- Pipeline should maintain cumulative stability
11.3 Generated Tests
Agents may generate tests for features.
Generated tests become part of the persistent suite.
12. Artifact System
12.1 Artifact Goals
Artifacts provide:
- Auditability
- Replayability
- Debugging
- Historical inspection
- Prompt experimentation
12.2 Example Layout
.nightshift/
  project-context.md
  runs/
    2026-05-16-overnight/
      run-summary.md
      config.snapshot.yaml
      tasks/
        TASK-001/
          task.md
          plan.md
          plan-review.md
          implementation-log.md
          test-output.txt
          static-output.txt
          review.md
          final-notes.md
          diff.patch
          context-out.md
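Writing artifacts into a layout like the one above could be handled by a small helper. The names `artifact_path` and `write_artifact` are hypothetical, and the component check is an added assumption intended to keep agent-supplied names from escaping the artifact directory.

```python
from pathlib import Path


def artifact_path(artifact_dir: Path, run_id: str, task_id: str, name: str) -> Path:
    """Build <artifact_dir>/runs/<run_id>/tasks/<task_id>/<name>."""
    for part in (run_id, task_id, name):
        # Reject separators and dot segments so a name cannot traverse upward.
        if not part or "/" in part or "\\" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    return artifact_dir / "runs" / run_id / "tasks" / task_id / name


def write_artifact(
    artifact_dir: Path, run_id: str, task_id: str, name: str, content: str
) -> Path:
    """Write one artifact file, creating parent directories as needed."""
    path = artifact_path(artifact_dir, run_id, task_id, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```

For example, `write_artifact(Path(".nightshift"), "2026-05-16-overnight", "TASK-001", "plan.md", plan_text)` would produce the `plan.md` entry shown in the layout.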
13. Overnight Report
At completion NightShift generates:
- Completed tasks
- Failed tasks
- Retry counts
- Files modified
- Test results
- Reviewer summaries
- Remaining issues
- Suggested follow-up work
The goal is:
Wake up to a review package.
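As a rough illustration, the report could be rendered from per-task outcome records. The `TaskOutcome` fields below cover only part of the list above (status, retry counts, files modified, reviewer summaries), and both the record and the output layout are assumptions rather than a defined report format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskOutcome:
    task_id: str
    status: str  # "complete" or "failed" (assumed values)
    retries: int = 0
    files_modified: List[str] = field(default_factory=list)
    reviewer_summary: str = ""


def render_report(outcomes: List[TaskOutcome]) -> str:
    """Render a plain-text summary grouped into completed and failed tasks."""
    lines = ["NightShift run summary", ""]
    for label, status in (("Completed tasks", "complete"), ("Failed tasks", "failed")):
        group = [o for o in outcomes if o.status == status]
        lines.append(f"{label}: {len(group)}")
        for o in group:
            files = ", ".join(o.files_modified) or "none"
            lines.append(f"- {o.task_id} (retries: {o.retries}; files: {files})")
            if o.reviewer_summary:
                lines.append(f"  Reviewer: {o.reviewer_summary}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

The rendered text would be a natural candidate for the `run-summary.md` artifact.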
14. Future Directions
Potential future features:
- Parallel task execution
- DAG workflows
- Distributed workers
- Sandboxed containers
- Git branch isolation
- Agent tournaments
- Constraint language experimentation
- Prompt A/B testing
- Semantic memory systems
- Multi-repo orchestration
- Web dashboard
- Cost telemetry
- Human approval gates
15. Risks
15.1 Context poisoning
Mitigation:
- Context compaction
- Retry summarization
- Structured stage boundaries
15.2 Agent loops
Mitigation:
- Explicit retry counts
- Deterministic transitions
- Timeout handling
15.3 Repository damage
Mitigation:
- Scoped directories
- Command restrictions
- Validation stages
15.4 Cost explosion
Mitigation:
- Local-first execution
- Context minimization
- Escalation-only expensive models
16. MVP Definition
The minimum viable NightShift implementation should:
- Parse markdown tasks
- Execute a declarative pipeline
- Support local agents
- Generate plans
- Generate implementations
- Run tests
- Run static analysis
- Run review agents
- Retry failed stages
- Produce artifacts
- Produce an overnight summary
- Restrict repository access
This MVP is sufficient to:
- Demonstrate orchestration architecture
- Demonstrate AI pipeline engineering
- Demonstrate safety-aware automation
- Serve as a strong portfolio project
Appendix A: Design Decisions and Rationale
A.1 Local-first architecture
Decision:
- Prefer local models and local execution
Reasoning:
- Cheapness-first design
- Better experimentation
- Better privacy
- Reduced vendor dependency
- Better overnight scalability
A.2 State machine over DAG
Decision:
- Use configurable state-machine workflows
Reasoning:
- One-task-at-a-time execution
- Retry loops are primary workflow behavior
- Easier auditing
- Easier debugging
- Simpler MVP
A.3 YAML configuration
Decision:
- Use declarative YAML config
Reasoning:
- Human-readable
- Easier nested workflow representation
- Safer than arbitrary Python
- Better portability
A.4 Cheapness-first model routing
Decision:
- Use expensive models selectively
Reasoning:
- Overnight pipelines can become token-expensive
- Local models are sufficient for many stages
- Review stages benefit more from premium models
A.5 Strict repository scoping
Decision:
- Limit writable paths and executable commands
Reasoning:
- Prevent accidental damage
- Maintain trust in unattended execution
- Improve auditability
A.6 Reviewable output over autonomy
Decision:
- Produce review packages rather than autonomous shipping
Reasoning:
- Human review remains critical
- Improves safety
- Improves correctness
- Keeps architecture grounded and practical
A.7 Layered context model
Decision:
- Separate project, task, and retry context
Reasoning:
- Reduces token usage
- Prevents context explosion
- Improves signal quality
- Prevents recursive drift
A.8 Artifact-heavy architecture
Decision:
- Persist plans, logs, reviews, outputs, and summaries
Reasoning:
- Debugging
- Prompt experimentation
- A/B testing
- Replayability
- Portfolio visibility
A.9 No parallelism in v1
Decision:
- Execute one task at a time
Reasoning:
- Simpler correctness model
- Easier debugging
- Easier repository safety
- Easier context management
A.10 Declarative pipelines first
Decision:
- No arbitrary Python hooks in v1
Reasoning:
- Safer execution
- Easier reproducibility
- Easier auditing
- Easier portability
Closing Statement
NightShift is intended to explore a practical middle ground between:
- Fully manual software engineering
- Reckless autonomous agent systems
The system assumes that AI agents are useful but unreliable.
NightShift therefore treats agents as bounded workers inside deterministic, auditable, test-driven workflows.
The primary output is not blind autonomy.
The primary output is trustworthy leverage.