Reliability & recovery patterns
Production reliability mechanisms that keep the autonomous pipeline healthy. Covers failure classification, model escalation, circuit breakers, health sweeps, escalation routing, PR remediation, trajectory storage, and per-feature reflections.
Failure classification
When an agent execution fails, RecoveryService.analyzeFailure() categorizes the error and determines a recovery strategy.
Failure categories
| Category | Examples | Default strategy |
|---|---|---|
| transient | Network timeout, DNS failure, socket hang up | Retry with exponential backoff |
| rate_limit | API throttle (429), quota warning | Pause and wait (5s base delay) |
| quota | Monthly usage cap, spending limit | Escalate to user |
| validation | Invalid input, schema mismatch | Escalate to user |
| tool_error | Bash command failed, file not found | Alternative approach |
| test_failure | Unit test failure, build error | Retry with error context |
| merge_conflict | Git conflict on rebase | Escalate to user |
| dependency | Missing npm package, unresolved import | Retry with context |
| authentication | API key expired, token revoked | Escalate to user |
| unknown | Unclassified error | Escalate to user |
Recovery strategies
Six strategies, applied based on category:
- retry — Simple retry with delay (transient errors)
- retry_with_context — Retry with previous error output injected into the agent prompt (test failures, dependency issues)
- alternative_approach — Try a different tool or command (tool errors)
- rollback_and_retry — Clear changes, start fresh (corrupted state)
- pause_and_wait — Hold for API recovery (rate limits)
- escalate_to_user — Emit recovery_escalated event, stop retrying (terminal)
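The category-to-strategy defaults from the table above can be read as a simple lookup. The sketch below is one assumed shape for that mapping, not the actual RecoveryService code; `DEFAULT_STRATEGY` and `defaultStrategyFor` are hypothetical names introduced for illustration.

```typescript
// Category and strategy names come from the tables above.
// The lookup table itself is an assumed shape, not the real implementation.
type FailureCategory =
  | 'transient' | 'rate_limit' | 'quota' | 'validation' | 'tool_error'
  | 'test_failure' | 'merge_conflict' | 'dependency' | 'authentication' | 'unknown';

type RecoveryStrategy =
  | 'retry' | 'retry_with_context' | 'alternative_approach'
  | 'rollback_and_retry' | 'pause_and_wait' | 'escalate_to_user';

const DEFAULT_STRATEGY: Record<FailureCategory, RecoveryStrategy> = {
  transient: 'retry',
  rate_limit: 'pause_and_wait',
  quota: 'escalate_to_user',
  validation: 'escalate_to_user',
  tool_error: 'alternative_approach',
  test_failure: 'retry_with_context',
  merge_conflict: 'escalate_to_user',
  dependency: 'retry_with_context',
  authentication: 'escalate_to_user',
  unknown: 'escalate_to_user',
};

// Hypothetical helper: returns the default strategy for a category.
function defaultStrategyFor(category: FailureCategory): RecoveryStrategy {
  return DEFAULT_STRATEGY[category];
}
```

Note that rollback_and_retry is not the default for any category in the table; it is listed for corrupted state, so a real implementation presumably selects it from additional signals.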
Exponential backoff
Transient retries use exponential backoff: base × 2^retryCount, capped at maxDelay.
Agent-level backoff (RecoveryService):
| Parameter | Value |
|---|---|
| Base delay | 1,000 ms |
| Max delay | 30,000 ms |
| Max transient retries | 3 |
| Max test failure retries | 2 |
| Rate limit base delay | 5,000 ms |
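As a worked example of the formula base × 2^retryCount capped at maxDelay, here is a minimal sketch using the agent-level values from the table. `backoffDelayMs` and `BackoffConfig` are hypothetical names, and the assumption that retryCount starts at 0 is mine.

```typescript
interface BackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
}

// Agent-level values from the table above.
const AGENT_BACKOFF: BackoffConfig = { baseDelayMs: 1_000, maxDelayMs: 30_000 };

// Delay before the next retry: base * 2^retryCount, never more than maxDelayMs.
// Assumes retryCount is 0 for the first retry.
function backoffDelayMs(retryCount: number, config: BackoffConfig = AGENT_BACKOFF): number {
  return Math.min(config.baseDelayMs * 2 ** retryCount, config.maxDelayMs);
}
```

Under these assumptions the first three retries wait 1,000 ms, 2,000 ms, and 4,000 ms; the 30,000 ms cap only takes effect from retryCount 5 onward.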
Git workflow backoff (git-workflow-service):
| Parameter | Value |
|---|---|
| Base delay | 2,000 ms |
| Max retries | 3 |
| Backoff | 2s → 4s → 8s |
| Applies to | git push, gh pr create operations |
The retryWithExponentialBackoff<T>() helper in git-workflow-service.ts wraps push and PR creation calls. This prevents transient GitHub/network errors from causing silent git workflow failures.
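A generic retry wrapper of this kind might look like the sketch below. It is not the actual retryWithExponentialBackoff<T>() from git-workflow-service.ts; the name `retryWithBackoffSketch`, its parameters, and the injectable `sleep` function are assumptions made so the example stays self-contained and testable.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: runs `operation`, retrying up to `maxRetries` times
// with delays of base, 2*base, 4*base, ... between attempts.
async function retryWithBackoffSketch<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 2_000,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // no delay after the final attempt
      await sleep(baseDelayMs * 2 ** attempt); // 2s, 4s, 8s with the defaults
    }
  }
  // Rethrow so the caller sees the failure instead of a silent no-op.
  throw lastError;
}
```

Rethrowing the last error after the final attempt is what keeps a persistent failure from being swallowed; the caller can then record or escalate it.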
Source: apps/server/src/services/git-workflow-service.ts
Lesson generation
After 3+ failures of the same category for a project, RecoveryService.checkAndGenerateLessons() writes a guidance context file to .automaker/context/failure-lessons-{category}.md. Future agents automatically receive this guidance via the context loading system.
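The threshold and file-naming rule can be sketched as a pure function. `lessonFileFor` and `LESSON_THRESHOLD` are hypothetical names; only the threshold of 3 and the path pattern come from the text above.

```typescript
const LESSON_THRESHOLD = 3;

// Returns the lesson file path once a category has failed often enough,
// or undefined while the count is still below the threshold.
function lessonFileFor(category: string, failureCount: number): string | undefined {
  if (failureCount < LESSON_THRESHOLD) return undefined;
  return `.automaker/context/failure-lessons-${category}.md`;
}
```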
Source: apps/server/src/services/recovery-service.ts
Model auto-escalation
The model tier isn't fixed for a feature's lifetime. The escalation chain:
Haiku → Sonnet → Opus → ESCALATE (human)
When escalation triggers
- Feature fails 2+ times at the current tier
- Test failures persist after retry with context
- Agent hits turn limit without completing
How it works
The Lead Engineer state machine tracks failureCount per feature. On the 2nd+ failure:
- Feature enters ESCALATE state, FailureClassifierService categorizes the error
- INTAKE phase on retry selects the next model tier (Haiku → Sonnet, Sonnet → Opus)
- Feature retries with the higher-capability model
- If Opus also fails → stays in ESCALATE, human intervention required
- FeatureScheduler circuit breaker pauses auto-mode after 3 consecutive failures
This captures the human pattern: "This is harder than I thought, let me think more carefully."
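The tier selection described above can be sketched as a small pure function. This is an illustration under assumptions: `nextStepAfterFailure`, `ModelTier`, and `NextStep` are hypothetical names, and whether failureCount resets after moving to a new tier is not stated in this document.

```typescript
type ModelTier = 'haiku' | 'sonnet' | 'opus';

const ESCALATION_CHAIN: readonly ModelTier[] = ['haiku', 'sonnet', 'opus'];
const FAILURES_BEFORE_ESCALATION = 2;

type NextStep =
  | { kind: 'retry'; tier: ModelTier }
  | { kind: 'human' };

// Decide what happens after a failure at `currentTier`.
// Below the threshold: retry at the same tier.
// At or above it: move one tier up, or hand off to a human after Opus.
function nextStepAfterFailure(currentTier: ModelTier, failureCount: number): NextStep {
  if (failureCount < FAILURES_BEFORE_ESCALATION) {
    return { kind: 'retry', tier: currentTier };
  }
  const next = ESCALATION_CHAIN[ESCALATION_CHAIN.indexOf(currentTier) + 1];
  return next ? { kind: 'retry', tier: next } : { kind: 'human' };
}
```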
Circuit breaker
The auto-mode orchestration loop includes a circuit breaker that prevents cascading failures.
Behavior
| Parameter | Value |
|---|---|
| Failure threshold | 2 failures in 60 seconds |
| Action | Pause auto-mode |
| Resume after | 5 minutes (automatic) |
When 2 features fail within a 60-second window, auto-mode pauses. This prevents burning API credits on a systemic issue (e.g., API outage, broken build on main).
After 5 minutes, auto-mode resumes automatically. If the issue persists, the circuit breaker trips again.
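A sliding-window breaker with these parameters could be sketched as follows. The class name `SlidingWindowBreaker` and its methods are hypothetical; timestamps are passed in explicitly so the behavior is easy to test.

```typescript
// Sketch only: trips after `threshold` failures inside `windowMs`,
// then reports "paused" until `cooldownMs` has elapsed.
class SlidingWindowBreaker {
  private failures: number[] = [];
  private pausedUntil = 0;
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;

  constructor(threshold: number = 2, windowMs: number = 60_000, cooldownMs: number = 5 * 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  recordFailure(now: number): void {
    // Keep only failures that are still inside the window, then add this one.
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.threshold) {
      this.pausedUntil = now + this.cooldownMs;
      this.failures = []; // start counting afresh after the pause
    }
  }

  isPaused(now: number): boolean {
    return now < this.pausedUntil;
  }
}
```

In a tick loop, the scheduler would check `isPaused(Date.now())` before starting new features. Because the pause simply expires, a persistent issue trips the breaker again on the next two failures.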
Integration
The circuit breaker is evaluated in the auto-mode tick loop, not in the Lead Engineer. The orchestration loop is the scheduler; the state machine is the executor.
Health sweep
Every ~100 seconds (50 iterations at a 2-second interval), the auto-mode loop runs FeatureHealthService.audit() with auto-fix enabled. This catches structural drift on the board.
Issue types
| Issue type | Detection | Auto-fix |
|---|---|---|
| orphaned_epic_ref | Feature references non-existent or non-epic parent | Clear epicId reference |
| dangling_dependency | Feature depends on deleted features | Remove non-existent dep IDs |
| epic_children_done | All child features done, but epic still in-progress | Set epic status to done |
| stale_running | Feature marked in_progress with no active agent | Reset to backlog |
| stale_gate | Feature awaiting pipeline gate for >1 hour | Move to blocked |
| merged_not_done | Branch merged to main but feature not marked done | Set status to done |
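Two of these checks, orphaned_epic_ref and dangling_dependency, need only the board data, so they are easy to sketch. The `BoardFeature` shape (including the `isEpic` and `dependencies` fields) and the `findStructuralIssues` function are assumptions for illustration, not the FeatureHealthService API.

```typescript
// Assumed minimal feature shape; the real type likely has more fields.
interface BoardFeature {
  id: string;
  status: string;
  isEpic?: boolean;
  epicId?: string;
  dependencies?: string[];
}

interface HealthIssue {
  type: 'orphaned_epic_ref' | 'dangling_dependency';
  featureId: string;
  detail: string; // the missing or invalid ID
}

function findStructuralIssues(features: BoardFeature[]): HealthIssue[] {
  const byId = new Map<string, BoardFeature>();
  for (const feature of features) byId.set(feature.id, feature);

  const issues: HealthIssue[] = [];
  for (const feature of features) {
    // Parent must exist and must itself be an epic.
    if (feature.epicId !== undefined) {
      const parent = byId.get(feature.epicId);
      if (!parent || !parent.isEpic) {
        issues.push({ type: 'orphaned_epic_ref', featureId: feature.id, detail: feature.epicId });
      }
    }
    // Every dependency ID must still exist on the board.
    for (const dep of feature.dependencies ?? []) {
      if (!byId.has(dep)) {
        issues.push({ type: 'dangling_dependency', featureId: feature.id, detail: dep });
      }
    }
  }
  return issues;
}
```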
How it works
const report = await featureHealthService.audit(projectPath, true); // autoFix=true
// report.issues — all detected problems
// report.fixed — problems that were auto-corrected
Each detected issue emits an escalation:signal-received event with a deduplication key, so the escalation router can alert without flooding.
Safety
- Uses execFileAsync (not shell) for git operations, which prevents injection
- Detects both main and master as default branches
- Caches epic branch --merged results to reduce git calls
Source: apps/server/src/services/feature-health-service.ts
Escalation router
When recovery fails or the health sweep finds issues it cannot fix, the resulting signals are routed to notification channels via EscalationRouter.
Signal flow
Recovery failure / Health issue / Lead Engineer escalation
↓
EscalationRouter.routeSignal(signal)
├── Deduplication check (30-min window)
│ └── Duplicate? → emit 'escalation:signal-deduplicated', skip
├── Severity filter
│ └── Low severity? → log only, no routing
├── Per-channel rate limit check
│ └── Rate limited? → add to rateLimited list, skip channel
└── Send to matching channels
    └── emit 'escalation:signal-sent' per channel
Signal severity
| Severity | Behavior |
|---|---|
| low | Logged only, not routed to channels |
| medium | Routed to matching channels |
| high | Routed to all matching channels |
| critical | Routed to all channels, bypasses rate limits |
Deduplication
Signals carry a deduplicationKey (e.g., "escalation:feature-123:test-failure"). If the same key was seen within the last 30 minutes, the signal is deduplicated — logged but not re-routed.
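A time-windowed deduplication check can be sketched with a map from key to last-seen timestamp. `DedupWindow` is a hypothetical class; one design assumption here is that a duplicate does not extend the window, so a signal is re-routed 30 minutes after it was first routed.

```typescript
class DedupWindow {
  private readonly lastRouted = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number = 30 * 60_000) {
    this.windowMs = windowMs;
  }

  // Returns true if `key` was already routed inside the window.
  // Otherwise records `now` as the routing time and returns false.
  isDuplicate(key: string, now: number): boolean {
    const previous = this.lastRouted.get(key);
    if (previous !== undefined && now - previous < this.windowMs) {
      return true;
    }
    this.lastRouted.set(key, now);
    return false;
  }

  // Forget a key so the next signal with it is routed immediately.
  clear(key: string): void {
    this.lastRouted.delete(key);
  }
}
```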
Rate limiting
Each channel can define a rate limit:
interface EscalationChannel {
  name: string;
  canHandle(signal: EscalationSignal): boolean;
  send(signal: EscalationSignal): Promise<void>;
  rateLimit?: { maxSignals: number; windowMs: number };
}
Example: Discord might limit to 5 signals per hour. The router tracks per-channel counters and skips channels that exceed their limit.
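Per-channel counting of this kind can be sketched as below. `ChannelRateLimiter` and `tryAcquire` are hypothetical names, and the sliding-window approach is an assumption; the actual router may use fixed windows or a different counter.

```typescript
interface RateLimit {
  maxSignals: number;
  windowMs: number;
}

class ChannelRateLimiter {
  private readonly sentAt = new Map<string, number[]>();

  // Returns true and records the send if the channel is under its limit.
  // Channels without a configured limit are always allowed.
  tryAcquire(channel: string, limit: RateLimit | undefined, now: number): boolean {
    if (!limit) return true;
    const recent = (this.sentAt.get(channel) ?? []).filter((t) => now - t < limit.windowMs);
    if (recent.length >= limit.maxSignals) {
      this.sentAt.set(channel, recent);
      return false;
    }
    recent.push(now);
    this.sentAt.set(channel, recent);
    return true;
  }
}
```

A router following the severity table would skip this check entirely for critical signals.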
Acknowledgment
Signals can be acknowledged via acknowledgeSignal(deduplicationKey, acknowledgedBy, notes?, clearDedup?). This marks the signal as handled in the escalation log and optionally clears the deduplication window.
Audit log
The router maintains a log of up to 1,000 entries (most recent first). Each entry records:
- The signal and its severity
- Which channels received it
- Whether it was deduplicated or rate-limited
- Acknowledgment status
Source: apps/server/src/services/escalation-router.ts
PR remediation loop
When a PR fails CI or receives review feedback, the system enters a remediation loop.
Flow
PR created → CI runs + CodeRabbit reviews
├── CI passes + approved → MERGE
├── CI fails → extract failure context → back to EXECUTE
├── changes_requested → collect feedback → send to agent for fixes
└── Max retries exceeded → ESCALATE
Limits
| Parameter | Value |
|---|---|
| Max CI retry cycles | 2 (back to EXECUTE with failure context) |
| Max feedback cycles | 2 (agent addresses reviewer comments) |
| Max total remediation | 4 cycles before escalation |
| PR poll interval | 60 seconds |
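The limits above can be expressed as a transition function. This sketch is an assumption about how the counters combine; `nextRemediationState`, the `PrEvent` values, and the `REMEDIATE_FEEDBACK` label are hypothetical names, while EXECUTE, MERGE, and ESCALATE come from the flow above.

```typescript
type PrEvent = 'ci_failed' | 'changes_requested' | 'approved_and_green';
type NextState = 'EXECUTE' | 'REMEDIATE_FEEDBACK' | 'MERGE' | 'ESCALATE';

interface RemediationCounters {
  ciCycles: number;       // CI retry cycles already used
  feedbackCycles: number; // review feedback cycles already used
}

const MAX_CI_CYCLES = 2;
const MAX_FEEDBACK_CYCLES = 2;
const MAX_TOTAL_CYCLES = 4;

function nextRemediationState(event: PrEvent, used: RemediationCounters): NextState {
  if (event === 'approved_and_green') return 'MERGE';
  // The total budget is checked first, then the per-kind budget.
  if (used.ciCycles + used.feedbackCycles >= MAX_TOTAL_CYCLES) return 'ESCALATE';
  if (event === 'ci_failed') {
    return used.ciCycles < MAX_CI_CYCLES ? 'EXECUTE' : 'ESCALATE';
  }
  return used.feedbackCycles < MAX_FEEDBACK_CYCLES ? 'REMEDIATE_FEEDBACK' : 'ESCALATE';
}
```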
How feedback flows
- PRFeedbackService polls GitHub every 60 seconds for new review activity
- On changes_requested: feedback is collected and sent to the agent
- The agent addresses feedback in the worktree and pushes
- CI re-runs, CodeRabbit re-reviews
- On approved + CI passing → MERGE
Together, these steps let the remediation loop resolve CI failures and review feedback without human involvement until a retry limit is reached.
Trajectory store
TrajectoryStoreService persists verified execution trajectories for learning.
Storage
.automaker/trajectory/{featureId}/attempt-{N}.json
Each trajectory records:
- Feature metadata (ID, title, complexity)
- Execution outcome (success/failure)
- Key decisions the agent made
- Recovery strategies that worked
- Failure patterns encountered
- Duration and token usage
Non-blocking writes
Trajectory writes are fire-and-forget. They never block the agent execution loop. If the write fails (disk full, permissions), the feature still completes normally.
Sibling reflections
When a feature enters EXECUTE, the Lead Engineer loads trajectories from recently completed sibling features:
const siblings = features
  .filter((f) => f.status === 'verified' && f.lastExecutionTime)
  .sort((a, b) => (b.lastExecutionTime || 0) - (a.lastExecutionTime || 0))
  .slice(0, 3); // max 3 reflections
Sibling matching: Same epicId (if in an epic) or same projectSlug (if standalone).
These reflections are injected into the agent's context as "Lessons from Similar Features" (max ~500 tokens), giving each agent the benefit of what prior agents learned.
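Combining the sibling-matching rule with the selection logic gives something like the sketch below. The `FeatureSummary` shape, `isSibling`, and `selectReflectionSources` are hypothetical, and treating lastExecutionTime as a numeric timestamp is an assumption.

```typescript
interface FeatureSummary {
  id: string;
  status: string;
  epicId?: string;
  projectSlug?: string;
  lastExecutionTime?: number; // assumed to be a numeric timestamp
}

// Same epic if the current feature is in one, otherwise same project.
function isSibling(current: FeatureSummary, other: FeatureSummary): boolean {
  if (other.id === current.id) return false;
  if (current.epicId !== undefined) return other.epicId === current.epicId;
  return current.projectSlug !== undefined && other.projectSlug === current.projectSlug;
}

// Most recently executed verified siblings first, at most `max` of them.
function selectReflectionSources(
  current: FeatureSummary,
  all: FeatureSummary[],
  max: number = 3,
): FeatureSummary[] {
  return all
    .filter((f) => isSibling(current, f) && f.status === 'verified' && f.lastExecutionTime)
    .sort((a, b) => (b.lastExecutionTime || 0) - (a.lastExecutionTime || 0))
    .slice(0, max);
}
```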
Source: apps/server/src/services/trajectory-store-service.ts
Per-feature reflection loop
After each feature reaches DONE, a lightweight reflection is generated.
How it works
- DeployProcessor.generateReflection() fires non-blocking after marking a feature done
- Reads the tail of agent-output.md (last 2,000 chars) plus execution metadata
- Calls simpleQuery() with Haiku (maxTurns: 1, no tools) to produce a structured reflection under 200 words
- Writes result to .automaker/features/{id}/reflection.md
- Emits feature:reflection:complete event
Feed-forward
Reflections from completed siblings are loaded during EXECUTE (see Trajectory Store above). This creates an in-project learning loop — each feature benefits from the last.
Cost
~$0.001 per reflection (Haiku, single turn, no tools). Fire-and-forget — failure does not block the state machine.
Observability
Reflection LLM calls are traced in Langfuse with:
- Tags: feature:{id}, role:reflection
- Metadata: featureId, featureName, agentRole: 'reflection'
FailureClassifierService
Pattern-matches escalation reason strings to structured failure categories and recovery strategies.
Purpose
When the Lead Engineer's ESCALATE state receives an escalation reason string (e.g., "Rate limit exceeded", "Tests failed after 3 retries"), the classifier maps it to a FailureCategory and suggests a RecoveryStrategy.
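Pattern matching of this kind is often an ordered list of regular expressions where the first match wins. The rules below are illustrative guesses based on the category examples in this document; they are not the patterns FailureClassifierService actually uses, and `classifyReason` is a hypothetical name.

```typescript
type FailureCategory =
  | 'transient' | 'rate_limit' | 'quota' | 'validation' | 'tool_error'
  | 'test_failure' | 'merge_conflict' | 'dependency' | 'authentication' | 'unknown';

interface ClassifierRule {
  pattern: RegExp;
  category: FailureCategory;
}

// Ordered from most to least specific; the first matching rule wins.
// All patterns are assumptions made for this sketch.
const RULES: ClassifierRule[] = [
  { pattern: /rate limit|429|throttl/i, category: 'rate_limit' },
  { pattern: /quota|spending limit|usage cap/i, category: 'quota' },
  { pattern: /api key|token (expired|revoked)/i, category: 'authentication' },
  { pattern: /merge conflict|conflict on rebase/i, category: 'merge_conflict' },
  { pattern: /tests? failed|build error/i, category: 'test_failure' },
  { pattern: /timeout|timed out|socket hang up|dns/i, category: 'transient' },
];

function classifyReason(reason: string): FailureCategory {
  for (const rule of RULES) {
    if (rule.pattern.test(reason)) return rule.category;
  }
  return 'unknown';
}
```

Falling back to unknown keeps unrecognized reasons on the escalate-to-user path rather than retrying blindly.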
Integration
Called by EscalateProcessor.process() in the Lead Engineer state machine. The classified category determines:
- Whether to retry or escalate
- Which model tier to use on retry
- What context to inject into the agent prompt
Source: apps/server/src/services/failure-classifier-service.ts
Crash recovery scan
On server startup, a non-blocking worktree scan detects stranded work from crashed agent sessions.
How it works
After resumeInterruptedFeatures() completes, scanWorktreesForCrashRecovery() runs via setImmediate():
- Lists all worktrees via git worktree list --porcelain
- Cross-references each worktree with its feature status
- For features in verified/done with uncommitted or unpushed work:
  - Commits stranded changes via ensureCleanWorktree()
  - Pushes to remote
  - Triggers runPostCompletionWorkflow() (PR creation)
- For features in other states with stranded work: logs a warning
- Emits maintenance:crash_recovery_scan_completed with summary
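The first step depends on parsing the porcelain output, which lists one attribute per line (`worktree <path>`, `HEAD <sha>`, `branch <ref>`) with a blank line between worktrees. The parser below is a minimal sketch of that step; `parseWorktreePorcelain` and `WorktreeEntry` are hypothetical names, and attributes such as `bare`, `detached`, or `locked` are simply ignored here.

```typescript
interface WorktreeEntry {
  path: string;
  head?: string;
  branch?: string; // short branch name; undefined for detached or bare worktrees
}

function parseWorktreePorcelain(output: string): WorktreeEntry[] {
  const entries: WorktreeEntry[] = [];
  let current: WorktreeEntry | undefined;

  for (const line of output.split(/\r?\n/)) {
    if (line.startsWith('worktree ')) {
      // A "worktree" line starts a new record.
      current = { path: line.slice('worktree '.length) };
      entries.push(current);
    } else if (current && line.startsWith('HEAD ')) {
      current.head = line.slice('HEAD '.length);
    } else if (current && line.startsWith('branch ')) {
      current.branch = line.slice('branch '.length).replace(/^refs\/heads\//, '');
    } else if (line.trim() === '') {
      // A blank line ends the current record.
      current = undefined;
    }
  }
  return entries;
}
```

Each entry's branch can then be matched against the feature that owns it to decide whether the worktree holds stranded work.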
When it triggers
- Server startup only (not a cron task)
- Non-blocking — does not delay server initialization or request handling
- Fire-and-forget — scan failures are logged but don't crash the server
Source: apps/server/src/services/maintenance-tasks.ts
Git workflow error surfacing
When git operations (commit, push, PR creation) fail after agent execution, the error is stored on the feature for UI visibility.
The gitWorkflowError field
interface Feature {
  gitWorkflowError?: {
    message: string; // Error description
    timestamp: string; // ISO 8601 when the error occurred
  };
}
All 4 git workflow catch blocks in auto-mode-service.ts persist errors to feature.json instead of silently logging them. Feature status remains unchanged (e.g., stays verified); the error field provides a separate visibility channel.
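What a catch block might do with a caught error can be sketched as a pure helper that attaches the field without touching the status. `withGitWorkflowError` and the reduced `FeatureRecord` type are hypothetical; the real catch blocks also persist the result to feature.json.

```typescript
interface GitWorkflowError {
  message: string;   // Error description
  timestamp: string; // ISO 8601 when the error occurred
}

// Reduced feature shape for illustration only.
interface FeatureRecord {
  id: string;
  status: string;
  gitWorkflowError?: GitWorkflowError;
}

// Returns a copy of the feature with the error recorded; status is left as-is.
function withGitWorkflowError(
  feature: FeatureRecord,
  error: unknown,
  now: Date = new Date(),
): FeatureRecord {
  const message = error instanceof Error ? error.message : String(error);
  return { ...feature, gitWorkflowError: { message, timestamp: now.toISOString() } };
}
```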
Source: libs/types/src/feature.ts, apps/server/src/services/auto-mode-service.ts
Event-driven observability
All reliability services emit events for real-time UI updates and audit logging:
| Service | Event prefix | Key events |
|---|---|---|
| RecoveryService | recovery_* | analysis, started, completed, recorded, escalated, lesson_generated |
| EscalationRouter | escalation:* | signal-received, deduplicated, sent, failed, routed, acknowledged |
| FeatureHealthService | (via auto-mode) | Issues surface through escalation events |
| AutoModeService | feature:* | status-changed, completed, error |
| Lead Engineer | feature:* | reflection:complete, pr-merged, state-changed |
| Maintenance | maintenance:* | crash_recovery_scan_completed |
Recovery architecture diagram
Feature Execution Fails
↓
RecoveryService.analyzeFailure()
├── categorizeFailure() → FailureCategory
├── determineStrategy() → RecoveryStrategy
├── recordRecoveryAttempt() → JSONL log + emit events
└── checkAndGenerateLessons() (after 3+ failures)
└── Write failure-lessons-{category}.md to context/
↓
Recovery result: { success, shouldRetry, actionTaken }
├── If retryable → AutoModeService.retry() with injected context
└── If escalate → EscalationRouter.routeSignal()
├── Dedup check (30-min window)
├── Rate limit check (per-channel)
└── Send to registered channels
└── EscalationLogEntry recorded for audit trail
↓
(Parallel) FeatureHealthService.audit()
└── Check for drift: orphaned refs, stale running, merged branches
└── Auto-fix if enabled → Update feature status
↓
Lead Engineer State Machine
├── [EXECUTE] Load sibling reflections from trajectory store
├── [REMEDIATION] Inject failure context + review feedback
└── [ESCALATE] FailureClassifierService maps reason → category
Related documentation
- Agent Philosophy — Why the system is designed this way
- Idea to Production — Full pipeline with escalation points
- Langfuse Integration — Tracing and cost tracking