Reliability & recovery patterns

Production reliability mechanisms that keep the autonomous pipeline healthy. Covers failure classification, model escalation, circuit breakers, health sweeps, escalation routing, PR remediation, trajectory storage, and per-feature reflections.

Failure classification

When an agent execution fails, RecoveryService.analyzeFailure() categorizes the error and determines a recovery strategy.

Failure categories

| Category | Examples | Default strategy |
| --- | --- | --- |
| transient | Network timeout, DNS failure, socket hang up | Retry with exponential backoff |
| rate_limit | API throttle (429), quota warning | Pause and wait (5s base delay) |
| quota | Monthly usage cap, spending limit | Escalate to user |
| validation | Invalid input, schema mismatch | Escalate to user |
| tool_error | Bash command failed, file not found | Alternative approach |
| test_failure | Unit test failure, build error | Retry with error context |
| merge_conflict | Git conflict on rebase | Escalate to user |
| dependency | Missing npm package, unresolved import | Retry with context |
| authentication | API key expired, token revoked | Escalate to user |
| unknown | Unclassified error | Escalate to user |

Recovery strategies

Six strategies, applied based on category:

  1. retry — Simple retry with delay (transient errors)
  2. retry_with_context — Retry with previous error output injected into the agent prompt (test failures, dependency issues)
  3. alternative_approach — Try a different tool or command (tool errors)
  4. rollback_and_retry — Clear changes, start fresh (corrupted state)
  5. pause_and_wait — Hold for API recovery (rate limits)
  6. escalate_to_user — Emit recovery_escalated event, stop retrying (terminal)
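The category-to-strategy mapping above can be sketched as a simple lookup table. This is a minimal illustration, not the actual `RecoveryService` implementation; the type and constant names are assumptions:

```typescript
// Illustrative sketch of the default category → strategy mapping described above.
type FailureCategory =
  | 'transient' | 'rate_limit' | 'quota' | 'validation' | 'tool_error'
  | 'test_failure' | 'merge_conflict' | 'dependency' | 'authentication' | 'unknown';

type RecoveryStrategy =
  | 'retry' | 'retry_with_context' | 'alternative_approach'
  | 'rollback_and_retry' | 'pause_and_wait' | 'escalate_to_user';

const DEFAULT_STRATEGY: Record<FailureCategory, RecoveryStrategy> = {
  transient: 'retry',
  rate_limit: 'pause_and_wait',
  quota: 'escalate_to_user',
  validation: 'escalate_to_user',
  tool_error: 'alternative_approach',
  test_failure: 'retry_with_context',
  merge_conflict: 'escalate_to_user',
  dependency: 'retry_with_context',
  authentication: 'escalate_to_user',
  unknown: 'escalate_to_user',
};

function determineStrategy(category: FailureCategory): RecoveryStrategy {
  return DEFAULT_STRATEGY[category];
}
```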

Exponential backoff

Transient retries use exponential backoff: base × 2^retryCount, capped at maxDelay.

Agent-level backoff (RecoveryService):

| Parameter | Value |
| --- | --- |
| Base delay | 1,000 ms |
| Max delay | 30,000 ms |
| Max transient retries | 3 |
| Max test failure retries | 2 |
| Rate limit base delay | 5,000 ms |
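With the parameters above, the delay formula works out as follows (a one-line sketch; the function name is an assumption, not the actual helper):

```typescript
// base × 2^retryCount, capped at maxDelay — agent-level defaults from the table above.
function backoffDelay(retryCount: number, baseMs = 1_000, maxMs = 30_000): number {
  return Math.min(baseMs * 2 ** retryCount, maxMs);
}
// Retries 0..4 wait 1s, 2s, 4s, 8s, 16s; anything longer is capped at 30s.
```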

Git workflow backoff (git-workflow-service):

| Parameter | Value |
| --- | --- |
| Base delay | 2,000 ms |
| Max retries | 3 |
| Backoff | 2s → 4s → 8s |
| Applies to | git push, gh pr create operations |

The retryWithExponentialBackoff<T>() helper in git-workflow-service.ts wraps push and PR creation calls. This prevents transient GitHub/network errors from causing silent git workflow failures.
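A generic wrapper of this shape might look like the following. This is a minimal sketch assuming the signature described above, not the code in git-workflow-service.ts:

```typescript
// Sketch of a retryWithExponentialBackoff<T>()-style helper: retry an async
// operation with 2s → 4s → 8s delays, rethrowing the last error when exhausted.
async function retryWithExponentialBackoff<T>(
  op: () => Promise<T>,
  maxRetries = 3,
  baseMs = 2_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseMs * 2 ** attempt; // 2s, 4s, 8s for the defaults above
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```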

Source: apps/server/src/services/git-workflow-service.ts

Lesson generation

After 3+ failures of the same category for a project, RecoveryService.checkAndGenerateLessons() writes a guidance context file to .automaker/context/failure-lessons-{category}.md. Future agents automatically receive this guidance via the context loading system.

Source: apps/server/src/services/recovery-service.ts

Model auto-escalation

The model tier isn't fixed for a feature's lifetime. The escalation chain:

Haiku → Sonnet → Opus → ESCALATE (human)

When escalation triggers

  • Feature fails 2+ times at the current tier
  • Test failures persist after retry with context
  • Agent hits turn limit without completing

How it works

The Lead Engineer state machine tracks failureCount per feature. On the 2nd+ failure:

  1. Feature enters ESCALATE state, FailureClassifierService categorizes the error
  2. INTAKE phase on retry selects the next model tier (Haiku → Sonnet, Sonnet → Opus)
  3. Feature retries with the higher-capability model
  4. If Opus also fails → stays in ESCALATE, human intervention required
  5. FeatureScheduler circuit breaker pauses auto-mode after 3 consecutive failures

This captures the human pattern: "This is harder than I thought, let me think more carefully."
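The tier selection in step 2 can be sketched as a walk up an ordered list. The lowercase tier identifiers here are illustrative assumptions:

```typescript
// Illustrative tier chain: haiku → sonnet → opus → ESCALATE (human).
const TIERS = ['haiku', 'sonnet', 'opus'] as const;
type Tier = (typeof TIERS)[number];

function nextTier(current: Tier): Tier | 'ESCALATE' {
  const i = TIERS.indexOf(current);
  // At the top of the chain (opus), there is no higher model: hand off to a human.
  return i < TIERS.length - 1 ? TIERS[i + 1] : 'ESCALATE';
}
```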

Circuit breaker

The auto-mode orchestration loop includes a circuit breaker that prevents cascading failures.

Behavior

| Parameter | Value |
| --- | --- |
| Failure threshold | 2 failures in 60 seconds |
| Action | Pause auto-mode |
| Resume after | 5 minutes (automatic) |

When 2 features fail within a 60-second window, auto-mode pauses. This prevents burning API credits on a systemic issue (e.g., API outage, broken build on main).

After 5 minutes, auto-mode resumes automatically. If the issue persists, the circuit breaker trips again.
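The trip-and-resume behavior can be sketched with a windowed failure counter. This is a hedged illustration of the mechanism, not the scheduler's actual class:

```typescript
// Sketch: trip after `threshold` failures inside `windowMs`, auto-resume after `cooldownMs`.
class CircuitBreaker {
  private failures: number[] = [];
  private trippedAt: number | null = null;

  constructor(
    private threshold = 2,        // 2 failures…
    private windowMs = 60_000,    // …within 60 seconds
    private cooldownMs = 300_000, // resume after 5 minutes
  ) {}

  recordFailure(now = Date.now()): void {
    this.failures.push(now);
    // Keep only failures inside the sliding window.
    this.failures = this.failures.filter((t) => now - t <= this.windowMs);
    if (this.failures.length >= this.threshold) this.trippedAt = now;
  }

  isOpen(now = Date.now()): boolean {
    if (this.trippedAt === null) return false;
    if (now - this.trippedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and let auto-mode resume.
      this.trippedAt = null;
      this.failures = [];
      return false;
    }
    return true;
  }
}
```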

Integration

The circuit breaker is evaluated in the auto-mode tick loop, not in the Lead Engineer. The orchestration loop is the scheduler; the state machine is the executor.

Health sweep

Every ~100 seconds (50 iterations at a 2-second interval), the auto-mode loop runs FeatureHealthService.audit() with auto-fix enabled. This catches structural drift on the board.

Issue types

| Issue type | Detection | Auto-fix |
| --- | --- | --- |
| orphaned_epic_ref | Feature references non-existent or non-epic parent | Clear epicId reference |
| dangling_dependency | Feature depends on deleted features | Remove non-existent dep IDs |
| epic_children_done | All child features done, but epic still in-progress | Set epic status to done |
| stale_running | Feature marked in_progress with no active agent | Reset to backlog |
| stale_gate | Feature awaiting pipeline gate for >1 hour | Move to blocked |
| merged_not_done | Branch merged to main but feature not marked done | Set status to done |

How it works

```typescript
const report = await featureHealthService.audit(projectPath, true); // autoFix=true

// report.issues — all detected problems
// report.fixed  — problems that were auto-corrected
```

Each detected issue emits an escalation:signal-received event with a deduplication key, so the escalation router can alert without flooding.

Safety

  • Uses execFileAsync (not shell) for git operations — prevents injection
  • Detects both main and master as default branches
  • Caches epic branch --merged results to reduce git calls

Source: apps/server/src/services/feature-health-service.ts

Escalation router

When recovery fails or health sweep finds unfixable issues, signals route to notification channels via EscalationRouter.

Signal flow

Recovery failure / Health issue / Lead Engineer escalation

EscalationRouter.routeSignal(signal)
    ├── Deduplication check (30-min window)
    │   └── Duplicate? → emit 'escalation:signal-deduplicated', skip
    ├── Severity filter
    │   └── Low severity? → log only, no routing
    ├── Per-channel rate limit check
    │   └── Rate limited? → add to rateLimited list, skip channel
    └── Send to matching channels
        └── emit 'escalation:signal-sent' per channel

Signal severity

| Severity | Behavior |
| --- | --- |
| low | Logged only, not routed to channels |
| medium | Routed to matching channels |
| high | Routed to all matching channels |
| critical | Routed to all channels, bypasses rate limits |

Deduplication

Signals carry a deduplicationKey (e.g., "escalation:feature-123:test-failure"). If the same key was seen within the last 30 minutes, the signal is deduplicated — logged but not re-routed.
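A 30-minute deduplication window amounts to a key → last-seen timestamp map. A minimal sketch of the idea (the class name is an assumption, not the router's internals):

```typescript
// Sketch: suppress signals whose deduplicationKey was seen within the window.
class Deduplicator {
  private seen = new Map<string, number>();

  constructor(private windowMs = 30 * 60_000) {} // 30-minute window

  isDuplicate(key: string, now = Date.now()): boolean {
    const last = this.seen.get(key);
    if (last !== undefined && now - last < this.windowMs) return true;
    this.seen.set(key, now); // new or expired key: record it and let it through
    return false;
  }
}
```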

Rate limiting

Each channel can define a rate limit:

```typescript
interface EscalationChannel {
  name: string;
  canHandle(signal: EscalationSignal): boolean;
  send(signal: EscalationSignal): Promise<void>;
  rateLimit?: { maxSignals: number; windowMs: number };
}
```

Example: Discord might limit to 5 signals per hour. The router tracks per-channel counters and skips channels that exceed their limit.
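The per-channel counter can be sketched as a sliding-window limiter. This is an illustration of the technique, not the router's actual bookkeeping:

```typescript
// Sketch: returns a function that admits at most `maxSignals` sends per `windowMs`.
function makeRateLimiter(maxSignals: number, windowMs: number) {
  const timestamps: number[] = [];
  return (now = Date.now()): boolean => {
    // Drop sends that have aged out of the window.
    while (timestamps.length > 0 && now - timestamps[0] > windowMs) {
      timestamps.shift();
    }
    if (timestamps.length >= maxSignals) return false; // rate limited: skip channel
    timestamps.push(now);
    return true;
  };
}
```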

Acknowledgment

Signals can be acknowledged via acknowledgeSignal(deduplicationKey, acknowledgedBy, notes?, clearDedup?). This marks the signal as handled in the escalation log and optionally clears the deduplication window.

Audit log

The router maintains a log of up to 1,000 entries (most recent first). Each entry records:

  • The signal and its severity
  • Which channels received it
  • Whether it was deduplicated or rate-limited
  • Acknowledgment status

Source: apps/server/src/services/escalation-router.ts

PR remediation loop

When a PR fails CI or receives review feedback, the system enters a remediation loop.

Flow

PR created → CI runs + CodeRabbit reviews
    ├── CI passes + approved → MERGE
    ├── CI fails → extract failure context → back to EXECUTE
    ├── changes_requested → collect feedback → send to agent for fixes
    └── Max retries exceeded → ESCALATE

Limits

| Parameter | Value |
| --- | --- |
| Max CI retry cycles | 2 (back to EXECUTE with failure context) |
| Max feedback cycles | 2 (agent addresses reviewer comments) |
| Max total remediation | 4 cycles before escalation |
| PR poll interval | 60 seconds |
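The budget checks from the table can be sketched as a small decision function. The state shape, event names, and action names here are illustrative assumptions, not the state machine's real API:

```typescript
// Sketch of the remediation budget: 2 CI retries, 2 feedback cycles, 4 total.
interface RemediationState {
  ciRetries: number;      // CI retry cycles used so far
  feedbackCycles: number; // reviewer-feedback cycles used so far
}

type PrEvent = 'ci_failed' | 'changes_requested' | 'ci_passed_approved';

function nextAction(
  state: RemediationState,
  event: PrEvent,
): 'MERGE' | 'EXECUTE' | 'FIX_FEEDBACK' | 'ESCALATE' {
  if (event === 'ci_passed_approved') return 'MERGE';
  if (state.ciRetries + state.feedbackCycles >= 4) return 'ESCALATE'; // total budget spent
  if (event === 'ci_failed') {
    return state.ciRetries < 2 ? 'EXECUTE' : 'ESCALATE';
  }
  return state.feedbackCycles < 2 ? 'FIX_FEEDBACK' : 'ESCALATE';
}
```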

How feedback flows

  1. PRFeedbackService polls GitHub every 60 seconds for new review activity
  2. On changes_requested: feedback is collected and sent to the agent
  3. The agent addresses feedback in the worktree and pushes
  4. CI re-runs, CodeRabbit re-reviews
  5. On approved + CI passing → MERGE

The PR remediation loop handles CI failures automatically by analyzing feedback and pushing fixes.

Trajectory store

TrajectoryStoreService persists verified execution trajectories for learning.

Storage

.automaker/trajectory/{featureId}/attempt-{N}.json

Each trajectory records:

  • Feature metadata (ID, title, complexity)
  • Execution outcome (success/failure)
  • Key decisions the agent made
  • Recovery strategies that worked
  • Failure patterns encountered
  • Duration and token usage

Non-blocking writes

Trajectory writes are fire-and-forget. They never block the agent execution loop. If the write fails (disk full, permissions), the feature still completes normally.
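A fire-and-forget write of this shape might look like the following. The path layout matches the Storage section above; the function name and error handling are illustrative assumptions:

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

// Sketch: write .automaker/trajectory/{featureId}/attempt-{N}.json without
// blocking the caller. Callers do not await the returned promise, and errors
// (disk full, permissions) are swallowed so the feature still completes.
function writeTrajectory(
  baseDir: string,
  featureId: string,
  attempt: number,
  trajectory: unknown,
): Promise<void> {
  const file = path.join(
    baseDir, '.automaker', 'trajectory', featureId, `attempt-${attempt}.json`,
  );
  return fs
    .mkdir(path.dirname(file), { recursive: true })
    .then(() => fs.writeFile(file, JSON.stringify(trajectory, null, 2)))
    .catch((err) => {
      console.warn('trajectory write failed (non-blocking):', err);
    });
}
```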

Sibling reflections

When a feature enters EXECUTE, the Lead Engineer loads trajectories from recently completed sibling features:

```typescript
const siblings = features
  .filter((f) => f.status === 'verified' && f.lastExecutionTime)
  .sort((a, b) => (b.lastExecutionTime || 0) - (a.lastExecutionTime || 0))
  .slice(0, 3); // max 3 reflections
```

Sibling matching: Same epicId (if in an epic) or same projectSlug (if standalone).

These reflections are injected into the agent's context as "Lessons from Similar Features" (max ~500 tokens), giving each agent the benefit of what prior agents learned.

Source: apps/server/src/services/trajectory-store-service.ts

Per-feature reflection loop

After each feature reaches DONE, a lightweight reflection is generated.

How it works

  1. DeployProcessor.generateReflection() fires non-blocking after marking a feature done
  2. Reads the tail of agent-output.md (last 2,000 chars) plus execution metadata
  3. Calls simpleQuery() with Haiku (maxTurns: 1, no tools) to produce a structured reflection under 200 words
  4. Writes result to .automaker/features/{id}/reflection.md
  5. Emits feature:reflection:complete event

Feed-forward

Reflections from completed siblings are loaded during EXECUTE (see Trajectory Store above). This creates an in-project learning loop — each feature benefits from the last.

Cost

~$0.001 per reflection (Haiku, single turn, no tools). Fire-and-forget — failure does not block the state machine.

Observability

Reflection LLM calls are traced in Langfuse with:

  • Tags: feature:{id}, role:reflection
  • Metadata: featureId, featureName, agentRole: 'reflection'

FailureClassifierService

Pattern-matches escalation reason strings to structured failure categories and recovery strategies.

Purpose

When the Lead Engineer's ESCALATE state receives an escalation reason string (e.g., "Rate limit exceeded", "Tests failed after 3 retries"), the classifier maps it to a FailureCategory and suggests a RecoveryStrategy.
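A reason-string classifier of this kind is essentially an ordered list of pattern checks. The regexes below are illustrative assumptions, not the service's actual patterns:

```typescript
// Sketch of pattern-matching escalation reasons to failure categories.
type FailureCategory =
  | 'transient' | 'rate_limit' | 'test_failure' | 'authentication' | 'unknown';

// First matching pattern wins; order roughly most-specific first.
const PATTERNS: Array<[RegExp, FailureCategory]> = [
  [/rate limit|429|throttl/i, 'rate_limit'],
  [/timeout|ECONNRESET|socket hang up|DNS/i, 'transient'],
  [/tests? (failed|failure)/i, 'test_failure'],
  [/api key|token (expired|revoked)|unauthorized/i, 'authentication'],
];

function categorizeFailure(reason: string): FailureCategory {
  for (const [pattern, category] of PATTERNS) {
    if (pattern.test(reason)) return category;
  }
  return 'unknown'; // unclassified → default strategy is escalate_to_user
}
```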

Integration

Called by EscalateProcessor.process() in the Lead Engineer state machine. The classified category determines:

  • Whether to retry or escalate
  • Which model tier to use on retry
  • What context to inject into the agent prompt

Source: apps/server/src/services/failure-classifier-service.ts

Crash recovery scan

On server startup, a non-blocking worktree scan detects stranded work from crashed agent sessions.

How it works

After resumeInterruptedFeatures() completes, scanWorktreesForCrashRecovery() runs via setImmediate():

  1. Lists all worktrees via git worktree list --porcelain
  2. Cross-references each worktree with its feature status
  3. For features in verified/done with uncommitted or unpushed work:
    • Commits stranded changes via ensureCleanWorktree()
    • Pushes to remote
    • Triggers runPostCompletionWorkflow() (PR creation)
  4. For features in other states with stranded work: logs a warning
  5. Emits maintenance:crash_recovery_scan_completed with summary
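Step 1 relies on `git worktree list --porcelain`, which emits one stanza per worktree (`worktree <path>`, `HEAD <sha>`, `branch <ref>`). A minimal parser for that real output format might look like this (the parser itself and the sample paths are illustrative, not the service's code):

```typescript
// Sketch: parse `git worktree list --porcelain` output into path/branch pairs.
interface WorktreeInfo {
  path: string;
  branch?: string; // absent for detached-HEAD worktrees
}

function parseWorktrees(porcelain: string): WorktreeInfo[] {
  const result: WorktreeInfo[] = [];
  let current: WorktreeInfo | null = null;
  for (const line of porcelain.split('\n')) {
    if (line.startsWith('worktree ')) {
      if (current) result.push(current); // previous stanza is complete
      current = { path: line.slice('worktree '.length) };
    } else if (line.startsWith('branch ') && current) {
      current.branch = line.slice('branch '.length);
    }
    // HEAD lines and blank stanza separators are ignored here.
  }
  if (current) result.push(current);
  return result;
}
```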

When it triggers

  • Server startup only (not a cron task)
  • Non-blocking — does not delay server initialization or request handling
  • Fire-and-forget — scan failures are logged but don't crash the server

Source: apps/server/src/services/maintenance-tasks.ts

Git workflow error surfacing

When git operations (commit, push, PR creation) fail after agent execution, the error is stored on the feature for UI visibility.

The gitWorkflowError field

```typescript
interface Feature {
  gitWorkflowError?: {
    message: string; // Error description
    timestamp: string; // ISO 8601 when the error occurred
  };
}
```

All 4 git workflow catch blocks in auto-mode-service.ts persist errors to feature.json instead of silently logging them. Feature status remains unchanged (e.g., stays verified) — the error field provides a separate visibility channel.

Source: libs/types/src/feature.ts, apps/server/src/services/auto-mode-service.ts

Event-driven observability

All reliability services emit events for real-time UI updates and audit logging:

| Service | Event prefix | Key events |
| --- | --- | --- |
| RecoveryService | recovery_* | analysis, started, completed, recorded, escalated, lesson_generated |
| EscalationRouter | escalation:* | signal-received, deduplicated, sent, failed, routed, acknowledged |
| FeatureHealthService | (via auto-mode) | Issues surface through escalation events |
| AutoModeService | feature:* | status-changed, completed, error |
| Lead Engineer | feature:* | reflection:complete, pr-merged, state-changed |
| Maintenance | maintenance:* | crash_recovery_scan_completed |

Recovery architecture diagram

Feature Execution Fails

RecoveryService.analyzeFailure()
    ├── categorizeFailure() → FailureCategory
    ├── determineStrategy() → RecoveryStrategy
    ├── recordRecoveryAttempt() → JSONL log + emit events
    └── checkAndGenerateLessons() (after 3+ failures)
        └── Write failure-lessons-{category}.md to context/

Recovery result: { success, shouldRetry, actionTaken }
    ├── If retryable → AutoModeService.retry() with injected context
    └── If escalate → EscalationRouter.routeSignal()
        ├── Dedup check (30-min window)
        ├── Rate limit check (per-channel)
        └── Send to registered channels
            └── EscalationLogEntry recorded for audit trail

(Parallel) FeatureHealthService.audit()
    └── Check for drift: orphaned refs, stale running, merged branches
    └── Auto-fix if enabled → Update feature status

Lead Engineer State Machine
    ├── [EXECUTE] Load sibling reflections from trajectory store
    ├── [REMEDIATION] Inject failure context + review feedback
    └── [ESCALATE] FailureClassifierService maps reason → category
