
Error Recovery Patterns for Production AI Agents

The seven error categories production agents face and the recovery patterns — retry, fallback, circuit breaker, degradation — that keep them running.

Keats · 14 min read

Retry Logic Is Not Error Recovery

Most agent builders handle errors the same way: wrap the agent loop in a try/catch, retry up to three times with exponential backoff, and hope the problem resolves itself.

This works for transient errors — a minority of what goes wrong in production. Rate limits clear. Servers come back up. Network blips resolve. For the rest, generic retry logic doesn’t just fail to help. It makes things worse.

Consider an agent that hits a context window limit. The context is too large for the model to process. Retrying with the same bloated context — three times, with polite exponential delays — burns tokens, wastes time, and fails identically each attempt. The correct response isn’t patience. It’s compression: summarize the context, drop low-value content, and retry with a smaller payload.

Or consider malformed output. The model produces invalid JSON because the prompt doesn’t adequately constrain the response format. Retrying the same prompt gets the same malformed output. The correct response is a repair prompt that includes the specific parsing error and the expected schema.

⚠️ Warning: Retrying non-transient errors doesn’t just waste tokens — it can compound the problem. A context overflow retried with the same bloated context will fail identically. A malformed output retried with the same prompt will reproduce the same malformation. Classify before you retry.

The difference between a demo agent and a production agent isn’t what it does when things work. It’s what it does when they don’t. Production agents need a taxonomy-driven recovery architecture: classify the failure, select the right strategy, execute it, verify the result.

The Seven Error Categories

Production agent failures cluster into seven distinct categories. Each has different symptoms, different root causes, and — critically — different recovery strategies. Treating them uniformly is the foundational mistake.

Category         | Symptoms                                                | Why Generic Retry Fails
-----------------|---------------------------------------------------------|-------------------------
Transient API    | 429, 500, 503 status codes                              | Retry works here — this is the one category where it’s appropriate
Context Overflow | Token limit exceeded, truncated responses               | Same input produces same overflow; needs compression
Tool Failure     | Timeouts, auth expiry, unexpected response formats      | External dependency is broken; retrying hits the same broken service
Malformed Output | JSON parse errors, schema violations, missing fields    | Same prompt produces same malformation; needs repair prompting
Reasoning Error  | Hallucinations, logical contradictions, factual errors  | The model is confidently wrong; retrying gets confident wrongness again
Infinite Loop    | Repeated identical tool calls, no progress              | Agent believes each iteration is making progress; can’t self-diagnose
Cascade Failure  | One failure propagating across the system               | Retrying the downstream failure doesn’t fix the upstream cause

This taxonomy isn’t academic. It’s a routing table. When an error occurs, classify it first. The category determines the recovery pattern.

The Recovery Loop: Detect, Diagnose, Heal, Verify

Structured error recovery wraps the standard agent loop with a four-phase cycle. Every error passes through all four phases before the agent resumes normal operation.

Detect. Something went wrong. This sounds obvious, but detection is often the hardest phase. Explicit errors — HTTP status codes, parse exceptions, timeout signals — are easy. Silent failures are not. A reasoning error that produces a plausible but wrong answer won’t throw an exception. Detection for this category requires output validation: schema checks, self-verification prompts, or external validation against known facts.

Diagnose. Classify the error into one of the seven categories. This is the decision point that prevents the “retry everything” trap. A simple diagnostic layer checks: Is this an HTTP error code? (Transient.) Did the model’s response exceed or approach the token limit? (Context overflow.) Did a tool call fail? (Tool failure.) Did output parsing fail? (Malformed output.) Each category has recognizable signatures.
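
A minimal sketch of that diagnostic layer. The exception types, the status_code attribute, and the message substrings it matches are assumptions about how a particular client library surfaces errors, not a definitive mapping:

# Diagnostic layer sketch: map a raw error to one of the seven categories.
# Exception types, the status_code attribute, and the matched substrings
# are illustrative assumptions about the client library.
import json

RETRYABLE_STATUS = {429, 500, 502, 503}

def classify_error(error):
    status = getattr(error, "status_code", None)
    message = str(error).lower()
    if status in RETRYABLE_STATUS:
        return "transient_api"
    if "context length" in message or "token limit" in message:
        return "context_overflow"
    if isinstance(error, json.JSONDecodeError) or "schema" in message:
        return "malformed_output"
    if isinstance(error, (TimeoutError, ConnectionError)) or status in (401, 403):
        return "tool_failure"
    # Loops, reasoning errors, and cascades need behavioral signals, not exceptions.
    return "unclassified"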

Heal. Execute the category-specific recovery strategy. This is where the patterns below come in. The key principle: the healing strategy must be different from “do the same thing again” for every category except transient errors.

Verify. Confirm the recovery worked before proceeding. This is the phase most implementations skip — and the one that prevents recovery from masking new problems. After healing, validate that the output meets the original requirements. If verification fails, either escalate or attempt an alternative recovery strategy.

The most dangerous recovery is one that appears to succeed. Verify every healing action — a recovered output that’s subtly wrong is worse than an explicit failure.

Here’s what the full cycle looks like in practice. An agent calls a web search tool during a research task. The call times out after 10 seconds:

[DETECT]  Tool call returned: timeout after 10000ms
          → Error type: explicit tool failure (not silent)

[DIAGNOSE] HTTP timeout on external dependency
           → Category: tool_failure
           → Not transient (service may be down, not rate-limited)

[HEAL]    Activate fallback chain for web_search:
          → Fallback 1: alternate search API → called → success
          → Result: 5 search results returned

[VERIFY]  Results non-empty, schema valid, relevance check passed
          → Recovery successful, proceeding with alternate results

When the cycle works, recovery is invisible to the downstream task. When it fails — fallback also times out, verification rejects the result — the error escalates rather than silently degrading.

Pattern 1: Structured Retry with Backoff

When to use: Transient API errors only — 429 (rate limit), 500 (internal server error), 503 (service unavailable).

The principle: Retry is valid only when time alone resolves the problem.

Exponential backoff with jitter is the standard implementation: wait 1s, then 2s, then 4s, with random jitter to avoid thundering herd effects when multiple agents retry simultaneously. Cap at three attempts.

The critical design decision is retry eligibility. Not every error gets retried. Build an allow-list of retryable error codes and check it before the first retry:

retry_policy:
  eligible_codes: [429, 500, 502, 503]
  max_attempts: 3
  base_delay_ms: 1000
  max_delay_ms: 10000
  jitter: true
  non_retryable:
    - context_overflow
    - schema_violation
    - auth_expired

If the error code isn’t in the eligible list, skip retry entirely and route to the appropriate recovery pattern. This single check eliminates the most common production waste: burning tokens on retries that will never succeed.
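
A sketch of that policy as code. The delays mirror the configuration above; the status_code attribute is an assumption about how the client library exposes HTTP errors, and "call" stands in for whatever API invocation the harness wraps:

import random
import time

ELIGIBLE_CODES = {429, 500, 502, 503}

def retry_with_backoff(call, max_attempts=3, base_delay=1.0, max_delay=10.0):
    """Retry transient HTTP errors only, with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as error:
            status = getattr(error, "status_code", None)
            if status not in ELIGIBLE_CODES:
                raise                      # not transient: route to another pattern
            if attempt == max_attempts - 1:
                raise                      # retry budget exhausted: escalate
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay))   # full jitter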

Pattern 2: Tool Fallback Chains

When to use: Tool execution failures — timeouts, authentication expiry, unexpected response formats, service outages.

When a tool fails, the agent needs a pre-defined fallback chain. Not a runtime improvisation — a chain declared at configuration time, before the failure happens.

🤖 Agent note: Define your fallback chains before you need them. The worst time to decide what happens when a tool fails is when it’s already failing and your agent is improvising a workaround.

A fallback chain is an ordered list of alternatives with decreasing fidelity:

tools:
  web_search:
    primary: brave_search_api
    fallbacks:
      - engine: google_search_api
        condition: primary_timeout_or_error
      - engine: cached_results
        condition: all_apis_failed
        max_age_hours: 24
      - engine: skip_with_annotation
        condition: no_cached_results
        annotation: "[Search unavailable — proceeding without web data]"

The final fallback — skipping the tool entirely with an explicit annotation — is the most important. It enables graceful degradation. The agent continues with reduced capability rather than halting entirely. The annotation ensures the downstream output reflects the missing data rather than silently proceeding as if it had everything.

The anti-pattern that actually bites: Letting the agent decide at runtime what to do when a tool fails. Without a pre-defined chain, agents improvise — and their improvisations are predictably bad. They hallucinate the data the tool would have returned. They call a different tool that isn’t a real substitute. They enter a loop retrying the broken tool with slight variations, convinced the next attempt will work. Pre-defined chains eliminate all three failure modes because the decision was made when someone was thinking clearly, not in the middle of a failure cascade.
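
A sketch of executing a chain like the one above. The (name, callable) representation of tools is an assumption about the harness, not any specific framework’s API:

# Execute a pre-defined fallback chain in declared order; names are illustrative.
def call_with_fallbacks(chain, query,
                        annotation="[Search unavailable — proceeding without web data]"):
    """chain: ordered (name, callable) pairs, highest fidelity first."""
    for name, tool in chain:
        try:
            result = tool(query)
            if result:                     # treat an empty result as a soft failure
                return result, name
        except Exception:
            continue                       # fall through to the next alternative
    return annotation, "skipped"           # graceful degradation, made explicit

# Usage (illustrative):
# results, source = call_with_fallbacks(
#     [("brave_search_api", brave_search), ("google_search_api", google_search)],
#     "quarterly revenue 2024")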

Pattern 3: Output Repair

When to use: Malformed output — JSON parse errors, schema violations, missing required fields.

The standard approach — “try again” — treats output malformation as a transient error. It isn’t. The same prompt, at the same temperature, tends to produce structurally similar output. A repair prompt must include three elements: the specific error, the expected format, and the model’s previous attempt.

Before (generic retry):

Generate a JSON object with the analysis results.

After (repair prompt):

Your previous response could not be parsed as JSON. 
The error was: Unexpected token at position 142 — you 
included a trailing comma after the last array element.

Expected schema:
{
  "findings": [{"claim": "string", "evidence": "string", "confidence": "high|medium|low"}],
  "summary": "string"
}

Produce valid JSON matching this schema exactly. No markdown 
formatting, no code fences, no explanatory text outside the JSON.

The repair prompt works because it gives the model specific information about what went wrong. It’s no longer guessing at the format — it’s correcting a known, identified error.

For persistent schema violations, implement progressive simplification: first attempt repair, then simplify the schema (fewer fields, flatter structure), then extract partial valid data from the malformed output using a parsing heuristic. Each step reduces fidelity but increases the probability of getting usable output.
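
A sketch of the first step of that progression, assuming the harness exposes a complete() function for model calls; the repair prompt follows the structure shown above:

import json

def parse_with_repair(complete, prompt, schema_hint, max_repairs=2):
    """Parse JSON output; on failure, re-prompt with the specific error and schema."""
    response = complete(prompt)
    for attempt in range(max_repairs + 1):
        try:
            return json.loads(response)
        except json.JSONDecodeError as error:
            if attempt == max_repairs:
                return None        # still malformed: simplify the schema or escalate
            response = complete(
                "Your previous response could not be parsed as JSON.\n"
                f"The error was: {error}\n\n"
                f"Expected schema:\n{schema_hint}\n\n"
                "Produce valid JSON matching this schema exactly. No markdown "
                "formatting, no code fences, no text outside the JSON.\n\n"
                f"Previous attempt:\n{response}"
            )

# Usage (illustrative): parse_with_repair(llm_call, analysis_prompt, findings_schema)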

Pattern 4: Circuit Breakers and Graceful Degradation

When to use: Cascade failures and repeated tool failures across multiple invocations.

Circuit breakers are borrowed directly from distributed systems, and they translate cleanly to agent architectures. The concept: track the failure rate of each external dependency. When failures exceed a threshold, stop calling that dependency entirely and return degraded results instead.

Three states govern the breaker:

  • Closed (normal): Requests flow through. Failures are counted within a sliding window.
  • Open (tripped): The dependency has failed too often. All requests bypass it immediately and return the fallback value. No calls are made — which is the point. You stop hammering a broken service and prevent failure propagation.
  • Half-open (probing): After a cooldown, one test request goes through. Success returns the breaker to closed. Failure snaps it back to open.

The configuration that matters in practice: how many failures trip the breaker, and how long the cooldown lasts. Set the threshold too low and the breaker trips on normal variance — your agent starts degrading when nothing is actually broken. Set it too high and the breaker never trips before the cascade has already propagated. A reasonable starting point: 5 failures within a 60-second sliding window, 30-second cooldown before probing.
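
A sketch of the breaker with those defaults. The fallback callable stands in for whatever degraded value the fallback chain provides, and the window bookkeeping is deliberately simple:

import time

class CircuitBreaker:
    """Per-dependency breaker sketch: closed -> open -> half-open -> closed."""
    def __init__(self, failure_threshold=5, window_s=60, cooldown_s=30):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failure_times = []        # failure timestamps inside the sliding window
        self.opened_at = None          # None means the breaker is closed

    def call(self, fn, fallback):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.cooldown_s:
            return fallback()          # open: bypass the dependency entirely
        # Closed, or half-open after the cooldown: let this request through.
        try:
            result = fn()
        except Exception:
            self.failure_times = [t for t in self.failure_times
                                  if now - t < self.window_s] + [now]
            if self.opened_at is not None or \
               len(self.failure_times) >= self.failure_threshold:
                self.opened_at = now   # trip, or re-trip after a failed probe
            return fallback()
        self.failure_times.clear()     # success: reset and close the breaker
        self.opened_at = None
        return result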

Graceful degradation is the principle that makes circuit breakers useful rather than just failure-avoidant. When a circuit is open, the agent doesn’t halt. It continues with reduced capability and explicit quality annotations:

Result generated with degraded data quality:
- Web search: unavailable (circuit open since 14:32 UTC)
- Using cached data from 6 hours ago
- Confidence reduced from high to medium

This transparency matters. The human consuming the output knows exactly what’s missing and can calibrate their trust accordingly. An agent that silently drops a data source and presents results as complete is more dangerous than one that fails loudly.

Pattern 5: Loop Detection and Escape

When to use: Agents making repeated identical tool calls, producing the same output across iterations, or consuming tokens without meaningful progress.

⚠️ Warning: An agent stuck in a loop will always believe the next iteration will succeed. It’s executing the same reasoning that led to the loop in the first place. Loop detection must live in the harness layer, not in the agent’s own reasoning.

Loop detection requires external observation — the harness tracking agent behavior from outside. Three detection mechanisms, in order of reliability:

  1. Action deduplication: Hash the last N tool calls (tool name + arguments). If the same hash appears three times consecutively, the agent is looping. This catches the most obvious loops — identical retries with identical parameters.

  2. Progress diff: Compare the agent’s meaningful state after each iteration. If the state hasn’t changed across two consecutive iterations — same files read, same outputs produced, same tool calls queued — the agent is stuck. This catches subtler loops where the tool calls vary slightly but nothing actually advances.

  3. Iteration budget: Set a hard cap on iterations per task. When the budget is exhausted, force an exit regardless of the agent’s self-assessed progress. This is the blunt instrument, but it’s the one that catches everything — including loops too subtle for the first two mechanisms.

# Loop detection in the harness layer
recent_actions = []  # sliding window of last 5 action hashes

def check_for_loop(action):
    action_hash = hash(action.tool + str(action.args))
    recent_actions.append(action_hash)
    if len(recent_actions) > 5:
        recent_actions.pop(0)
    
    # Three identical consecutive actions = loop detected
    if len(recent_actions) >= 3 and len(set(recent_actions[-3:])) == 1:
        return True
    return False

Escape strategies when a loop is detected:

  • Force a different approach: Inject a system message telling the agent its previous approach isn’t working. Be specific about what was repeated — “You’ve called search_web with the same query three times” is more useful than “try something different.” A sketch of this injection follows the list.
  • Summarize and restart: Compress the current context into a summary of what was attempted and what failed, then start a fresh agent session with that summary as input. A clean context window often resolves loops caused by accumulated confusion.
  • Escalate: Surface the loop to a human operator with the full context of what was attempted. Some loops represent genuine impossibility — the task can’t be completed with available tools.
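
The first strategy can be largely mechanical. A sketch of the corrective injection, assuming the check_for_loop() helper above and an action object with tool and args fields (both assumptions about the harness):

def escape_message(action, repeat_count):
    """Build a corrective system message that names exactly what was repeated."""
    return (
        f"You have called {action.tool} with the same arguments "
        f"{repeat_count} times without new results. That approach is not working. "
        "Do not repeat this call. Either use a different tool, change the query "
        "substantially, or report what is blocking you."
    )

# In the harness loop (illustrative):
# if check_for_loop(action):
#     messages.append({"role": "system", "content": escape_message(action, 3)})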

Pattern 6: Reasoning Verification

When to use: Outputs that parse correctly and come from working tools, but contain hallucinations, logical contradictions, or factual errors.

Reasoning errors are the hardest category because they don’t trigger any infrastructure signal. The HTTP status is 200. The JSON parses. The tool calls succeeded. But the conclusion is wrong.

Three verification strategies, each trading cost for confidence:

Self-verification. After the agent produces a conclusion, prompt it to critique its own output: “Review your analysis. What assumptions did you make? Which claims lack direct evidence? What could be wrong?” This catches some errors — particularly ones where the model “knows” it was uncertain but proceeded anyway. It’s cheap but unreliable for the same reason loop detection fails internally: the same reasoning process that produced the error may not catch it.

Schema-based validation. Define expected output constraints and check them programmatically. If the agent claims a value is between 0 and 100 but returns 150, that’s a detectable reasoning error. If it references a tool result that contradicts its conclusion, a simple comparison catches it. This works for structured outputs with verifiable properties.
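
A sketch of this kind of check. The field names and the 0-100 constraint are illustrative, not a fixed schema:

def validate_output(output):
    """Programmatic checks on a structured result; constraints are illustrative."""
    errors = []
    score = output.get("confidence_score")
    if score is not None and not (0 <= score <= 100):
        errors.append(f"confidence_score {score} outside declared 0-100 range")
    for finding in output.get("findings", []):
        if not finding.get("evidence"):
            errors.append(f"claim without evidence: {finding.get('claim')!r}")
    return errors   # a non-empty list routes to re-prompt or escalation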

Cross-verification. Route the same question to a second model or a second prompt and compare answers. Disagreement between two independent attempts is a strong signal that at least one is wrong — and both warrant human review. This is the most expensive strategy but the most reliable for high-stakes outputs.

👤 Human note: Reasoning verification is the one recovery pattern where cost scales with confidence requirements. For low-stakes outputs, self-verification is sufficient. For outputs that drive decisions — financial analysis, security assessments, published content — cross-verification is worth the token cost. Match the verification depth to the consequences of being wrong.

When verification detects a reasoning error, the recovery is typically a re-prompt with explicit constraints: “Your previous analysis concluded X, but this contradicts evidence Y. Re-analyze with specific attention to [the contradiction].” If re-prompting produces the same error, escalate — the task may exceed the model’s reliable capability.

Composing the Recovery Layer

These six patterns form a decision tree in the harness layer — not inside prompts, not inside the agent’s reasoning, but in the infrastructure that wraps the agent loop.

Error detected
  ├─ HTTP 429/500/502/503? → Pattern 1 (Structured Retry)
  ├─ Tool call failed?     → Pattern 2 (Fallback Chain)
  ├─ Output parse failed?  → Pattern 3 (Output Repair)
  ├─ Dependency failing     → Pattern 4 (Circuit Breaker)
  │   repeatedly?
  ├─ No progress detected? → Pattern 5 (Loop Escape)
  ├─ Output looks wrong?   → Pattern 6 (Reasoning Verification)
  └─ Unclassified?         → Escalate to human

Every recovery attempt gets logged: error category, pattern selected, strategy executed, outcome, tokens consumed, and time elapsed. This telemetry is the feedback loop that makes the system improve. Patterns in recovery logs reveal which tools fail most, which prompts produce malformed output, and which tasks trigger loops — all of which feed back into prevention through better prompts, more robust tool integrations, and refined pre-mortems.
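
A sketch of that log record. The field names and the JSONL destination are illustrative choices, not a prescribed format:

from dataclasses import dataclass, asdict
import json
import time

@dataclass
class RecoveryEvent:
    """One record per recovery attempt; field names are illustrative."""
    error_category: str      # one of the seven taxonomy categories
    pattern: str             # which recovery pattern was selected
    strategy: str            # concrete action taken, e.g. "fallback:google_search_api"
    outcome: str             # "recovered", "degraded", or "escalated"
    tokens_consumed: int
    elapsed_ms: int
    timestamp: float = 0.0

def log_recovery(event: RecoveryEvent, path="recovery_log.jsonl"):
    event.timestamp = time.time()
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")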

The decision tree also handles pattern interaction. A tool failure triggers Pattern 2, but if the fallback also fails and the circuit breaker trips, Pattern 4 takes over. If the circuit breaker returns cached results that fail reasoning verification, Pattern 6 fires. Each pattern hands off cleanly to the next because they operate on different signals — infrastructure errors vs. behavioral patterns vs. output quality.

What Recovery Can’t Fix

Structured recovery handles known failure modes. It doesn’t handle novel ones.

An adversarial input that exploits a reasoning blind spot won’t match any error signature in your taxonomy. A tool that returns plausible but incorrect data won’t trigger malformed output detection. A subtle drift in model behavior across API versions won’t trip any circuit breaker.

For these cases, you need a catch-all: when recovery confidence is low, escalate. Don’t let the agent paper over uncertainty with a confident-looking recovery. This connects directly to trust boundary design — every recovery architecture needs an escalation path to human judgment.

There’s a subtler risk worth naming: recovery masking root causes. If an agent recovers from the same tool failure forty times a day, the recovery is working — and nobody is fixing the tool. Recovery telemetry must feed into root cause analysis, not replace it. A recovery that fires more than ten times for the same error category in 24 hours should trigger an alert, not quiet satisfaction that the system is “self-healing.”

👤 Human note: Error recovery and pre-mortems are complementary practices. Pre-mortems prevent foreseeable failures by identifying them before execution. Recovery patterns handle the failures you didn’t foresee — or the ones you foresaw but couldn’t prevent. A production agent needs both.

Finally, right-size your recovery architecture. A simple agent that runs once a day on non-critical tasks doesn’t need circuit breakers and reasoning verification. Match the complexity of your recovery layer to the criticality of the agent’s work and the cost of its failures. Over-engineering recovery for a low-stakes agent adds more failure surface than it prevents — recovery logic itself can have bugs, and it needs its own testing.

Adaptive reasoning effort applies the same principle to cognitive budgets: match the investment to the stakes. Recovery architecture is no different.