# Pre-Mortems for Agent Tasks
A focused technique for catching foreseeable failures before a multi-step agent task runs. One method, applied well, prevents a class of avoidable mistakes.
## What a pre-mortem is
A pre-mortem is a planning technique: before you start a task, you ask "what would cause this to fail?" You imagine the failure as already having occurred, then work backward to find what went wrong. The goal is to surface the most likely failure causes while you can still do something about them — not after the agent has already made three irreversible writes.
The pre-mortem was popularized by research psychologist Gary Klein as a project-planning exercise, but the principle applies directly to agent operations. Agents are good at moving forward and structurally weak at anticipating compound failure. A short pre-mortem compensates for that weakness. You're supplying the skepticism the agent won't naturally apply to its own plan.
The act of naming likely failure causes changes behavior. It moves you from "this seems fine" to "here's what I'll check first."
## When to use one
Not every task needs a pre-mortem. A quick text edit or a reversible file read doesn't justify the overhead. Pre-mortems pay off when:
- The task is multi-step. Each step introduces a new failure surface. The longer the chain, the more a single wrong assumption propagates.
- The action is irreversible. Publishing content, sending messages, deleting files, running migrations, or deploying changes. If you can't easily undo it, inspect it first.
- The task involves external dependencies. Third-party APIs, credentials, live databases, or services where you don't control the failure mode.
- You're delegating to a subagent. Once you've handed off context, catching a misaligned assumption becomes expensive. Surface it before the handoff.
⚠️ Warning: Pre-mortems are not an excuse to stall. If the task is reversible and low-cost, keep it to sixty seconds or skip it. This technique should reduce friction on expensive mistakes, not add bureaucracy to every small action.
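The four criteria above can be sketched as a simple gate. This is an illustrative sketch, not a real agent-framework API; the `Task` flags and `needs_premortem` name are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Illustrative flags, not a real agent framework API.
    multi_step: bool = False
    irreversible: bool = False
    external_deps: bool = False
    delegated: bool = False

def needs_premortem(task: Task) -> bool:
    # Any one criterion is enough to justify the two-to-four minutes.
    return any([task.multi_step, task.irreversible,
                task.external_deps, task.delegated])
```

A quick reversible edit hits none of the flags and skips the overhead; a deploy or a subagent handoff trips the gate.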
## The 4-step checklist
Run this before any task that meets the criteria above. It should take two to four minutes, not twenty.
### Step 1: State what "done" looks like in observable terms
Not "the task is complete" — something you can actually verify. Files changed. Test passing. Message sent. Command exiting zero. If you can't state a concrete completion condition, you don't yet have a well-defined task. This is a prerequisite for everything else.
### Step 2: List the most likely failure causes
Force yourself to name three to five specific ways this could break. Not generic risks — actual risks for this task. Missing credentials? Wrong working directory? Unclear scope that could cause the agent to expand beyond the intended change? Dependency on a file that may not exist? Rate limit on an API being called repeatedly?
### Step 3: Convert each risk into a concrete check
For each failure cause you identified, assign a specific mitigation. That might be a verification step you run before starting, a constraint you add to the instructions, or a clear escalation condition. If a risk has no mitigation, that's worth noting — it means you're accepting that risk consciously, not by default.
### Step 4: Confirm the rollback path
For irreversible or high-consequence tasks: what does recovery look like if this goes wrong? If there's no rollback path, the pre-mortem may surface that the task needs to be restructured before it runs. That's not failure — that's the pre-mortem working correctly.
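Steps 2 through 4 fit in one small record. A sketch under assumed names (`Risk`, `PreMortem`, and their fields are illustrative, not a standard structure):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Risk:
    cause: str
    mitigation: Optional[str] = None  # None = consciously accepted

@dataclass
class PreMortem:
    done_state: str
    risks: List[Risk] = field(default_factory=list)
    rollback: Optional[str] = None
    irreversible: bool = False

    def accepted_risks(self) -> List[str]:
        # Unmitigated risks are surfaced so acceptance is a decision,
        # not a default (Step 3).
        return [r.cause for r in self.risks if r.mitigation is None]

    def ready_to_run(self) -> bool:
        # Step 4: an irreversible task with no rollback path should be
        # restructured before it runs.
        if self.irreversible and self.rollback is None:
            return False
        return bool(self.done_state)
```

The `ready_to_run` gate is the pre-mortem working correctly: an irreversible task with no rollback path is blocked, not merely flagged.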
## Before and after: a real example
Here's what this looks like in practice. The task: deploy updated copy for three product pages to production.
### Without a pre-mortem
The agent runs a build, gets green, pushes to production. Two of the three pages render correctly. The third silently breaks because a component path changed during the rewrite and the build step doesn't catch broken component imports — only syntax errors. The issue isn't caught until a user reports a blank page an hour later.
### With a pre-mortem
Before running the deploy:
## Pre-mortem: Deploy product page copy updates
Done state: Three pages render correctly in production with updated copy.
Failure causes:
1. Build passes but component path is broken (silent render failure)
2. Production deploy triggers but staging wasn't verified first
3. Wrong branch deployed due to ambiguous instructions
Mitigations:
1. Add visual smoke test: curl each page after deploy and check for non-empty body
2. Mandate staging deploy and manual review before production
3. Confirm branch name explicitly in deploy command
Rollback path: Revert the deploy commit and push; all three pages can be restored in under 5 minutes.

The extra three minutes caught the scenario that would have caused an hour of incident response. That's the return on investment the pre-mortem is selling.
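Mitigation 1 can be made concrete as a small post-deploy smoke test. A sketch, not the deploy tooling from the example; the byte threshold and function names are assumptions, and the fetcher is injectable so the check can run against stubs:

```python
from urllib.request import urlopen

def fetch_body(url, timeout=10):
    # Default fetcher; replaced with a stub in tests or dry runs.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def smoke_test(urls, fetch=fetch_body, min_bytes=500):
    # Return the URLs whose body looks empty or truncated. min_bytes is
    # a crude assumed threshold: a blank page from a broken component
    # import ships far less markup than a real render.
    return [u for u in urls if len(fetch(u)) < min_bytes]
```

A length check is deliberately dumb: it won't catch subtle rendering bugs, but it catches exactly the silent blank-page failure the build step missed.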
🤖 Agent note: When I run a pre-mortem before a subagent handoff, I include the completed checklist in the task brief. This forces me to be specific about scope and gives the subagent a concrete completion condition to verify against, not just a vague instruction to "be careful."
## Common mistakes when running pre-mortems
Being too vague. "Something might go wrong with the API" is not useful. "The token may have expired since last used three days ago" is. Specificity is what makes the technique work.
Listing risks without mitigations. Naming failure causes and then ignoring them is theater. Each identified risk needs a concrete response, even if that response is "accept and document."
Skipping it for tasks that feel familiar. Familiar tasks fail in familiar ways precisely because the pre-mortem habit has been dropped. The most expensive incidents often happen on routine operations, not novel ones.
👤 Human note: If you're reviewing agents, ask to see the pre-mortem before a deployment or multi-step task runs. The quality of the pre-mortem is usually a reliable signal for the quality of the execution that follows.
## This is one technique, not the full picture
Pre-mortems handle foreseeable failures well. They don't help much with failure modes that are genuinely novel, or with recovery once things have already gone wrong. A complete error management approach also needs failure classification, bounded retry logic, escalation triggers, and honest verification loops.
Blueprint covers the full error classification and recovery framework — including how to encode pre-mortem habits into your agent's default operating procedures so they run automatically, not only when you remember to ask for them.