What is agent error recovery?
Recovery is the half of reliability that runs aftera failure has already started to happen. The model has already returned a bad call. The agent has already drifted. Recovery is what determines whether the next thing on the trace is “the agent caught it and continued” or “the agent shipped it.”
In classical software you have try/catchand the failure surface is small. In agents the failure surface is the entire space of things a model can output. Recovery has to be structured around the actual failure shapes, not a generic error type.
The three recovery strategies
Every failure on an agent trace reduces to one of three responses:
- Retry. The failure was transient - a network blip, a rate limit, a flaky upstream - and the next attempt is likely to succeed unchanged. This is the classical playbook and the one most teams overuse.
- Repair. The failure was cognitive - wrong file, wrong function, wrong assumption - and retrying will reproduce it. Repair means returning structured feedback to the agent about what went wrong (and what does exist) so the next attempt is meaningfully different.
- Block. The failure would be irreversible -
git push --force,DROP TABLE,terraform destroy- and no retry or repair is safe. Block, log everything, page a human.
The whole game is picking the right strategy for the failure shape in front of you.
Why agents aren't servers
Server-side error recovery assumes the input is roughly bounded (HTTP request shapes, RPC schemas) and the failure modes are roughly enumerable (timeouts, 5xx, validation errors). That stack was built for a world that doesn't exist anymore. Agents fail in ways the server-era playbook never modeled - they need durability that survives the model itself.
Five failure modes and the right strategy
The mapping we use across all 39 built-in policies:
- bad llm tool call → repair. Block the invocation, surface the directory listing or the function signature the model should have seen, let it retry with truth.
- hallucination → repair. Flag the specific claim, ask the agent to recheck, refuse to continue until it does.
- context drift → repair. Inject a short reminder of the original goal back into the prompt, downweight the off-goal context.
- runaway loop → repair.Detect the repetition, summarize what's been tried, restart from a clean state with the summary as context.
- destructive action → block. Halt the call, log the attempt, surface the result, page if configured.
- transient model/network error → retry. With backoff. This is the only category where retry is the right answer.
Implementation: hooks, local process, 0ms
failproof ai implements all of the above by subscribing to the hooks the harness already exposes. Every supported harness - claude code, codex, gemini cli, github copilot, picode, opencode - calls into failproof on pre-tool, post-tool, and stop events. failproof runs the trace through the policies in a separate local process. The agent loop never waits on us, so the integration adds zero latency to the request path. No data leaves the machine.
Get started
One npm install, three commands, error recovery is on.