What is llm agent reliability?
Define it the way you would define reliability for any other production system: the fraction of runs that complete the requested task correctly, without a human stepping in. For agents this is uncomfortable because the floor is low. Internal data we see at failproof across coding agents puts unassisted end-to-end completion in the 30–60% range for non-trivial tasks, even on the strongest current models. The model is rarely the bottleneck.
The reliability gap nobody talks about
Demos are clean. Production isn't. The gap is everything that sits between “the model produced the right next thought” and “the agent did the right thing in the world” - the tool call layer.
- Wrong target.The model picks a file or function that doesn't exist. The harness happily attempts it. The error gets fed back, the model tries something equally wrong.
- Right call, wrong moment. The agent does the correct thing one step too early or one step too late - pushes a half-finished branch, deletes a file it still needs.
- Loops. The model gets stuck in a read-think-call cycle that never makes forward progress.
None of these are fixable with a longer prompt. They're runtime concerns.
Three things prompts can't fix
Prompts shape probability distributions. They do not impose constraints on what the harness actually executes. Three concrete failures that survive even excellent prompts:
- Hallucinated tool calls. The model invents
src/utils/dateHelper.tsbecause most projects have one. Yours doesn't. The harness still tries. - Drift on long plans.Ten steps in, the agent has forgotten the user said “don't touch the auth module.”
- Destructive shell.
rm -rf node_modulesin the wrong cwd,git push --forceover someone else's commit.
The 3U framework
We wrote up the model we use in detail - retry is not enough: the 3u framework for agentic reliability - but the short version: every reliable agent system has to Uncover the failure (detect it at the hook layer), Understand its category (loop vs. drift vs. destructive vs. hallucination), and Utilize the right mitigation for that category. Observability alone gives you Uncover. Most setups stop there. Reliability requires all three.
Hook-level enforcement
The way you implement Utilize in practice is through harness hooks. Every coding-agent harness exposes them: pre-tool-call, post-tool- call, pre-prompt, on-stop. Hooks are the enforcement layer. Hooks are to agents what pre-commit is to git - the place you actually catch bad behavior before it lands.
Coverage per harness
failproof ai supports first-class hook integration with:
- claude code (anthropic cli + sdk)
- openai codex (cli + agent runtime)
- gemini cli (google gemini)
- github copilot cli
- picode (pi-coding-agent)
- opencode
- cursor agent
Goose and deep agents are coming soon. The 39 built-in policies behave identically across harnesses, so reliability you measure on one rig transfers to another.
Get started
failproof ai is free and open-source. The cli, the policies, and the local dashboard ship in one npm package.