What is agent session replay?
A single agent run is a sequence of decisions: the model thinks, calls a tool, reads the result, thinks again, calls another tool, and eventually stops. Each of those moves emits an event - a tool call, a model request, a hook fire, an error. Agent session replay takes the events from one run and reassembles them into the structure they actually had: a trace, rendered as a rail of spans on a waterfall timeline.
Instead of grepping a log file for the run id and stitching the order back together in your head, you open the session and the whole run is already laid out front to back. You step through it span by span. You expand any span to see the model call or the tool call underneath it - the exact request that went out and the exact payload that came back. Replay is the difference between knowing that a run failed and seeing where it went wrong, in the agent's own order of operations.
agenteye ingests every run in real time, so a session is replayable the moment it finishes - and watchable on the live event stream while it's still going. Replay is the part you reach for after the fact, when a run did something you didn't expect and you need to reconstruct the why.
Why you can't debug an agent from logs
Logs were designed for a world where a request comes in, a handful of functions run, and a response goes out. An agent run breaks every assumption that view rests on. A single run can fire hundreds of spans. The model is non-deterministic, so the same prompt produces a different path on the next attempt. And the interesting failures aren't crashes - the agent doesn't throw, it just quietly does the wrong thing for twelve turns.
Three problems make raw logs a dead end for agent debugging:
- interleaving. If more than one run is in flight, their log lines are shuffled together in one stream. Rebuilding a single run means filtering by id and praying the timestamps are ordered the way the agent actually executed.
- no causality. A log line tells you something happened. It doesn't tell you that this tool call was the direct consequence of that model response, or that this error is what sent the agent into a retry loop three spans later.
- truncated payloads. The thing you need is almost always the full request and the full response - the entire tool arguments, the complete model output. Logs truncate exactly the payload that would have told you why the agent decided what it decided.
Session replay fixes all three at once because it isn't reading lines - it's reading the trace. One run, in order, with causality intact and nothing truncated. You scrub the timeline the way you'd scrub a video, and the run plays back as the sequence of decisions it really was.
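To make the difference between lines and a trace concrete, here is a minimal sketch of the reassembly step: interleaved events from two runs are grouped by run and linked to the span that caused them. The event fields (run_id, span_id, parent_id, kind, start) are assumptions made for illustration, not agenteye's actual event schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    kind: str
    start: float
    children: List["Span"] = field(default_factory=list)


def build_traces(events):
    """Group flat events by run, then link each span to its parent."""
    by_run = defaultdict(dict)
    for ev in events:
        by_run[ev["run_id"]][ev["span_id"]] = Span(
            ev["span_id"], ev.get("parent_id"), ev["kind"], ev["start"]
        )

    traces = {}
    for run_id, spans in by_run.items():
        roots = []
        # Sort by start time so children end up in execution order,
        # regardless of the order the events arrived in.
        for span in sorted(spans.values(), key=lambda s: s.start):
            parent = spans.get(span.parent_id)
            if parent is None:
                roots.append(span)
            else:
                parent.children.append(span)
        traces[run_id] = roots
    return traces


# Hypothetical events from two runs, shuffled together as they would be in a log.
EVENTS = [
    {"run_id": "run-a", "span_id": "a1", "parent_id": None, "kind": "model_request", "start": 0.0},
    {"run_id": "run-b", "span_id": "b1", "parent_id": None, "kind": "model_request", "start": 0.1},
    {"run_id": "run-a", "span_id": "a2", "parent_id": "a1", "kind": "tool_call", "start": 0.4},
    {"run_id": "run-b", "span_id": "b2", "parent_id": "b1", "kind": "error", "start": 0.5},
    {"run_id": "run-a", "span_id": "a3", "parent_id": "a1", "kind": "tool_call", "start": 0.2},
]

TRACES = build_traces(EVENTS)
```

The parent link is what carries causality: a3 arrived last in the stream, but it sorts ahead of a2 under a1 because the tree is ordered by when each span started, not by when its log line landed.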
How to replay a session in agenteye
The agenteye dashboard runs locally at localhost:7771. Once your harness is emitting events, replaying a run is four steps.
1. Open the session. Find the run you care about - from the live event stream, from a dashboard, from an alert, or by running a query over your sessions - and open it. The session view loads the full trace for that single run.
2. Read the trace rail. The run renders as a waterfall: each span is a bar on the timeline, nested under the span that caused it, sized by how long it took. You read top to bottom and you can already see the shape of the run - the long flat stretch where the agent was stuck, the burst of identical calls, the span where the timeline simply stops.
3. Expand a span. Click any span and it opens to its contents. A model-request span shows the prompt that went in and the completion that came back. A tool-call span shows the arguments the model passed and the result the tool returned. A hook-fire span shows which policy ran and what it decided. This is where you stop guessing: the raw payload is right there.
4. Inspect the payload. Read the actual request and response. The file path the agent invented. The arguments that were subtly wrong. The tool result the model then misread. The denial a policy returned. The payload is the ground truth of what the agent saw and what it did with it - and it's the thing you carry into a fix.
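If it helps to picture the trace rail from step 2, the sketch below draws a waterfall as plain text: one line per span, indented by nesting depth, with a bar placed at its start time and sized by its duration. The span fields and the render_waterfall helper are hypothetical and only illustrate the layout idea, not how the agenteye dashboard renders it.

```python
def render_waterfall(spans, width=40):
    """Return one text line per span: label indented by depth, bar scaled to the run."""
    t0 = min(s["start"] for s in spans)
    t1 = max(s["end"] for s in spans)
    # Characters per second; zero if the whole run has no measurable duration.
    scale = width / (t1 - t0) if t1 > t0 else 0.0

    lines = []
    for s in sorted(spans, key=lambda s: s["start"]):
        offset = int((s["start"] - t0) * scale)
        # Every span gets at least one character so instant spans stay visible.
        length = max(1, int((s["end"] - s["start"]) * scale))
        label = "  " * s["depth"] + s["kind"]
        lines.append(f"{label:<24}|{' ' * offset}{'#' * length}")
    return lines


# Hypothetical spans: a quick model call, a tool call under it, then a long stall.
SPANS = [
    {"kind": "model_request", "depth": 0, "start": 0.0, "end": 2.0},
    {"kind": "tool_call", "depth": 1, "start": 2.0, "end": 3.0},
    {"kind": "model_request", "depth": 0, "start": 3.0, "end": 10.0},
]

for line in render_waterfall(SPANS, width=20):
    print(line)
```

Even in this crude form the third bar dominates the output, which is the "long flat stretch" you would want to expand first.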
Prefer to stay in your editor? The agenteye CLI and MCP pull the same data into a local Claude session - agenteye pull --since 7d hands recent runs to the model so you can ask about a trace without leaving the terminal.
Spotting loops, drift and blocked actions
Replay isn't just for reading a run end to end - it's for recognizing failure shapes on sight. Three of them show up constantly, and each has a signature on the trace rail.
- loops. The same span repeats: identical tool calls back to back, or a model request and an error alternating forever. On the waterfall a loop is unmistakable - a run of nearly identical bars marching down the timeline. Expand two of them and the payloads are the same, which confirms the agent is retrying the exact same failing move rather than trying a different one.
- drift. The agent wanders off the original goal, usually deep into a long plan. In replay you watch the tool calls gradually stop being about the task you asked for. The early spans touch the right files; the later ones touch something unrelated, and the model output between them shows the moment the objective slipped.
- blocked actions. When the failproofai policy layer denies a move - a destructive shell command, a force push, a write to a path it shouldn't touch - that denial is its own span in the trace. Replay shows you exactly when the agent reached for the dangerous action, which policy caught it, and what the agent did next once the action came back denied.
Once you've named the failure mode on the rail, the fix usually names itself: a loop wants a stop condition, drift wants a goal reminder, a repeatedly-attempted dangerous action wants a policy.
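The loop signature is simple enough to check mechanically. As a rough sketch, assuming each span is a dict with a kind and a JSON-serializable payload (hypothetical field names), you can fingerprint every span and flag runs of identical consecutive fingerprints:

```python
import json
from itertools import groupby


def fingerprint(span):
    """Stable key for a span: its kind plus its payload with sorted keys."""
    return span["kind"], json.dumps(span["payload"], sort_keys=True)


def find_loops(spans, min_repeats=3):
    """Return (start_index, repeat_count, kind) for each run of identical consecutive spans."""
    loops = []
    index = 0
    for key, group in groupby(spans, key=fingerprint):
        count = len(list(group))
        if count >= min_repeats:
            loops.append((index, count, key[0]))
        index += count
    return loops


# Hypothetical run: the agent retries the same failing test command four times.
RUN = [
    {"kind": "model_request", "payload": {"prompt": "fix the test"}},
    {"kind": "tool_call", "payload": {"cmd": "pytest -x"}},
    {"kind": "tool_call", "payload": {"cmd": "pytest -x"}},
    {"kind": "tool_call", "payload": {"cmd": "pytest -x"}},
    {"kind": "tool_call", "payload": {"cmd": "pytest -x"}},
    {"kind": "model_request", "payload": {"prompt": "summarize"}},
]

LOOPS = find_loops(RUN)
```

This only catches back-to-back repeats. The alternating case, a model request and an error trading places, would need the same comparison over a sliding window of two or more spans rather than over single spans.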
Ask the agent: why did this run fail?
Reading a trace by hand is fast, but you don't always have to. agenteye ships a built-in agent that takes a plain-English question - “why did this run fail?” - and does the investigation for you. It writes the query to find the run, reads the trace span-by-span the way you would, and names the failure mode in words: the agent looped on a failing test command, it drifted off the migration task into refactoring, a policy denied an attempted force push on span 41.
Under the hood that's the same toolkit you have manually - the event store, the SQL and visual query builder over events, sessions, and evals, and the trace itself. The built-in agent just drives them for you and hands back a verdict you can act on. When you want to see the evidence, the replay is one click away, already scrolled to the span the agent flagged.
From replay to a policy
Replay is the find half of the loop; failproof ai's point is to close it. Once a session has shown you a failure that will recur - the agent reaching for rm -rf, a tool call that keeps looping, a write into a path it has no business touching - the answer isn't to remember to watch for it next time. It's a policy.
The failproofai CLI ships 39 built-in policies that deny or warn on bad actions at the harness hook layer, with deep audits, and it runs entirely locally - its dashboard is at localhost:8020. The denial you just watched a policy make in replay is the same mechanism: the harness fires a hook before the action runs, a policy inspects it, and the bad move never happens. agenteye shows you which failures are worth a policy; the policy layer makes sure they can't happen twice. Find, then fix.
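To show the shape of that mechanism, here is a small sketch of a pre-action check: a hook hands the proposed shell command to a list of policies, and the first one that matches denies it. Everything here (Decision, POLICIES, check_command, the policy names) is hypothetical and written for illustration; it is not the failproofai API or one of its built-in policies, and real rules would need to cover far more command variants.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    allowed: bool
    policy: Optional[str] = None
    reason: str = ""


def recursive_force_delete(argv):
    """Match rm with both a recursive and a force short flag, e.g. -rf or -r -f."""
    if not argv or argv[0] != "rm":
        return False
    flags = "".join(
        a[1:] for a in argv[1:] if a.startswith("-") and not a.startswith("--")
    )
    return "r" in flags.lower() and "f" in flags


def force_push(argv):
    """Match git push with --force or -f anywhere after the subcommand."""
    return argv[:2] == ["git", "push"] and any(a in ("--force", "-f") for a in argv[2:])


# Hypothetical policy table: (name, predicate over the tokenized command).
POLICIES = [
    ("no-recursive-force-delete", recursive_force_delete),
    ("no-force-push", force_push),
]


def check_command(command):
    """Run before the action executes; deny on the first policy that matches."""
    argv = shlex.split(command)
    for name, matches in POLICIES:
        if matches(argv):
            return Decision(False, name, f"denied by {name}: {command}")
    return Decision(True)
```

A harness would call check_command before executing the tool call and record the Decision as its own span, so a denial shows up in replay next to the action that triggered it.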
Get started
failproof ai is free and open-source. Install the CLI, point your harness at it, and the agenteye dashboard runs at localhost:7771 with the policy dashboard at localhost:8020. Full docs live at docs.befailproof.ai.