Patterns, not anecdotes
The first time an agent does something strange, you open the trace. You read the tool calls top to bottom, find the one that went sideways, and you have your answer for that run. The problem is that one trace is an anecdote. The question that actually moves the needle is the aggregate one: how often does this happen, to which agent, after which step, and is it getting worse this week than last?
You can't scroll your way to that answer. Eyeballing traces one at a time is fine at ten runs and useless at ten thousand. To see a failure mode - not a failure instance - you need to treat your traces as data and ask questions of the whole population at once. That is what querying agent traces in SQL gives you: a way to count, group, rank and trend across every run instead of inspecting them individually.
agenteye is the observability half of failproof ai. It records everything your agents do and exposes it as queryable tables, so the jump from “this one run looks wrong” to “here are the 240 runs that share the same failure” takes a single query.
The shift is the same one that observability went through for ordinary services a decade ago. Nobody debugs a fleet by SSHing into one box and reading the log file; you query the logs across every box at once and let the aggregate tell you where to look. Agents are no different. A trace is a log of one run's reasoning, and one run's reasoning is rarely the story. The story is in the distribution - the tail of slow tool calls, the agent whose error rate doubled after a prompt change, the eval that quietly started failing on Tuesday. You only see distributions by querying.
The data: events, sessions, evals
agenteye organizes every trace into three tables - three streams you query the same way you'd query any database. There is no object graph to learn and no special query language; it is plain SQL over three flat shapes.
- events - one row per low-level action: every tool call, model request, hook fire, and error. This is the finest grain. Columns include the agent, the session it belongs to, the tool or model name, timing, status, and the payload. If you want to know what an agent actually did, you query events.
- sessions - one row per whole run. A session rolls up the events for a single agent invocation: duration, final status, token usage, and the policies that fired during the run. If you want to know how runs ended, you query sessions.
- evals - one row per eval score. When you run an eval against a session or a tool call, the score lands here, keyed back to the session and agent. If you want to know how good the output was, you query evals.
The three share keys - an agent and a session_id thread through all of them - so you join across streams the obvious way. “Which agents fail their evals most often?” is a join from evals to sessions. “What was the model doing right before this run errored?” is a filter on events scoped to one session_id.
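As a rough sketch of those two questions: the column name score, the 0.5 pass mark, and the session id literal below are illustrative assumptions, not confirmed parts of the schema, so check the real column names in the editor before relying on them.

```sql
-- Which agents fail their evals most often?
-- Assumed: evals has a numeric score column and a session_id key.
-- The 0.5 threshold is a placeholder for whatever counts as a failing score.
SELECT s.agent,
       COUNT(*) AS evals_run,
       SUM(CASE WHEN e.score < 0.5 THEN 1 ELSE 0 END) AS evals_failed
FROM evals AS e
JOIN sessions AS s ON s.session_id = e.session_id
GROUP BY s.agent
ORDER BY evals_failed DESC;

-- What was the model doing right before this run errored?
-- Replace the literal with a real session_id; newest events come first.
SELECT ts, type, tool, status
FROM events
WHERE session_id = 'YOUR_SESSION_ID'
ORDER BY ts DESC
LIMIT 20;
```

The first query reports both the number of evals run and the number failed, because an agent with many failures may simply be the one that gets evaluated most.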
The grain is the thing to keep straight. Three tables means three altitudes, and almost every question you ask has a natural one. Fleet health lives in sessions - how many runs finished, how long they took, how many tripped a policy. Behavior lives in events - the actual sequence of tool calls and model requests inside a run. Quality lives in evals. Pick the wrong table and the query gets awkward; pick the right one and it falls out in a line. A good instinct is to write the question in English first - “count of runs” versus “count of tool calls” - because the noun you land on usually names the table.
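To make the "noun names the table" instinct concrete, here is a minimal sketch of the two counts side by side. It uses only columns that appear elsewhere in this post (agent, started_at, type, ts) and assumes they behave as those examples suggest.

```sql
-- "Count of runs": the noun is runs, so the table is sessions.
SELECT agent, COUNT(*) AS runs
FROM sessions
WHERE started_at > now() - INTERVAL '7 days'
GROUP BY agent
ORDER BY runs DESC;

-- "Count of tool calls": the noun is tool calls, so the table is events.
SELECT agent, COUNT(*) AS tool_calls
FROM events
WHERE type = 'tool_call'
  AND ts > now() - INTERVAL '7 days'
GROUP BY agent
ORDER BY tool_calls DESC;
```

Counting runs from events would mean deduplicating on session_id first, which is the kind of awkwardness that signals the wrong table.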
Writing a query
Open the SQL editor in the dashboard at localhost:7771 and you have a blank query against the three tables. Start with the question every team asks first - where are my errors coming from? Group the error events by agent over the last week and rank them:
SELECT agent, COUNT(*) AS errors
FROM events
WHERE type = 'error'
AND ts > now() - INTERVAL '7 days'
GROUP BY agent
ORDER BY errors DESC
LIMIT 10;
That single result replaces an afternoon of scrolling. You now know which agent owns the problem before you open a single trace. From here you drill: add tool to the GROUP BY to see which tool call is failing, or filter to one agent and pivot on the error message to split a noisy bucket into distinct failure modes.
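The two drill-downs might look something like the sketch below. The tool column is the one named above. The error_message column and the agent name 'billing-agent' are hypothetical stand-ins, since the post does not say where the message text lives (it may sit inside the payload instead).

```sql
-- Drill 1: which tool call is failing, per agent.
SELECT agent, tool, COUNT(*) AS errors
FROM events
WHERE type = 'error'
  AND ts > now() - INTERVAL '7 days'
GROUP BY agent, tool
ORDER BY errors DESC
LIMIT 10;

-- Drill 2: one agent, split by error message.
-- error_message and 'billing-agent' are assumed names for illustration only.
SELECT error_message, COUNT(*) AS errors
FROM events
WHERE type = 'error'
  AND agent = 'billing-agent'
  AND ts > now() - INTERVAL '7 days'
GROUP BY error_message
ORDER BY errors DESC;
```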
Latency is the other question that only makes sense in aggregate. A single slow run tells you nothing; the p95 across an agent's tool calls tells you whether it's actually slow. The editor supports percentiles directly:
SELECT agent,
percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms
FROM events
WHERE type = 'tool_call'
AND ts > now() - INTERVAL '24 hours'
GROUP BY agent
ORDER BY p95_ms DESC;
And because the policies that fired are recorded on each session, you can ask the inverse of a failure question - not “what broke” but “what did failproof catch.” Listing every run that hit a given policy is a one-liner:
SELECT session_id, agent, started_at FROM sessions WHERE policies @> ARRAY['destructive-shell'] ORDER BY started_at DESC;
That stitches the policy layer back to the trace, so a block you enforced is one click from the run that triggered it.
None of these need warming up. There is no CREATE INDEX step, no materialized view to refresh, and no waiting for a nightly job to roll the data into a queryable shape. You write the SELECT and the answer comes back in milliseconds because the data is already stored in a form built for exactly this kind of grouping and ranking. That speed is what makes querying an interactive habit rather than a ticket you file and check tomorrow - you ask, you see, you refine, and the loop is fast enough that you keep going until the failure mode is named.
The visual builder
Not every question is worth typing SQL for, and not everyone on the team writes it fluently. agenteye ships a visual query builder alongside the editor: pick a table, add filters as clickable conditions, choose a group-by, choose an aggregate, and the result updates as you go. It is the same query engine underneath - the builder just assembles the SQL for you.
The two are not separate worlds. Anything you build visually you can drop into the editor to refine by hand, and any query you write by hand reads back against the same three tables. Most people start in the builder to find the shape of the question, then switch to SQL when they need a join or a percentile the builder doesn't expose. Both run over events, sessions and evals, and both return in milliseconds.
The builder is also the fastest way to explore a stream you don't know yet. Drop in events, add a filter on type, and watch the distinct tool names populate; you learn the shape of your own data by clicking through it rather than guessing column names. Once you've found a grouping that surfaces something interesting, the “view as SQL” step hands you a query you can parameterize, save, or paste into a code review. It lowers the floor without lowering the ceiling: a teammate who has never written a window function can still answer “which agent errors most this week,” and you can still reach for the full expressiveness of SQL when the question demands it.
Save it to a dashboard or an alert
A good query is rarely a one-off. The moment you find the one that answers “is the agent regressing,” you want to keep looking at it. agenteye lets you save any query and then do one of two things with it.
- pin it to a dashboard. Save the query, give it a chart, and it becomes a live tile. Your errors-by-agent and p95-latency queries stop being things you re-type and become a board you glance at. Dashboards refresh against the same streaming data, so the number you see is current, not a nightly snapshot.
- turn it into an alert. The same saved query can become a threshold. When “errors for agent X in the last hour” crosses a line you set, agenteye fires an alert. You go from querying for a pattern to being told the moment the pattern appears - the query you wrote once now watches the fleet for you.
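As an illustration of what sits behind each option, the sketch below shows one query shaped for a tile and one shaped for an alert. It assumes the engine accepts date_trunc and that an alert compares a single returned number against the limit you set. Neither detail is confirmed here, and 'agent-x' is a placeholder name.

```sql
-- Dashboard tile: errors per agent per day, suitable for a trend chart.
-- date_trunc is assumed to be available in the editor's SQL dialect.
SELECT date_trunc('day', ts) AS bucket,
       agent,
       COUNT(*) AS errors
FROM events
WHERE type = 'error'
  AND ts > now() - INTERVAL '7 days'
GROUP BY bucket, agent
ORDER BY bucket, agent;

-- Alert: one number to compare against a threshold.
-- 'agent-x' is a placeholder; the threshold itself would be set on the alert.
SELECT COUNT(*) AS errors_last_hour
FROM events
WHERE type = 'error'
  AND agent = 'agent-x'
  AND ts > now() - INTERVAL '1 hour';
```

Keeping the alert query down to a single row and a single column makes the threshold comparison unambiguous.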
This is the whole point of treating traces as data. A query you ran to investigate one incident becomes the dashboard tile that shows the trend and the alert that catches the next one. Pair it with session replay and a result row is one click from the full trace rail of the run behind it.
Let the agent write the SQL
You don't have to know the schema cold to get an answer. agenteye has a built-in agent that turns a plain-English question into a query. Type “errors by agent in the last 7 days, top 10” and it writes the SQL against events, runs it, and hands you the result and the query it used - so you can read it, learn the shape, and edit it.
It goes one step further than generation. Ask it about a specific run and it will read the trace and name the failure mode - telling you the run looped, drifted off-goal, or made a bad tool call, rather than just dumping rows. That is the bridge from raw query results to a diagnosis you can act on.
The two halves reinforce each other. The agent is good at the first hop - taking a vague question and producing a query plus a candidate explanation - and SQL is good at the second, where you need the exact count, the precise window, or a join the agent didn't reach for. In practice you bounce between them: the agent gets you to a query that's 90% right, you tighten the WHERE clause by hand, and you save the result. The plain-English entry point means you never sit in front of a blank editor wondering what columns exist, and the generated SQL means you never have to take the agent's word for it - the query is right there to check.
If you'd rather pull the data into your own agent, the agenteye CLI and an MCP server expose the same three streams to local Claude. You can ask Claude to query your traces directly in your editor, and the live event stream means you're always querying current data, not yesterday's export.
Get started
failproof ai is free and open-source. Install the CLI, point it at your harness, and agenteye starts recording queryable traces - the dashboard, SQL editor and visual builder run at localhost:7771.