Dhruv Atreja — dhruv@theaiagentco.com
As AI agents transition from demos to production, debugging them becomes a critical bottleneck. Traditional observability tools—built for deterministic software—fail to capture the nuanced failures of LLM-based systems: context drift, infinite loops, hallucinated tool calls, and silent degradation. We present Pathfinder, a self-improving agent trace analyzer that achieves 87.2% detection accuracy across 50 deficiency types by treating trace analysis as code generation rather than retrieval.
Pathfinder writes and executes SQL queries, bash pipelines, and Python scripts to analyze traces—enabling precise, compositional analysis that embedding-based approaches cannot express. We train Pathfinder via adversarial self-play: the model alternates between injecting realistic deficiencies (grounded in real bugs from CAMEL, SWE-agent, Open Deep Research, and Qwen-Agent) and detecting them, creating an automatic curriculum without human annotation.
SQL, bash, and Python queries enable precise, compositional analysis that embedding-based methods cannot express. A schema-agnostic design scales from single traces to patterns across thousands of runs.
50 failure types derived from real production bugs in major agent frameworks (CAMEL, SWE-agent, Open Deep Research, Qwen-Agent), covering parsing errors, context mismanagement, async race conditions, and architectural anti-patterns.
A single-model training regime where the model alternates between injecting deficiencies and detecting them, generating unlimited training signal without human annotation.
Unlike traditional approaches that embed traces and retrieve similar examples, Pathfinder generates and executes code to analyze traces directly.
-- Find traces where context was truncated mid-conversation
SELECT trace_id, step_number,
       LENGTH(messages) AS msg_length,
       token_count
FROM agent_steps
WHERE token_count > 0.9 * context_limit
  AND step_type = 'llm_call'
  AND EXISTS (
    SELECT 1 FROM agent_steps s2
    WHERE s2.trace_id = agent_steps.trace_id
      AND s2.step_number > agent_steps.step_number
      AND s2.messages NOT LIKE '%' ||
          SUBSTR(agent_steps.messages, 1, 100) || '%'
  );

This SQL query identifies context-truncation issues by finding LLM calls near the token limit where subsequent messages lose any reference to earlier content, a pattern that would require complex semantic reasoning with an embedding-based approach.
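A minimal sketch of this generate-and-execute loop, assuming a SQLite trace store with a simplified `agent_steps` schema (the table layout and toy data below are illustrative, not Pathfinder's actual schema):

```python
import sqlite3

# Illustrative schema: a simplified trace store, not Pathfinder's real one.
SCHEMA = """
CREATE TABLE agent_steps (
    trace_id      TEXT,
    step_number   INTEGER,
    step_type     TEXT,
    messages      TEXT,
    token_count   INTEGER,
    context_limit INTEGER
);
"""

def run_generated_query(db: sqlite3.Connection, sql: str) -> list:
    """Execute an analyzer-generated SQL query and return flagged rows."""
    return db.execute(sql).fetchall()

# Toy data: step 3 sits near the context limit, and step 4 no longer
# contains step 3's opening content -> likely mid-conversation truncation.
db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO agent_steps VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("t1", 3, "llm_call", "SYSTEM: plan the refactor step by step", 950, 1000),
        ("t1", 4, "llm_call", "follow-up with earlier context dropped", 400, 1000),
    ],
)

TRUNCATION_QUERY = """
SELECT trace_id, step_number
FROM agent_steps
WHERE token_count > 0.9 * context_limit
  AND step_type = 'llm_call'
  AND EXISTS (
    SELECT 1 FROM agent_steps s2
    WHERE s2.trace_id = agent_steps.trace_id
      AND s2.step_number > agent_steps.step_number
      AND s2.messages NOT LIKE '%' ||
          SUBSTR(agent_steps.messages, 1, 100) || '%'
  );
"""

flagged = run_generated_query(db, TRUNCATION_QUERY)
print(flagged)  # step 3 is flagged: near the limit, and its content vanishes later
```

Because the analysis is ordinary SQL over a concrete store, the same query runs unchanged over one trace or thousands.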
Our 50-type taxonomy covers failures extracted from real PRs in production agent frameworks, organized into 8 categories with varying detection difficulty.
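As a sketch, such a taxonomy reduces to a category-to-types mapping. The four categories below are the failure kinds named earlier; the individual type names are hypothetical examples, not Pathfinder's actual 50-type list:

```python
# Illustrative fragment of a deficiency taxonomy. Category names mirror the
# failure kinds mentioned in the text; type names are hypothetical examples.
TAXONOMY = {
    "parsing_errors": ["malformed_tool_json", "truncated_code_fence"],
    "context_mismanagement": ["mid_conversation_truncation", "stale_memory_reuse"],
    "async_race_conditions": ["unawaited_tool_future", "interleaved_step_writes"],
    "architectural_anti_patterns": ["unbounded_retry_loop", "god_prompt"],
}

def category_of(deficiency_type: str):
    """Map a detected deficiency type back to its taxonomy category."""
    for category, types in TAXONOMY.items():
        if deficiency_type in types:
            return category
    return None

print(category_of("unbounded_retry_loop"))  # -> architectural_anti_patterns
```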
Inspired by Self-Play SWE-RL, we train a single model that alternates between two roles, creating an automatic curriculum without human annotation.
The injector role modifies working agent codebases to introduce realistic deficiencies grounded in actual bug-fix PRs from production frameworks, using template-based injection derived from real bug patterns.
The detector role analyzes the resulting execution traces using SQL, bash, and Python to identify and localize the injected deficiencies, and is rewarded for correct detection and precise localization.
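The two-role alternation can be sketched as a toy training loop. Everything here, the injector and detector stubs, the deficiency templates, and the reward shaping, is a simplified illustration of the scheme described above, not the actual training code:

```python
import random

# Toy deficiency templates standing in for patterns mined from real bug-fix PRs.
DEFICIENCY_TEMPLATES = ["drop_context_window", "swallow_tool_error", "skip_retry_backoff"]

def injector(codebase: dict, rng: random.Random):
    """Injector role: pick a template and corrupt one file of a working codebase."""
    deficiency = rng.choice(DEFICIENCY_TEMPLATES)
    target = rng.choice(sorted(codebase))
    corrupted = {**codebase, target: codebase[target] + f"  # BUG:{deficiency}"}
    return corrupted, deficiency, target

def detector(corrupted: dict):
    """Detector role: analyze the (stand-in) artifacts and localize the deficiency."""
    for path, source in corrupted.items():
        if "# BUG:" in source:  # stand-in for SQL/bash/Python trace analysis
            return source.rsplit("# BUG:", 1)[1], path
    return None

def reward(guess, truth) -> float:
    """+1 for the right deficiency type, +1 more for the right location."""
    if guess is None:
        return 0.0
    kind, path = guess
    true_kind, true_path = truth
    return float(kind == true_kind) + float(path == true_path)

rng = random.Random(0)
codebase = {"agent/loop.py": "def step(): ...", "agent/tools.py": "def call(): ..."}
total = 0.0
for episode in range(10):
    corrupted, true_kind, true_path = injector(codebase, rng)
    total += reward(detector(corrupted), (true_kind, true_path))
print(total)  # this toy detector is perfect, so it earns 2.0 per episode
```

Because the injector always knows ground truth, every episode yields a labeled detection example for free, which is what makes the curriculum annotation-free.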
Pathfinder powers the detection engine behind The AI Agent Co's monitoring platform. Start finding issues in your agent traces today.