Research Paper, January 2026

Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play

Dhruv Atreja — dhruv@theaiagentco.com

Abstract

As AI agents transition from demos to production, debugging them becomes a critical bottleneck. Traditional observability tools—built for deterministic software—fail to capture the nuanced failures of LLM-based systems: context drift, infinite loops, hallucinated tool calls, and silent degradation. We present Pathfinder, a self-improving agent trace analyzer that achieves 87.2% detection accuracy across 50 deficiency types by treating trace analysis as code generation rather than retrieval.

Pathfinder writes and executes SQL queries, bash pipelines, and Python scripts to analyze traces—enabling precise, compositional analysis that embedding-based approaches cannot express. We train Pathfinder via adversarial self-play: the model alternates between injecting realistic deficiencies (grounded in real bugs from CAMEL, SWE-agent, Open Deep Research, and Qwen-Agent) and detecting them, creating an automatic curriculum without human annotation.

Key Results

87.2% Detection Accuracy: across 50 deficiency types
35.4% vs RAG Baseline: improvement over embedding-based retrieval
50 Failure Types: comprehensive taxonomy from real bugs
5.3% Cross-Agent Drop: strong generalization to new frameworks

Key Contributions

Code Execution over Retrieval

SQL, bash, and Python queries enable precise, compositional analysis that embedding-based methods cannot express. The approach is schema-agnostic and scales from single traces to patterns across thousands of runs.

Grounded Deficiency Taxonomy

50 failure types derived from real production bugs in major agent frameworks (CAMEL, SWE-agent, Open Deep Research, Qwen-Agent), covering parsing errors, context mismanagement, async race conditions, and architectural anti-patterns.

Self-Play Reinforcement Learning

A single-model training regime where the model alternates between injecting deficiencies and detecting them, generating unlimited training signal without human annotation.

How It Works

Unlike traditional approaches that embed traces and retrieve similar examples, Pathfinder generates and executes code to analyze traces directly.

pathfinder_query.sql
-- Find traces where context was truncated mid-conversation
SELECT trace_id, step_number,
       LENGTH(messages) as msg_length,
       token_count
FROM agent_steps
WHERE token_count > 0.9 * context_limit
  AND step_type = 'llm_call'
  AND EXISTS (
    SELECT 1 FROM agent_steps s2
    WHERE s2.trace_id = agent_steps.trace_id
      AND s2.step_number > agent_steps.step_number
      AND s2.messages NOT LIKE '%' ||
          SUBSTR(agent_steps.messages, 1, 100) || '%'
  );

This SQL query identifies context truncation issues by finding LLM calls near the token limit where subsequent messages lose reference to earlier content—a pattern that would require complex semantic reasoning with embedding-based approaches.
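To make the behavior of the query concrete, here is a minimal sketch that runs it against an in-memory SQLite database. Only the column names come from the query itself; the column types, the toy rows, and the helper name find_truncated_steps are assumptions made for illustration and are not Pathfinder's actual trace schema.

```python
import sqlite3

# Hypothetical schema: column names are taken from the query, the types and
# the toy rows below are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE agent_steps (
    trace_id      TEXT,
    step_number   INTEGER,
    step_type     TEXT,
    messages      TEXT,
    token_count   INTEGER,
    context_limit INTEGER
);
"""

# The query from the text, unchanged apart from keyword casing.
QUERY = """
SELECT trace_id, step_number,
       LENGTH(messages) AS msg_length,
       token_count
FROM agent_steps
WHERE token_count > 0.9 * context_limit
  AND step_type = 'llm_call'
  AND EXISTS (
    SELECT 1 FROM agent_steps s2
    WHERE s2.trace_id = agent_steps.trace_id
      AND s2.step_number > agent_steps.step_number
      AND s2.messages NOT LIKE '%' ||
          SUBSTR(agent_steps.messages, 1, 100) || '%'
  );
"""

ROWS = [
    # t1: step 1 is near the limit and step 2 no longer carries its prefix.
    ("t1", 1, "llm_call", "SYSTEM: plan the trip. USER: book a flight", 950, 1000),
    ("t1", 2, "llm_call", "USER: book a flight. ASSISTANT: searching", 400, 1000),
    # t2: step 1 is near the limit but step 2 still contains its prefix.
    ("t2", 1, "llm_call", "SYSTEM: plan the trip.", 950, 1000),
    ("t2", 2, "llm_call", "SYSTEM: plan the trip. USER: hello", 500, 1000),
]


def find_truncated_steps(rows):
    """Load toy steps into SQLite and return the rows the query flags."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany(
            "INSERT INTO agent_steps VALUES (?, ?, ?, ?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


flagged = find_truncated_steps(ROWS)
for trace_id, step_number, msg_length, token_count in flagged:
    print(trace_id, step_number, msg_length, token_count)
```

On this toy data only step 1 of trace t1 is flagged. One caveat of the substring check: any % or _ characters in the first 100 characters of a message are treated as LIKE wildcards, so real traces may need escaping.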

Deficiency Taxonomy

Our 50-type taxonomy covers failures extracted from real PRs in production agent frameworks, organized into 8 categories with varying detection difficulty.

Category                 Types   Accuracy
Parsing & Encoding       5       94.3%
Streaming & Response     4       91.2%
Tool Schema Issues       5       89.7%
Configuration            4       88.9%
Architecture & Control   10      82.5%
Context & Token Mgmt     10      78.4%
Prompt & Instruction     7       76.2%
Async & Concurrency      5       72.1%

Self-Play Training

Inspired by Self-Play SWE-RL, we train a single model that alternates between two roles, creating an automatic curriculum without human annotation.

Injector Role

Modifies working agent codebases to introduce realistic deficiencies grounded in actual bug-fix PRs from production frameworks, using injection templates derived from those real bug patterns.
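As a rough illustration of what template-based injection could look like, the sketch below applies a find-and-replace template to a small piece of working agent code. The InjectionTemplate class, the inject function, the deficiency name, and the sample code are all hypothetical; the source does not describe the template format Pathfinder actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionTemplate:
    """Hypothetical template: swap a correct snippet for a buggy one."""
    deficiency_type: str
    find: str
    replace: str


def inject(source: str, template: InjectionTemplate) -> str:
    """Return a copy of `source` with the deficiency injected once."""
    if template.find not in source:
        raise ValueError(
            f"template {template.deficiency_type!r} does not apply to this source"
        )
    return source.replace(template.find, template.replace, 1)


# Assumed example of working agent code that bounds the message history.
WORKING_AGENT = (
    "def build_prompt(history, max_messages=20):\n"
    "    recent = history[-max_messages:]\n"
    "    return recent\n"
)

# Assumed template that removes the history window, so context grows unbounded.
DROP_HISTORY_WINDOW = InjectionTemplate(
    deficiency_type="unbounded_context_growth",
    find="recent = history[-max_messages:]",
    replace="recent = history",
)

buggy_agent = inject(WORKING_AGENT, DROP_HISTORY_WINDOW)
print(buggy_agent)
```

Raising an error when the template does not match keeps the injector from emitting a trace that is labeled as deficient but is actually unchanged.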

Detector Role

Analyzes execution traces using SQL, bash, and Python to identify and localize injected deficiencies. Rewarded for correct detection and precise localization.
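One way to picture a reward that covers both correct detection and precise localization is sketched below. The data classes, the reward values, and the rule that localization only counts when the deficiency type is right are assumptions chosen for illustration; the source does not give Pathfinder's actual reward function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    """Ground truth recorded by the injector role (hypothetical fields)."""
    deficiency_type: str
    step_number: int


@dataclass(frozen=True)
class Detection:
    """Report produced by the detector role (hypothetical fields)."""
    deficiency_type: str
    step_number: int


def detector_reward(truth: Injection, guess: Detection,
                    localization_bonus: float = 0.5) -> float:
    """Assumed shaping: 1.0 for the right type, plus a bonus for the right step."""
    if guess.deficiency_type != truth.deficiency_type:
        return 0.0
    reward = 1.0
    if guess.step_number == truth.step_number:
        reward += localization_bonus
    return reward


truth = Injection("unbounded_context_growth", step_number=7)
print(detector_reward(truth, Detection("unbounded_context_growth", 7)))
print(detector_reward(truth, Detection("unbounded_context_growth", 3)))
print(detector_reward(truth, Detection("hallucinated_tool_call", 7)))
```

Under these assumptions a fully correct report earns 1.5, a correct type at the wrong step earns 1.0, and a wrong type earns nothing even if the step happens to match.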

Training Dynamics

Step 0: 52.1%, 12.3 types
Step 2.5K: 68.7%, 24.1 types
Step 5K: 78.2%, 35.8 types
Step 10K: 87.2%, 47.8 types

Experience Pathfinder in Production

Pathfinder powers the detection engine behind The AI Agent Co's monitoring platform.