Research Paper · January 2026

Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play

Abstract

As AI agents transition from demos to production, debugging them becomes a critical bottleneck. Traditional observability tools—built for deterministic software—fail to capture the nuanced failures of LLM-based systems: context drift, infinite loops, hallucinated tool calls, and silent degradation. We present Pathfinder, a self-improving agent trace analyzer that achieves 87.2% detection accuracy across 50 deficiency types by treating trace analysis as code generation rather than retrieval.

Pathfinder writes and executes SQL queries, bash pipelines, and Python scripts to analyze traces—enabling precise, compositional analysis that embedding-based approaches cannot express. We train Pathfinder via adversarial self-play: the model alternates between injecting realistic deficiencies (grounded in real bugs from CAMEL, SWE-agent, Open Deep Research, and Qwen-Agent) and detecting them, creating an automatic curriculum without human annotation.

Key Results

Detection Accuracy: 87.2% (across 50 deficiency types)

vs RAG Baseline: 35.4% (improvement over embedding-based retrieval)

Failure Types: 50 (comprehensive taxonomy from real bugs)

Cross-Agent Drop: 5.3% (strong generalization to new frameworks)

Key Contributions

Code Execution over Retrieval

SQL, bash, and Python queries enable precise, compositional analysis that embedding-based methods cannot express. Schema-agnostic approach scales from single traces to patterns across thousands of runs.

Grounded Deficiency Taxonomy

50 failure types derived from real production bugs in major agent frameworks (CAMEL, SWE-agent, Open Deep Research, Qwen-Agent), covering parsing errors, context mismanagement, async race conditions, and architectural anti-patterns.

Self-Play Reinforcement Learning

A single-model training regime where the model alternates between injecting deficiencies and detecting them, generating unlimited training signal without human annotation.
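To make the first contribution concrete, here is a minimal sketch of the kind of script a code-generating analyzer might write to look for one pattern across many runs: a tool called repeatedly with identical arguments, a common signature of an agent stuck in a loop. The trace layout (a list of dicts with `tool` and `args` keys), the function name `find_repeated_calls`, and the `min_repeats` threshold are all hypothetical and chosen for illustration; they are not Pathfinder's actual schema or output.

```python
from itertools import groupby


def find_repeated_calls(steps, min_repeats=3):
    """Report runs of identical consecutive tool calls (hypothetical trace format)."""
    findings = []
    index = 0
    # groupby collapses consecutive steps that share the same (tool, args) key
    for key, group in groupby(steps, key=lambda s: (s["tool"], s["args"])):
        run_length = len(list(group))
        if run_length >= min_repeats:
            findings.append(
                {"tool": key[0], "args": key[1], "start": index, "repeats": run_length}
            )
        index += run_length
    return findings


# Hypothetical traces keyed by run id (illustrative data only)
traces = {
    "run-1": [
        {"tool": "search", "args": "q=weather"},
        {"tool": "search", "args": "q=weather"},
        {"tool": "search", "args": "q=weather"},
        {"tool": "search", "args": "q=weather"},
        {"tool": "answer", "args": ""},
    ],
    "run-2": [
        {"tool": "search", "args": "q=news"},
        {"tool": "fetch", "args": "url=a"},
        {"tool": "answer", "args": ""},
    ],
}

# The same check composes across any number of runs
report = {run_id: find_repeated_calls(steps) for run_id, steps in traces.items()}
looping_runs = [run_id for run_id, found in report.items() if found]
print(looping_runs)
```

Because the check is ordinary code, tightening it (for example, requiring the repeated call to return the same result each time) is a one-line change rather than a new retrieval index.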

How It Works

Unlike traditional approaches that embed traces and retrieve similar examples, Pathfinder generates and executes code to analyze traces directly.

pathfinder_query.sql

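-- Find traces where context was truncated mid-conversation:
-- an LLM call near the token limit, followed by a later step whose
-- messages no longer contain the first 100 characters of that call's messages.
SELECT s1.trace_id, s1.step_number,
       LENGTH(s1.messages) AS msg_length,
       s1.token_count
FROM agent_steps AS s1
WHERE s1.token_count > 0.9 * s1.context_limit
  AND s1.step_type = 'llm_call'
  AND EXISTS (
    SELECT 1 FROM agent_steps AS s2
    WHERE s2.trace_id = s1.trace_id
      AND s2.step_number > s1.step_number
      -- Note: '%' and '_' inside messages act as LIKE wildcards here
      AND s2.messages NOT LIKE '%' ||
          SUBSTR(s1.messages, 1, 100) || '%'
  );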

This SQL query flags possible context truncation: it finds LLM calls within 10% of the token limit where a later step's messages no longer contain the opening of the earlier messages. Expressing the same pattern with embedding-based approaches would require complex semantic reasoning.
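As a rough illustration of how such a query might be executed, the sketch below runs an equivalent query against an in-memory SQLite database. The `agent_steps` table layout (including a per-row `context_limit` column) is inferred from the column names in the query, and the sample rows are invented for the demonstration; neither is taken from Pathfinder's real trace schema.

```python
import sqlite3

# Equivalent to the query above, trimmed to the identifying columns
QUERY = """
SELECT s1.trace_id, s1.step_number
FROM agent_steps AS s1
WHERE s1.token_count > 0.9 * s1.context_limit
  AND s1.step_type = 'llm_call'
  AND EXISTS (
    SELECT 1 FROM agent_steps AS s2
    WHERE s2.trace_id = s1.trace_id
      AND s2.step_number > s1.step_number
      AND s2.messages NOT LIKE '%' || SUBSTR(s1.messages, 1, 100) || '%'
  )
ORDER BY s1.trace_id, s1.step_number;
"""

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per agent step
conn.execute(
    """CREATE TABLE agent_steps (
        trace_id TEXT, step_number INTEGER, step_type TEXT,
        messages TEXT, token_count INTEGER, context_limit INTEGER)"""
)

early = "user: book a flight to Oslo"
rows = [
    # t1: near the limit, and the next step has lost the earlier content
    ("t1", 1, "llm_call", early, 950, 1000),
    ("t1", 2, "llm_call", "assistant: which city?", 400, 1000),
    # t2: near the limit, but the earlier content is still present later
    ("t2", 1, "llm_call", early, 950, 1000),
    ("t2", 2, "llm_call", early + " | assistant: booked", 980, 1000),
    # t3: content is lost, but the call was nowhere near the limit
    ("t3", 1, "llm_call", early, 200, 1000),
    ("t3", 2, "llm_call", "assistant: which city?", 250, 1000),
]
conn.executemany("INSERT INTO agent_steps VALUES (?, ?, ?, ?, ?, ?)", rows)

flagged = conn.execute(QUERY).fetchall()
print(flagged)  # [('t1', 1)]
```

Only trace `t1` satisfies both conditions at once, which is the compositional part: each condition alone (near the limit, or content missing later) also matches `t2` or `t3`.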

Deficiency Taxonomy

Our 50-type taxonomy covers failures extracted from real PRs in production agent frameworks, organized into 8 categories with varying detection difficulty.

Parsing & Encoding: 5 types, 94.3% accuracy

Streaming & Response: 4 types, 91.2% accuracy

Tool Schema Issues: 5 types, 89.7% accuracy

Configuration: 4 types, 88.9% accuracy

Architecture & Control: 10 types, 82.5% accuracy

Context & Token Mgmt: 10 types, 78.4% accuracy

Prompt & Instruction: 7 types, 76.2% accuracy

Async & Concurrency: 5 types, 72.1% accuracy
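The per-category figures can be aggregated in more than one way, and the text does not say which aggregation produces the headline number. The sketch below checks that the category counts sum to 50 and computes two simple aggregates: a mean weighted by number of types and an unweighted mean over categories. Both weightings are assumptions made here for illustration. They come out near 82.9% and 84.2%, so the 87.2% overall figure presumably weights results differently (for example by the number of evaluated traces per type, which is not given here).

```python
# (category, number of types, accuracy in percent), copied from the list above
categories = [
    ("Parsing & Encoding", 5, 94.3),
    ("Streaming & Response", 4, 91.2),
    ("Tool Schema Issues", 5, 89.7),
    ("Configuration", 4, 88.9),
    ("Architecture & Control", 10, 82.5),
    ("Context & Token Mgmt", 10, 78.4),
    ("Prompt & Instruction", 7, 76.2),
    ("Async & Concurrency", 5, 72.1),
]

total_types = sum(count for _, count, _ in categories)

# Assumption: each deficiency type contributes equally to the aggregate
type_weighted_mean = sum(count * acc for _, count, acc in categories) / total_types

# Assumption: each category contributes equally, regardless of size
category_mean = sum(acc for _, _, acc in categories) / len(categories)

print(total_types)
print(round(type_weighted_mean, 1))
print(round(category_mean, 1))
```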

Self-Play Training

Inspired by Self-Play SWE-RL, we train a single model that alternates between two roles, creating an automatic curriculum without human annotation.

Injector Role

Modifies working agent codebases to introduce realistic deficiencies—grounded in actual bug-fix PRs from production frameworks. Uses template-based injection from real bug patterns.

Detector Role

Analyzes execution traces using SQL, bash, and Python to identify and localize injected deficiencies. Rewarded for correct detection and precise localization.
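The detector's reward is described only as favoring correct detection and precise localization. One way such a signal could be shaped is sketched below: partial credit for naming the right deficiency type, plus the overlap between the predicted and injected line ranges. The `Injection` and `Detection` records, the `detector_reward` function, the equal weighting, and the overlap measure (intersection over union) are all hypothetical choices for illustration, not the reward actually used to train Pathfinder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    """What the injector role changed (hypothetical record)."""
    deficiency_type: str
    file: str
    lines: frozenset


@dataclass(frozen=True)
class Detection:
    """What the detector role reported (hypothetical record)."""
    deficiency_type: str
    file: str
    lines: frozenset


def detector_reward(truth, guess, type_weight=0.5):
    """Assumed reward: type match plus localization overlap, in [0, 1]."""
    type_score = 1.0 if guess.deficiency_type == truth.deficiency_type else 0.0
    union = truth.lines | guess.lines
    if guess.file != truth.file or not union:
        overlap = 0.0
    else:
        # Intersection over union of the flagged line numbers
        overlap = len(truth.lines & guess.lines) / len(union)
    return type_weight * type_score + (1.0 - type_weight) * overlap


truth = Injection("context_truncation", "agent/memory.py", frozenset(range(40, 44)))
exact = Detection("context_truncation", "agent/memory.py", frozenset(range(40, 44)))
partial = Detection("context_truncation", "agent/memory.py", frozenset(range(42, 46)))
wrong = Detection("infinite_loop", "agent/planner.py", frozenset({10}))

for guess in (exact, partial, wrong):
    print(round(detector_reward(truth, guess), 3))
```

Under this shaping, a detector that names the right type but points at the wrong lines still earns less than one that localizes exactly, which matches the stated goal of rewarding both detection and localization.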

Training Dynamics

Step 0: 52.1%, 12.3 types

Step 2.5K: 68.7%, 24.1 types

Step 5K: 78.2%, 35.8 types

Step 10K: 87.2%, 47.8 types

Experience Pathfinder in Production

Pathfinder powers the detection engine behind The AI Agent Co's monitoring platform. Start finding issues in your agent traces today.
