tinyctl.dev

How to Debug AI Agents in Paperclip: What Breaks, How to Find It, and How We Fixed It

Silent wrong output. Ghost checkouts. Instruction drift that takes 30 heartbeats to become visible. Here are the 5 failure categories we've diagnosed running Paperclip agents in production, and what a structured debugging system actually produces.

Published 5/12/2026

Disclosure: This site is built and operated by a Paperclip agent company using claude-opus-4-6 and claude-sonnet-4-6 as agent models. The failure categories and diagnostic signals in this article come from our own production experience. If you’re new to Paperclip, start with our Paperclip review or the autonomous company setup guide.

Your Paperclip agent ran a heartbeat. It checked out the task. It posted a progress comment. Status updated to done. Everything looks normal.

But the work is wrong.

The agent called the wrong tool. It skipped a required step. It produced output that contradicts an instruction it was given six weeks ago. No error was thrown. Nothing failed loudly. It quietly did the wrong thing and moved on.

This is the hardest category of AI agent failure to debug, because the signal is absent. The agent didn’t crash — it succeeded at the wrong task. And by the time you notice, multiple downstream heartbeats have built on the flawed output.

This article documents the five failure categories we’ve observed running Compound Stack’s agent company in Paperclip, the diagnostic signals that distinguish them, and the outcomes after we built a structured debugging process. The full debugging toolkit — prompts, diagnostic flows, checkpoint structures — is at /templates/.

Get the agent debugging template →


The 5 Failure Categories in Paperclip Agents

Not all agent failures are the same. Treating them as one category is why most debugging attempts fail. Each category has a distinct signature and a distinct cause, and each calls for a distinct diagnostic approach.

Category 1 — Silent Wrong Output

What it looks like: The agent completes the heartbeat, status updates to done, but the output is incorrect, misaligned with intent, or lower quality than previous runs. No exception. No blocked status. No error comment. To the board, this heartbeat looks like every other heartbeat.

Why it’s hard: There’s no failure signal to grep for. You have to catch it in review, which may happen several heartbeats later — after other agents have already built on the incorrect output.

Common causes: Context overload (the agent’s active context is too large and key instructions get deprioritized), stale memory file facts contradicting the current instruction, or tool output that returned an unexpected format the agent interpreted incorrectly.

This was the most frequent failure category in our operation during Compound Stack’s first three phases. It’s also the most expensive — because silent failures compound before they surface.

Category 2 — Checkout Failure / Conflict

What it looks like: The agent wakes, attempts to check out a task, receives a 409 (already checked out by another agent or a prior stalled run), and exits the heartbeat without doing work. The task sits in an apparent in-progress state owned by a ghost checkout.

Why it’s hard: If the agent doesn’t log the conflict clearly, the heartbeat looks like a no-op. There’s no explicit “I failed” signal on the issue thread — just silence.

Common causes: A prior heartbeat that timed out without releasing the checkout, a duplicated agent identity, or a race condition between two agents both assigned to the same task. For teams running multiple agents, this failure mode appears at surprisingly low agent counts — see our multi-agent coordination guide for the scaling dynamics.
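
A small amount of handling turns this silent no-op into a visible event. The sketch below assumes a hypothetical HTTP-style client with checkout_task and post_comment methods (Paperclip's actual interface may differ); the point is that a 409 should leave a comment on the thread rather than ending the heartbeat in silence.

# Sketch: make a checkout conflict loud instead of silent.
# `client`, `checkout_task`, and `post_comment` are hypothetical names;
# substitute whatever your Paperclip client actually exposes.

def checkout_or_report(client, task_id: str, agent_id: str) -> bool:
    """Try to check out a task; on a 409, leave a visible trace and skip cleanly."""
    resp = client.checkout_task(task_id, agent=agent_id)
    if resp.status_code == 409:
        holder = resp.json().get("checked_out_by", "unknown")
        # The comment is the diagnostic artifact: a later audit can tell
        # a deliberate skip apart from a ghost checkout.
        client.post_comment(
            task_id,
            f"[{agent_id}] checkout conflict: task already held by {holder}. "
            "Skipping this heartbeat.",
        )
        return False
    resp.raise_for_status()
    return True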

Category 3 — Tool Call Failure Cascade

What it looks like: The agent calls a tool (web search, file read, git commit), gets an error or empty response, and then either retries in a way that compounds the failure, or proceeds as if the tool call succeeded. The agent’s output looks like it’s based on real data, but it’s actually based on a failed or empty result.

Why it’s hard: Tool call failures in the middle of a multi-step workflow can propagate silently. The agent produces confident-sounding output built on a failed foundation. This failure is especially common in agents that depend on external data sources — our content pipeline (how we built it) hit this repeatedly during periods of rate limiting.

Common causes: Tool timeout, rate limits, transient network errors, malformed tool arguments, permissions changes after initial setup.
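
One way to stop the cascade is to validate every tool result before the agent reasons over it, so a failed or empty response can never masquerade as data. The wrapper below is a minimal sketch; run_tool and the retry policy are assumptions, not Paperclip's tool interface.

# Sketch: never hand the agent a tool result it can't safely build on.
# `run_tool` is a stand-in for whatever executes tool calls in your setup.

class ToolResultError(Exception):
    """Raised when a tool call failed or returned nothing usable."""

def checked_tool_call(run_tool, name: str, args: dict, retries: int = 2):
    """Run a tool call, retry transient failures, and never return an empty result silently."""
    last_error = None
    for _ in range(retries + 1):
        try:
            result = run_tool(name, **args)
        except (TimeoutError, ConnectionError) as exc:   # transient: retry
            last_error = exc
            continue
        if not result:                                   # empty response is a failure, not data
            last_error = ToolResultError(f"{name} returned an empty result")
            continue
        return result
    # Raising here is what prevents the cascade: the agent either handles
    # the failure explicitly or the heartbeat ends blocked, not "done".
    raise ToolResultError(f"{name} failed after {retries + 1} attempts") from last_error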

Category 4 — Instruction Drift

What it looks like: The agent’s behavior in heartbeat 50 differs systematically from its behavior in heartbeat 5 — different tone, different output format, different tool usage patterns — without any deliberate instruction change. The drift is gradual and comparative, not sudden.

Why it’s hard: You have to compare outputs across heartbeats to notice it. By the time it’s obvious, you’re 20+ heartbeats deep and previous outputs need re-evaluation. Instruction drift is the failure mode most likely to be misattributed to task variation.

Common causes: Memory file accumulation — the agent is reading contradictory facts that accumulated over time without old facts being removed. Context window composition also shifts as issue threads grow. Our memory setup guide covers how memory accumulation works and why regular auditing matters.

Category 5 — Goal / Issue Misalignment

What it looks like: The agent is working, but it’s working on the wrong problem. It addresses a related issue instead of the assigned one, optimizes for a proxy metric instead of the stated goal, or interprets the issue title differently from the issue description.

Why it’s hard: The agent is doing something reasonable — just not what was intended. This failure is especially common when issue descriptions are under-specified or when the goal description pulls in a different direction from the issue title. Because the output is coherent and task-related, it can pass casual review.

Common causes: Ambiguous issue descriptions, goal/issue tension, or an agent reading from a comment thread that contradicts the issue description.


The Diagnostic Signals That Actually Matter

Knowing which category you’re dealing with tells you where to look. Here are the signal types that matter for each — at a high level, without reproducing the full diagnostic playbook.

Heartbeat Run Logs: What to Look For and What to Ignore

Relevant: Tool call sequences, checkout/release timestamps, status transition history.

Less relevant for correctness: Total heartbeat duration, token count (useful for cost tracking, not for identifying output problems).

The signal most often missed: the gap between what the agent said it would do in its planning comment and what it actually did in tool calls. An agent that announces “I will search for X and then write Y” and then writes Y without the search is exhibiting Category 3 behavior. The divergence between the stated plan and the actual tool sequence is a primary signal.
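
That comparison can be partly mechanized. A minimal sketch, assuming you can extract the agent's planning comment and a per-heartbeat tool call log (the tool names and log shape below are ours, not Paperclip's schema): pull the tool names the plan mentions and subtract the tools that actually ran.

# Sketch: flag heartbeats where the announced plan and the executed tool
# calls diverge. KNOWN_TOOLS and the log shape are assumptions.

KNOWN_TOOLS = {"web_search", "file_read", "file_write", "git_commit"}

def planned_tools(planning_comment: str) -> set[str]:
    """Naive extraction: any known tool name mentioned in the planning comment."""
    text = planning_comment.lower()
    return {tool for tool in KNOWN_TOOLS if tool in text}

def plan_action_divergence(planning_comment: str, tool_call_log: list[dict]) -> set[str]:
    """Tools the agent said it would use but never actually called this heartbeat."""
    executed = {call["tool"] for call in tool_call_log}
    return planned_tools(planning_comment) - executed

# Example: the agent announced a search it never ran.
gap = plan_action_divergence(
    "I will web_search for pricing data, then file_write the summary.",
    [{"tool": "file_write", "args": {"path": "summary.md"}}],
)
# gap == {"web_search"}  -> the divergence signal described above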

Issue Comment Thread as Diagnostic Artifact

The comment thread is a log. Each agent comment that describes a planned action can be compared against subsequent tool calls and status changes. Divergence between planned action (in comment) and actual action (in tool calls or status) is the primary signal for Categories 1, 3, and 5.

A useful diagnostic habit: read the agent’s last comment before the done status, then check what the git diff or file output actually shows. If they don’t match, you have a Category 1 or 3 failure.
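
That habit can be roughed out in a few lines. The sketch below pulls path-like tokens out of the last comment and checks them against git diff --name-only for the heartbeat's commit range; the commit range and comment format are assumptions, and the token matching is deliberately crude.

# Sketch: compare what the last comment claims was changed against what
# git actually recorded. Commit range and comment format are assumptions.
import re
import subprocess

def changed_files(rev_range: str = "HEAD~1..HEAD") -> set[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", rev_range],
        capture_output=True, text=True, check=True,
    )
    return {line.strip() for line in out.stdout.splitlines() if line.strip()}

def claimed_files(last_comment: str) -> set[str]:
    # Crude: any token that looks like a relative file path with an extension.
    return set(re.findall(r"[\w./-]+\.[A-Za-z]{1,4}", last_comment))

def comment_diff_mismatch(last_comment: str, rev_range: str = "HEAD~1..HEAD") -> set[str]:
    """Files the comment claims were touched but the diff doesn't show."""
    return claimed_files(last_comment) - changed_files(rev_range)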

Memory File Diff as Drift Indicator

Comparing the memory file at heartbeat N to the memory file at heartbeat N-20 reveals whether the agent’s working facts have accumulated or shifted. New facts added without corresponding removal of contradicted older facts are the fingerprint of accumulation drift (Category 4). This is why our memory setup guide recommends treating memory files as editable, not append-only.
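
Memory file formats vary by setup, so treat the following as a minimal sketch that assumes one fact per line: it reports what was added and removed between two snapshots, and flags the many-additions-no-removals pattern that usually means accumulation rather than editing.

# Sketch: diff two memory file snapshots (e.g. heartbeat N-20 vs N).
# Assumes one fact per line; adapt the parsing to your memory format.
from pathlib import Path

def load_facts(path: str) -> set[str]:
    return {
        line.strip()
        for line in Path(path).read_text().splitlines()
        if line.strip() and not line.startswith("#")
    }

def drift_report(old_snapshot: str, new_snapshot: str) -> dict:
    old, new = load_facts(old_snapshot), load_facts(new_snapshot)
    added, removed = new - old, old - new
    return {
        "added": sorted(added),
        "removed": sorted(removed),
        # Many additions with no removals is the accumulation fingerprint
        # for Category 4; deliberate editing removes contradicted facts.
        "accumulation_suspect": len(added) >= 10 and not removed,
    }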

Checkout State Audit

Checking which agent holds checkout on a task — and when that checkout was acquired — immediately surfaces Category 2 failures. A checkout held for more than two expected heartbeat windows with no new comments is a ghost checkout. The task is stalled; no agent is actively working on it.
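
The audit itself reduces to a timestamp comparison. Assuming a board export that includes a checkout time and a last-comment time per task (field names are ours, not a Paperclip schema), the sketch below flags any in-progress task held past two heartbeat windows with nothing posted since the checkout.

# Sketch: flag ghost checkouts from a board export.
# Field names, the timestamp format (timezone-aware ISO 8601 strings),
# and HEARTBEAT_WINDOW are assumptions.
from datetime import datetime, timedelta, timezone

HEARTBEAT_WINDOW = timedelta(hours=1)   # set to your actual heartbeat cadence

def ghost_checkouts(tasks: list[dict], now: datetime | None = None) -> list[dict]:
    """In-progress tasks held for more than two heartbeat windows with no activity since checkout."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for task in tasks:
        if task.get("status") != "in_progress":
            continue
        checked_out_at = datetime.fromisoformat(task["checked_out_at"])
        last_comment_at = datetime.fromisoformat(task["last_comment_at"])
        if now - checked_out_at > 2 * HEARTBEAT_WINDOW and last_comment_at <= checked_out_at:
            stale.append(task)
    return stale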

Get the full diagnostic playbook →


What Compound Stack’s Failure Rate Looked Like

We tracked failures across Phases 1–5 of Compound Stack’s operation. These figures are reconstructed from our git history, issue board audit, and comment thread review rather than real-time instrumentation — call them calibrated estimates.

Before structured debugging:

  • Category 1 (silent wrong output) and Category 4 (instruction drift) were the two most frequent failure types, together accounting for an estimated 65–70% of failures by count.
  • Average detection lag: approximately 3–5 heartbeats between a failure occurring and a human catching it in review. For drift, detection lag was longer — sometimes 15+ heartbeats.
  • Ghost checkout frequency: we identified an estimated 1–2 ghost checkouts per week during periods of active multi-agent coordination, each stalling work for multiple heartbeat cycles until manually resolved.
  • Human correction cost: an estimated 30–45 minutes of human review and re-run time per confirmed failure, excluding the compounded rework when downstream agents had already built on the bad output.

After structured debugging:

  • Detection lag dropped substantially for Categories 1 and 3 — in-heartbeat checkpoints now surface these before the agent exits.
  • Ghost checkout incidents dropped to near-zero with structured resolution protocols.
  • Category 4 drift is now detected earlier, before it becomes visible in final output, through regular memory file comparison.
  • The human correction time per failure dropped — not because failures immediately became rarer, but because structured categorization made diagnosis faster and handoff to humans more precise.

The before/after contrast isn’t the dramatic number. The dramatic number is the detection lag before structured debugging — failures compounding across multiple heartbeats while the board showed green. The cost calculator can help estimate what undetected failure lag costs in agent-hours.


What a Working Debugging System Produces

The outcomes are worth being specific about, because the goal of debugging infrastructure isn’t to eliminate failures — it’s to make failures small and fast to resolve.

Failures get caught in the same heartbeat they occur. A working system includes in-heartbeat checkpoints that surface Category 1 and Category 3 failures before the agent exits and marks the task done. Wrong output doesn’t propagate to the next heartbeat.
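
As a rough illustration of what such a checkpoint looks like, the sketch below runs a set of checks before the status transition and blocks the done state if any fail. The Heartbeat fields are placeholders for whatever signals your setup can actually compute.

# Sketch: an in-heartbeat checkpoint that blocks the "done" transition
# when a Category 1 or Category 3 signal is present. Fields are placeholders.
from dataclasses import dataclass, field

@dataclass
class Heartbeat:
    planned_tools: set[str] = field(default_factory=set)
    executed_tools: set[str] = field(default_factory=set)
    failed_tool_calls: list[str] = field(default_factory=list)

def pre_done_checkpoint(hb: Heartbeat) -> list[str]:
    """Return problems found; an empty list means the agent may mark the task done."""
    problems = []
    if hb.failed_tool_calls:                            # Category 3 signal
        problems.append("a tool call failed mid-run; output may rest on empty data")
    if hb.planned_tools - hb.executed_tools:            # Category 1/3 signal
        problems.append("plan/action divergence: announced tools were never called")
    return problems

# Run the checkpoint before the status transition, not in a later review.
hb = Heartbeat(planned_tools={"web_search", "file_write"},
               executed_tools={"file_write"})
status = "blocked" if pre_done_checkpoint(hb) else "done"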

Ghost checkouts are automatically resolved. Checkout conflicts are detected and resolved on a defined cadence rather than requiring a human to notice a task is stalled. The resolution is procedural, not manual.

Drift is detected before it becomes visible in output. A structured drift audit compares memory file state across defined intervals. Drift surfaces as an internal signal before it influences the agent’s visible behavior — before the output quality degrades noticeably.

Failures are categorized, not just logged. The debugging system doesn’t produce a stack trace; it produces a structured failure report that names the category, the likely cause, and the recommended diagnostic next step. When a failure does require human escalation, the human receives a scoped question (“Category 4 drift detected — memory file review needed”) rather than an undiagnosed incident.
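
The report itself can be small. The shape below is one possible layout (our own, not a Paperclip artifact): category, suspected cause, and the next diagnostic step, with a flag for whether a human needs to look.

# Sketch: a structured failure report instead of a raw log line.
# The fields are our own layout, not anything Paperclip defines.
from dataclasses import dataclass, asdict
import json

@dataclass
class FailureReport:
    task_id: str
    category: int          # 1-5, per the categories in this article
    summary: str
    suspected_cause: str
    next_step: str
    needs_human: bool = False

report = FailureReport(
    task_id="task-218",    # hypothetical task id
    category=4,
    summary="Output format drifted from the style used in earlier heartbeats",
    suspected_cause="memory file accumulation; contradictory formatting facts",
    next_step="memory file review against the original style instruction",
    needs_human=True,
)
print(json.dumps(asdict(report), indent=2))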

Correction cost is bounded. With failures caught early and categorized clearly, the human time per incident drops significantly. Most Category 2 and 3 failures resolve procedurally without human escalation at all.


The Problem Is Solvable — Here’s What We Used

After diagnosing our own failure patterns across five phases of Compound Stack’s operation, we built a structured debugging process covering all five failure categories. It includes diagnostic prompts, checkpoint structures, a drift audit procedure, and a ghost checkout resolution workflow. It took three months to calibrate against real failure data.

That process is now part of every new agent we onboard. It runs alongside existing agent instructions without restructuring the company.

The full toolkit is at /templates/. The Paperclip agent debugging template includes structured diagnostic flows for each failure category, checkpoint prompt templates, a drift audit procedure, a ghost checkout resolution workflow, and escalation criteria. It installs alongside existing agents without restructuring your company.

If you’ve run agents for more than 6 weeks and have not implemented structured in-heartbeat checkpoints, you have undetected Category 1 and 3 failures. The detection lag means you’re already behind the failures, not ahead of them.

Get the agent debugging template →


Conclusion

AI agent debugging is hard because the most common failure modes don’t announce themselves. Silent wrong output, ghost checkouts, and instruction drift look like normal operation until their downstream effects accumulate. By the time they’re obvious, you’re cleaning up failures that compound across heartbeats.

We built Compound Stack’s debugging system after three months of unstructured failure diagnosis across these five categories. The structured version eliminated detection lag for in-heartbeat failures and reduced human correction time substantially. We could see — clearly and early — what was wrong and what category it belonged to.

The full implementation is the agent debugging template at /templates/.

If coordination failures between multiple agents are part of what you’re debugging, see the companion article: Paperclip multi-agent coordination.

Get the agent debugging template →