December 19, 2024 · 13 min read

59% of Our Agent Failures Lasted Under 10 Seconds. We Debugged Them Like Logic Bugs.

Binary success/failure metrics are killing your debugging velocity. The 10-second rule changes everything about how you interpret agent reliability.

Tags: failure taxonomy, infra noise, agent observability, MCP, completion rate, agent-swarm

[Image: Hard to swallow pills meme about quick-fail infrastructure noise in agent systems. The uncomfortable operational pill: most red Xs are not reasoning failures.]

Last Tuesday, our on-call engineer spent three hours debugging an agent that kept “failing” every eight seconds. She traced the execution graph, checked the LLM prompts, reviewed the tool schemas. The agent looked perfect on paper.

The actual cause? An MCP server in our staging environment was rejecting connections due to a certificate rotation. The agent never made it past the initialization handshake. It failed in 8.2 seconds, and because our dashboard showed a red X next to “completion rate: 73%,” she treated it like a logic bug.

This happens because agent frameworks treat all failures as equal. A network hiccup during session startup and a two-hour agent loop that hallucinates its way into a recursive tool call both count as one failure. One is infrastructure noise. The other is a behavioral bug. Debugging them the same way wastes engineering time and hides real reliability metrics.

The DES-515 Discovery: 59.2% of Failures Are Lies

We ran a seven-day sweep across our production swarm (DES-515) analyzing 652 session failures. The question was simple: how long do failed sessions actually run before dying?

Session Duration Distribution (Failed Sessions)

Duration	Count	Percentage	Classification
< 10 seconds	386	59.2%	Quick-fail
10s - 5 minutes	142	21.8%	Mixed
> 5 minutes	124	19.0%	Behavioral

Data source: DES-515 production sweep, 7 days, n=652 failed sessions

Nearly six in ten failures lasted under ten seconds. These agents did not fail because of bad prompts, incorrect tool usage, or reasoning errors. They failed because the MCP server was warming up, the network hiccuped during the initial context load, or the session runner crashed before processing the first message.

An agent that dies in eight seconds cannot have done meaningful work. It never reached its first tool call. It certainly did not hallucinate or enter an infinite loop. Yet in our old dashboard, it counted the same as an agent that ran for two hours before hitting a context limit.
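To reproduce the distribution on your own data, a minimal sketch along these lines may be enough. The bucket boundaries come from the table above; the function and type names (`bucketFor`, `durationHistogram`, `percentages`) are ours for illustration and are not part of any framework.

```typescript
type DurationBucket = "quick_fail" | "mixed" | "behavioral";

// Boundaries from the DES-515 table: < 10 s, 10 s to 5 min, > 5 min.
const QUICK_FAIL_MS = 10_000;
const LONG_RUN_MS = 5 * 60 * 1000;

function bucketFor(durationMs: number): DurationBucket {
  if (durationMs < QUICK_FAIL_MS) return "quick_fail";
  if (durationMs <= LONG_RUN_MS) return "mixed";
  return "behavioral";
}

// Count failed sessions per bucket from a list of durations in milliseconds.
function durationHistogram(durationsMs: number[]): Record<DurationBucket, number> {
  const counts: Record<DurationBucket, number> = { quick_fail: 0, mixed: 0, behavioral: 0 };
  for (const d of durationsMs) counts[bucketFor(d)] += 1;
  return counts;
}

// Share of each bucket as a percentage, rounded to one decimal place.
function percentages(counts: Record<DurationBucket, number>): Record<DurationBucket, number> {
  const total = counts.quick_fail + counts.mixed + counts.behavioral;
  const pct = (n: number): number => (total === 0 ? 0 : Math.round((n / total) * 1000) / 10);
  return {
    quick_fail: pct(counts.quick_fail),
    mixed: pct(counts.mixed),
    behavioral: pct(counts.behavioral),
  };
}
```

Feeding the DES-515 counts (386, 142, 124) into `percentages` gives back the 59.2%, 21.8%, and 19.0% shares in the table.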

What Quick-Fails Actually Look Like

Quick-fails have distinct signatures. They cluster around initialization boundaries:

MCP server timeouts: The agent requests a tool list, the server does not respond within the default timeout, and the session runner aborts.
Session runner crashes: The runner OOMs while loading the system prompt or initializing the tokenizer.
Network hiccups during startup: DNS resolution fails for the LLM API endpoint during the first completion request.
Credential expiration: The session starts with expired tokens and fails on the first authenticated tool call.

Notice the pattern: these are infrastructure events, not agent behaviors. They would fail the same way regardless of which LLM you used or what prompt you wrote. Debugging them by checking the agent’s reasoning chain is like debugging a 504 Gateway Timeout by reading your application logs. The error happened upstream.

The Debugging Tax: Why Classification Matters

Without taxonomy, every failure triggers the same investigation ritual.

When you do not classify failures by duration, your debugging workflow looks like this:

1. See failure alert
2. Open session trace
3. Review agent reasoning steps
4. Check tool outputs
5. Compare with previous successful runs
6. Realize the agent never got that far
7. Check infrastructure logs
8. Find the network timeout

Steps 3-6 are pure waste. They happen because your monitoring system presented an eight-second infrastructure hiccup and a two-hour logic bug as identical red Xs. Over the course of DES-515, we estimate our team spent approximately 40 hours debugging quick-fails as if they were behavioral failures.

The cost is not just time. It is cognitive load. When your failure dashboard is 60% noise, engineers develop alert fatigue. They start ignoring failure notifications because it is probably just a network blip. Then they miss the real behavioral bugs hiding in the remaining 40%.

The <10s Rule: Simple, Implementable, Transformative

If session_duration < 10s, route to infrastructure monitoring. Full stop.

The fix is embarrassingly simple. We implemented a classification rule in our session ingestion pipeline:

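// Assumed minimal shape of the session record; your runner's type will differ.
interface Session {
  sessionId: string;
  durationMs: number;
  terminationReason?: string;
}

interface SessionClassification {
  sessionId: string;
  durationMs: number;
  failureType: "INFRA_QUICK_FAIL" | "BEHAVIORAL_FAILURE" | "TIMEOUT";
  routing: "infra_team" | "agent_team";
}

function classifySession(session: Session): SessionClassification {
  const { sessionId, durationMs } = session;
  const durationSec = durationMs / 1000;

  // Under 10 seconds the agent never got past initialization.
  if (durationSec < 10) {
    return { sessionId, durationMs, failureType: "INFRA_QUICK_FAIL", routing: "infra_team" };
  }

  // The agent ran and exhausted its step budget: a behavioral problem.
  if (session.terminationReason === "MAX_STEPS_EXCEEDED") {
    return { sessionId, durationMs, failureType: "BEHAVIORAL_FAILURE", routing: "agent_team" };
  }

  // Additional classification logic...

  // Assumed default so every path returns: anything that ran past
  // initialization goes to the agent team until a more specific rule matches.
  return { sessionId, durationMs, failureType: "BEHAVIORAL_FAILURE", routing: "agent_team" };
}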

This single conditional changes everything about your failure dashboard. Quick-fails route to the infrastructure team’s PagerDuty rotation, not the agent engineers. They get batched and analyzed as reliability metrics, not individual debugging sessions. The 59% noise floor disappears from your behavioral debugging queue.
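Batching is the part that keeps quick-fails from turning into individual tickets. A rough sketch of what that could look like, assuming each quick-fail carries an end timestamp in epoch milliseconds and a termination reason; the `QuickFail` shape, the five-minute window, and the reason strings are illustrative assumptions, not our production schema.

```typescript
interface QuickFail {
  sessionId: string;
  endedAtMs: number;         // epoch milliseconds (assumed)
  terminationReason: string; // e.g. "MCP_TIMEOUT" (illustrative value)
}

interface QuickFailBatch {
  windowStartMs: number;
  total: number;
  byReason: Record<string, number>;
}

// Group quick-fails into fixed time windows and count them per reason,
// so the infra team sees one aggregate per window instead of N alerts.
function batchQuickFails(fails: QuickFail[], windowMs: number = 5 * 60 * 1000): QuickFailBatch[] {
  const batches = new Map<number, QuickFailBatch>();
  for (const f of fails) {
    const windowStartMs = Math.floor(f.endedAtMs / windowMs) * windowMs;
    let batch = batches.get(windowStartMs);
    if (!batch) {
      batch = { windowStartMs, total: 0, byReason: {} };
      batches.set(windowStartMs, batch);
    }
    batch.total += 1;
    batch.byReason[f.terminationReason] = (batch.byReason[f.terminationReason] ?? 0) + 1;
  }
  // Oldest window first.
  return Array.from(batches.values()).sort((a, b) => a.windowStartMs - b.windowStartMs);
}
```

A spike of one reason inside a single window (say, forty MCP timeouts in five minutes) then reads as one infrastructure incident rather than forty agent failures.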

Infra-Adjusted Completion Rate: The KPI That Actually Matters

Raw completion rate is a vanity metric. Here is the math from DES-515:

Completion Rate Calculation

Total Sessions:2,847

Failed Sessions (raw):652

Raw Completion Rate:77.1%

Quick-fails (<10s):-386

Infra-Adjusted Failures:266

Infra-Adjusted Completion Rate:90.7%

The raw metric said 77% reliability. The adjusted metric said 91%. That is a 13.6 percentage point difference in how you perceive your system’s health. When we presented the infra-adjusted rate to stakeholders, the conversation shifted from “why is our agent so unreliable?” to “how do we reduce MCP server startup latency?” That is a much more tractable problem.
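The arithmetic is worth writing down, because there are two reasonable ways to adjust. The sketch below computes both from the DES-515 counts: the 90.7% figure keeps all 2,847 sessions in the denominator and simply stops counting quick-fails as failures, while a stricter variant drops quick-fails from the denominator as well. The names (`SweepCounts`, `completionRates`) are hypothetical.

```typescript
interface SweepCounts {
  totalSessions: number;
  failedSessions: number;
  quickFails: number; // failed sessions that lasted under 10 seconds
}

// Percentage rounded to one decimal place.
const pct = (num: number, den: number): number =>
  den === 0 ? 0 : Math.round((num / den) * 1000) / 10;

function completionRates(c: SweepCounts) {
  const completed = c.totalSessions - c.failedSessions;
  const adjustedFailures = c.failedSessions - c.quickFails;
  return {
    raw: pct(completed, c.totalSessions),
    // Quick-fails no longer count as failures; denominator unchanged.
    infraAdjusted: pct(c.totalSessions - adjustedFailures, c.totalSessions),
    // Stricter variant: quick-fails are removed from the denominator too.
    infraAdjustedExcludingQuickFails: pct(completed, c.totalSessions - c.quickFails),
    adjustedFailures,
  };
}

const des515 = completionRates({ totalSessions: 2847, failedSessions: 652, quickFails: 386 });
// des515.raw === 77.1, des515.infraAdjusted === 90.7,
// des515.infraAdjustedExcludingQuickFails === 89.2, des515.adjustedFailures === 266
```

Whichever variant you pick, state the denominator next to the number, and use the same definition in the dashboard query and the SLA.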

Your SLA should be based on infra-adjusted completion rate. Your autoscaling policies should ignore quick-fails. Your debugging runbooks should route them to infrastructure monitoring, not agent engineers.

The HTTP Status Code Analogy: 30 Years of Solved Problems

Web servers have returned three-digit status codes since the early 1990s. 504 Gateway Timeout tells you a gateway or proxy did not get a timely response from the upstream. 500 Internal Server Error tells you the server itself hit an unexpected error. You do not debug them the same way. You do not alert on them the same way. You do not count them in the same reliability metrics.

Agent frameworks have not learned this lesson yet. Most ship a binary success/failure KPI as if a timeout during tool discovery and a hallucination-induced infinite loop deserve the same classification. They do not. The diagnostic path differs completely:

Quick-fail (infra): Check MCP server health, network connectivity, runner resource limits.
Behavioral failure: Check prompt engineering, tool schemas, reasoning chain, context window usage.

We need failure taxonomy as a first-class feature in agent frameworks. Not as an afterthought. Not as a custom attribute you can attach if you write the code. As a core primitive, like HTTP status codes.

What Does Not Work: Complex ML Classification

We tried using an LLM to classify failures. Feed in the session logs, ask whether this was infrastructure or behavioral. It failed for obvious reasons: the LLM cannot distinguish an MCP timeout (infrastructure) from a tool that intentionally takes eight seconds to respond (behavioral but slow). It also added 2-3 seconds of latency to every failure analysis, which matters when you are processing thousands of sessions.

We also tried manual tagging. Engineers would tag failures after debugging them. This worked for about 48 hours before the backlog grew too large. Human classification does not scale to swarm volumes. The 10-second rule requires zero human intervention and zero ML inference. It is deterministic, fast, and correct enough.

Do not over-engineer this. You are not building a general-purpose failure classifier. You are separating never-started from started-but-failed. Duration is a cheap, good-enough proxy for that distinction.
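If you want to check how well duration tracks "never started" on your own data, one option is to compare it against whether a first tool call was ever recorded. This is a sketch under assumptions: it presumes your session records carry a `firstToolCallAt` timestamp (null when no tool call happened) and epoch-millisecond times, and the `checkDurationProxy` helper is hypothetical.

```typescript
interface FailedSession {
  id: string;
  startedAt: number;              // epoch milliseconds (assumed)
  endedAt: number;                // epoch milliseconds (assumed)
  firstToolCallAt: number | null; // null if the agent never called a tool
}

interface ProxyCheck {
  agree: number;                 // duration rule and tool-call signal match
  quickButStarted: string[];     // under threshold, yet a tool call happened
  slowButNeverStarted: string[]; // over threshold, yet no tool call happened
}

function checkDurationProxy(sessions: FailedSession[], thresholdMs: number = 10_000): ProxyCheck {
  const result: ProxyCheck = { agree: 0, quickButStarted: [], slowButNeverStarted: [] };
  for (const s of sessions) {
    const quick = s.endedAt - s.startedAt < thresholdMs;
    const started = s.firstToolCallAt !== null;
    if (quick && started) {
      result.quickButStarted.push(s.id);
    } else if (!quick && !started) {
      result.slowButNeverStarted.push(s.id);
    } else {
      result.agree += 1;
    }
  }
  return result;
}
```

The two disagreement lists are the sessions worth a manual look. If they stay small, the threshold is doing its job; if `slowButNeverStarted` grows, some infrastructure failures are hanging past 10 seconds before they die.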

Implementation: Adding Taxonomy to Your Pipeline

Here is how we modified our session event consumer to support infra-adjusted metrics:

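// SessionEvent and MetricsPayload are defined elsewhere in the pipeline;
// startedAt, endedAt, and firstToolCallAt are assumed to be epoch milliseconds.
export function processSessionMetrics(session: SessionEvent): MetricsPayload {
  const durationSec = (session.endedAt - session.startedAt) / 1000;
  const failed = session.status === "failed";
  const isQuickFail = failed && durationSec < 10;

  return {
    session_id: session.id,
    raw_completion_status: session.status,
    // Only failed sessions get a category and a routing target;
    // completed sessions must not be labeled "behavioral".
    failure_category: !failed ? null : isQuickFail ? "infra_noise" : "behavioral",
    infra_adjusted_status: isQuickFail ? "excluded" : session.status,
    team_routing: !failed ? null : isQuickFail ? "platform_sre" : "agent_engineering",
    duration_seconds: durationSec,
    termination_reason: session.terminationReason,
    // Seconds from session start to the first tool call, or null if none happened.
    first_tool_call_latency: session.firstToolCallAt != null
      ? (session.firstToolCallAt - session.startedAt) / 1000
      : null,
  };
}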

Then in your dashboard queries, filter by failure_category:

-- Raw completion rate (includes noise)
SELECT
  COUNT(*) FILTER (WHERE status = 'completed') * 100.0 / COUNT(*)
FROM sessions;

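-- Infra-adjusted completion rate (signal only).
-- Quick-fails are dropped from the denominator here, so on the DES-515 counts
-- this returns 2195 / 2461 (about 89.2%), slightly below the 90.7% figure,
-- which keeps all 2,847 sessions in the denominator.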
SELECT
  COUNT(*) FILTER (WHERE status = 'completed') * 100.0 /
  COUNT(*) FILTER (WHERE failure_category != 'infra_noise' OR status = 'completed')
FROM sessions;

This approach also enables proper alerting thresholds. We alert on quick-fail rates exceeding 5% of total traffic (infrastructure problem), while behavioral failure alerts trigger on different thresholds based on the specific agent’s complexity.
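The quick-fail alert itself is a one-line comparison. A minimal sketch, assuming you already have per-window counts of total sessions and quick-fails; the 5% threshold is the one we use, while the `minSessions` floor is an added assumption to keep tiny windows from paging anyone.

```typescript
interface WindowCounts {
  totalSessions: number;
  quickFails: number;
}

// Quick-fails as a fraction of all traffic in the window.
function quickFailRate(w: WindowCounts): number {
  return w.totalSessions === 0 ? 0 : w.quickFails / w.totalSessions;
}

// Alert when quick-fails exceed the threshold share of total traffic.
// minSessions is a hypothetical guard against noisy low-volume windows.
function shouldAlertInfra(w: WindowCounts, threshold: number = 0.05, minSessions: number = 20): boolean {
  if (w.totalSessions < minSessions) return false;
  return quickFailRate(w) > threshold;
}
```

For scale: across the whole DES-515 sweep, 386 quick-fails out of 2,847 sessions is about 13.6% of traffic, well over this threshold.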

The Prediction: Failure Taxonomy Becomes Mandatory

Within 12 months, every serious agent framework will treat failure taxonomy as a core feature, not a plugin. We will see standardized error categories: INFRA_INIT_TIMEOUT, INFRA_MCP_UNAVAILABLE, AGENT_MAX_STEPS, and AGENT_CONTEXT_LIMIT.
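As a sketch of what such a taxonomy could look like in code, here are the four category names above as a TypeScript union, with the class derived from the prefix the way the first digit of an HTTP status code gives you its class. No framework ships this today as far as we know; the type names and the routing table are illustrative, with the routing targets borrowed from our own pipeline.

```typescript
type FailureCode =
  | "INFRA_INIT_TIMEOUT"
  | "INFRA_MCP_UNAVAILABLE"
  | "AGENT_MAX_STEPS"
  | "AGENT_CONTEXT_LIMIT";

type FailureClass = "infra" | "agent";

// The prefix carries the class, much like the leading digit of an HTTP status.
function failureClass(code: FailureCode): FailureClass {
  return code.startsWith("INFRA_") ? "infra" : "agent";
}

// Hypothetical routing table: which team owns each class of failure.
const ROUTING: Record<FailureClass, string> = {
  infra: "platform_sre",
  agent: "agent_engineering",
};

function routeFailure(code: FailureCode): string {
  return ROUTING[failureClass(code)];
}
```

Because `FailureCode` is a closed union, adding a new category forces every switch and lookup over it to be revisited at compile time, which is the property you want from a core primitive.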

Binary success/failure KPIs will be recognized as production-unfit, the same way we view web servers that do not differentiate between 504 and 500. The sub-10-second rule, or similar duration-based heuristics, will become standard practice for separating infrastructure noise from behavioral signals.

Until then, implement it yourself. Check your own session data. We suspect you will find a similar 50-60% quick-fail rate. Stop debugging network hiccups as if they were reasoning errors. Your sanity, and your infra-adjusted completion rate, will thank you.

FAQ

What is a quick-fail in agent systems?

A session failure completing in under 10 seconds, typically caused by infrastructure issues like MCP server timeouts, network hiccups, or runner crashes before the agent executes meaningful logic.

Why use 10 seconds as the classification threshold?

In our DES-515 data, 59.2% of failures ended in under 10 seconds, and our agents need roughly 10 seconds to initialize context and make their first tool call. Anything faster is almost certainly infrastructure noise.

How do you calculate infra-adjusted completion rate?

Subtract quick-fails from total failures before calculating completion percentage. This typically reveals a materially higher real completion rate than raw metrics suggest.

What monitoring tools support failure taxonomy?

Most APM tools support custom dimensions. Implement the sub-10-second classification rule in your ingestion pipeline, tagging sessions as infrastructure noise versus behavioral failure for proper routing and alerting.
