Failure Classification

Maev automatically classifies failures at the end of every session. The classification engine analyzes all captured events and assigns a category, subcategory, severity, and reason.

Failure categories

Category | Description | Severity
Context exhaustion | Agent ran out of context window before completing the task | High
RAG failure | Retrieval returned empty, irrelevant, or low-quality results | Medium
Cost anomaly | Session cost exceeded expected thresholds | Low to Critical
Tool failure | A tool or function call returned an error or unexpected result | Medium to High
Infinite loop | The same tool or action was called 5 or more times in sequence | High
Goal drift | Agent deviated from its original task or objective | Medium
Prompt injection | User input attempted to override system instructions | Critical
Hallucination | Agent produced fabricated or unsupported information | High
Latency spike | A single step took longer than expected | Low to High
Silent failure | Session completed but produced no meaningful output | Medium

Severity levels

Severity | Meaning
low | Worth tracking but not urgent
medium | Investigate when you have time
high | Investigate soon, user experience impacted
critical | Investigate immediately

Cost thresholds

Cost anomalies are triggered based on the total cost of a session:

  • Over $1.00 per session: high severity
  • Over $5.00 per session: critical severity
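
As a rough illustration, the two cost thresholds above could be expressed as a small lookup. This is a minimal sketch, not Maev's implementation: the function and constant names are hypothetical, "over" is assumed to mean strictly greater than, and only the two documented thresholds are covered.

```python
from typing import Optional

# Thresholds copied from the list above; the names are hypothetical.
COST_HIGH_USD = 1.00
COST_CRITICAL_USD = 5.00


def cost_severity(session_cost_usd: float) -> Optional[str]:
    """Return the severity for a session's total cost.

    Returns None when neither documented threshold is exceeded.
    Assumes "over" means strictly greater than the threshold.
    """
    if session_cost_usd > COST_CRITICAL_USD:
        return "critical"
    if session_cost_usd > COST_HIGH_USD:
        return "high"
    return None
```

The higher threshold is checked first so that a session costing more than $5.00 is reported as critical rather than high.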

Latency thresholds

Latency spikes are triggered based on the duration of a single LLM call within a session:

  • Over 10 seconds: high severity
  • Over 30 seconds: critical severity
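
Because the latency rule looks at a single LLM call rather than the session total, a sketch of it only needs the slowest call. The example below assumes the call durations are available as a list of seconds; the function and constant names are hypothetical, and "over" is assumed to mean strictly greater than.

```python
from typing import Iterable, Optional

# Thresholds copied from the list above; the names are hypothetical.
LATENCY_HIGH_S = 10.0
LATENCY_CRITICAL_S = 30.0


def latency_severity(call_durations_s: Iterable[float]) -> Optional[str]:
    """Return the severity implied by the slowest single LLM call.

    Returns None when no call exceeds a documented threshold,
    including when the session contains no LLM calls at all.
    """
    slowest = max(call_durations_s, default=0.0)
    if slowest > LATENCY_CRITICAL_S:
        return "critical"
    if slowest > LATENCY_HIGH_S:
        return "high"
    return None
```

Note that many moderately slow calls never trigger this check: a session with twenty 9-second calls returns None, while one 12-second call returns "high".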

Loop detection

Infinite loops are detected when the same tool or function is called 5 or more times consecutively within a session.
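
The consecutive-call rule can be sketched with a run-length check over the ordered tool names. This is an illustration only: it assumes tool calls are compared by name alone (arguments are ignored) and that any other tool call in between resets the count. The function name is hypothetical.

```python
from itertools import groupby
from typing import Iterable, Optional

# Threshold from the rule above; the name is hypothetical.
LOOP_THRESHOLD = 5


def find_loop(tool_calls: Iterable[str], threshold: int = LOOP_THRESHOLD) -> Optional[str]:
    """Return the first tool name called `threshold` or more times in a row.

    `tool_calls` is the ordered sequence of tool names for one session.
    Returns None when no run reaches the threshold.
    """
    # groupby yields one group per run of identical adjacent names.
    for name, run in groupby(tool_calls):
        if sum(1 for _ in run) >= threshold:
            return name
    return None
```

Under these assumptions, five back-to-back calls to the same tool are flagged, but four calls, a different tool, and then another call to the first tool are not, because the run is broken.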

How classification works

  1. The session closes (agent exits or sends a session.end event)
  2. All events for the session are loaded
  3. The classification engine runs each rule against the events in priority order
  4. The first matching rule wins and its result is stored
  5. If a failure is found, an alert is sent

Only one failure category is assigned per session: the engine evaluates rules in priority order and stops at the first match, so a session that matches several rules is reported under the highest-priority one. If you need more granular quality checks, use Inspectors.
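
The first-match-wins behavior described above can be sketched as a loop over an ordered list of rules. Everything here is an assumption made for illustration: the `Classification` fields mirror the four outputs named at the top of this page, but the event shape (`type` keys such as "tool.error" and "output"), the subcategory values, the two example rules, and their relative priority are hypothetical and are not taken from Maev.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

# Hypothetical event shape: a plain dict with at least a "type" key.
Event = Dict[str, object]


@dataclass
class Classification:
    category: str
    subcategory: str
    severity: str
    reason: str


Rule = Callable[[Sequence[Event]], Optional[Classification]]


def tool_failure_rule(events: Sequence[Event]) -> Optional[Classification]:
    """Example rule: match if any event is a tool error."""
    errors = [e for e in events if e.get("type") == "tool.error"]
    if errors:
        reason = f"{len(errors)} tool call(s) returned an error"
        return Classification("Tool failure", "tool_error", "medium", reason)
    return None


def silent_failure_rule(events: Sequence[Event]) -> Optional[Classification]:
    """Example rule: match if the session has no output event."""
    if not any(e.get("type") == "output" for e in events):
        return Classification("Silent failure", "no_output", "medium",
                              "session produced no output event")
    return None


def classify(events: Sequence[Event], rules: Sequence[Rule]) -> Optional[Classification]:
    """Run rules in priority order and return the first match, or None."""
    for rule in rules:
        result = rule(events)
        if result is not None:
            return result  # first match wins; later rules are not evaluated
    return None
```

In this sketch, a session containing a tool error and no output matches both rules, and the order of the `rules` list alone decides which category is stored. That is the practical consequence of assigning one category per session.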