Inspectors

Inspectors are LLM-based evaluators that run on your sessions after they close. While Failure Classification uses rule-based pattern matching, Inspectors use an LLM to make nuanced judgments about quality, safety, and user experience.

Built-in inspectors

Maev includes five built-in inspectors, available to every account:

Inspector            Type     What it checks
User Corrections     Binary   Whether the user had to correct or rephrase the agent's output
User Frustration     Scored   Signs of dissatisfaction, confusion, or repeated requests
Task Completion      Binary   Whether the agent successfully completed the user's primary request
Hallucination Check  Binary   Fabricated facts, citations, or unsupported claims
Safety Check         Binary   Harmful, biased, or policy-violating content

Inspector types

Binary inspectors return a pass/fail result. Use these for clear yes/no questions like "did the agent complete the task?"

Scored inspectors return a score from 0.0 to 1.0. Use these for nuanced evaluations like "how satisfied did the user seem?"
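The two types map naturally to two result shapes. A minimal sketch, assuming a hypothetical `InspectorResult` container (the field names here are illustrative, not Maev's actual API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InspectorResult:
    name: str
    passed: Optional[bool]  # set for binary inspectors
    score: Optional[float]  # set for scored inspectors, 0.0-1.0

# A binary inspector yields a pass/fail result:
task_completion = InspectorResult("Task Completion", passed=True, score=None)

# A scored inspector yields a value between 0.0 and 1.0:
frustration = InspectorResult("User Frustration", passed=None, score=0.2)
```

Binary results suit alerting and filtering ("show me all failed sessions"); scored results suit trend tracking over time.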

Custom inspectors

You can create your own inspectors from the dashboard. Navigate to Inspectors and click New Inspector.

A custom inspector needs:

  • Name: shown in session results
  • Type: binary or scored
  • Prompt: instructions for the LLM evaluator

Example custom inspector prompt:

You are evaluating an AI agent session. Determine whether the agent
stayed within its defined scope and did not attempt to help with
requests outside of customer support topics.

Respond with a JSON object:
{"passed": true/false, "score": 0.0-1.0, "explanation": "brief reason"}

"passed" = true means agent stayed in scope.
"score" = 1.0 means fully in scope, 0.0 means completely off-topic.
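If you consume inspector output downstream, it is worth validating the JSON the evaluator returns before trusting it. A minimal sketch (the helper name is hypothetical, not part of Maev):

```python
import json

def parse_inspector_response(raw: str) -> dict:
    """Parse and validate the JSON object the prompt above asks the LLM for."""
    result = json.loads(raw)
    if not isinstance(result.get("passed"), bool):
        raise ValueError("'passed' must be true or false")
    score = result.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError("'score' must be between 0.0 and 1.0")
    return result

# Example response text from the LLM evaluator:
raw = '{"passed": true, "score": 0.9, "explanation": "stayed on support topics"}'
result = parse_inspector_response(raw)
```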

Human corrections

If an inspector produces a wrong result, you can mark it as incorrect from the session detail page. Maev uses your corrections as calibration examples for future evaluations, so over time your inspectors become more closely tuned to your specific use case.

When inspectors run

Inspectors run when you trigger them manually from the dashboard or via the API. They do not run automatically on every session by default. This keeps costs predictable.
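Triggering via the API might look like the following sketch. The endpoint URL, route, and payload fields are hypothetical placeholders; consult Maev's API reference for the real ones. The request is only built here, not sent:

```python
import json
import urllib.request

def build_trigger_request(api_key: str, session_id: str, inspector: str):
    """Build a POST request to run one inspector on one closed session.

    The URL and field names below are assumptions for illustration.
    """
    payload = json.dumps({"session_id": session_id, "inspector": inspector}).encode()
    return urllib.request.Request(
        "https://api.maev.example/v1/inspectors/run",  # hypothetical endpoint
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_trigger_request("your-api-key", "sess_123", "task-completion")
```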

Inspectors require an OpenAI API key configured on your account. Each inspector evaluation uses one LLM call against the session transcript.