# Assertions Reference
Assertions are checks that verify LLM behavior during Arena test scenarios. They run after each turn (or across the full session) and determine whether the response meets expectations.
## How Assertions Work

```mermaid
sequenceDiagram
    participant Scenario
    participant LLM
    participant Assertion
    participant Result
    Scenario->>LLM: User Turn
    LLM-->>Scenario: Assistant Response
    loop For Each Assertion
        Scenario->>Assertion: Check Response
        Assertion->>Assertion: Evaluate
        Assertion-->>Result: Pass/Fail + Details
    end
    Result->>Scenario: Aggregate Results
```
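The per-turn flow in the diagram can be sketched in a few lines of Python. This is illustrative only — the names (`CheckResult`, `run_assertions`) are hypothetical, not Arena's actual API:

```python
# Hypothetical sketch of the per-turn assertion loop; not Arena's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    type: str
    passed: bool
    details: str = ""

def run_assertions(response: str, checks: list) -> list:
    """Evaluate each (name, check_fn) pair against the response, collecting pass/fail."""
    return [CheckResult(name, fn(response)) for name, fn in checks]

results = run_assertions(
    "The capital of France is Paris.",
    [
        ("content_includes", lambda r: "paris" in r.lower()),
        ("content_excludes", lambda r: "london" not in r.lower()),
    ],
)
all_passed = all(r.passed for r in results)  # aggregate result for the turn
```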
## Assertion Structure

All assertions follow this structure:
```yaml
assertions:
  - type: assertion_name      # Required: Assertion type
    params:                   # Required: Type-specific parameters
      param1: value1
      param2: value2
    message: "Description"    # Optional: Human-readable description
    when:                     # Optional: Conditional filtering
      tool_called: "tool_name"
    pass_threshold: 0.8       # Optional: Required pass rate for trial runs (0.0-1.0)
```

**Fields:**

- `type`: The check type to use (see the Checks Reference for all available types)
- `params`: Parameters specific to the check type
- `message`: Optional description shown in reports
- `when`: Optional conditions that must be met for the assertion to run (see Conditional Filtering)
- `pass_threshold`: Optional pass rate threshold when using trials (default: `1.0` = all trials must pass)
## Common Assertion Types

The table below lists the most commonly used assertion types. For full details and parameters, see the Checks Reference.

| Type | Description |
|---|---|
| `content_includes` | Response contains specific text patterns (case-insensitive) |
| `content_excludes` | Response does not contain forbidden text |
| `content_matches` | Response matches a regular expression |
| `tools_called` | Specific tools were invoked during the turn |
| `tools_not_called` | Specific tools were not invoked |
| `llm_judge` | LLM evaluates response quality against criteria |
| `json_schema` | Response conforms to a JSON Schema |
| `no_tool_errors` | All tool calls completed without errors |
| `tool_call_chain` | Tools were called in a specific order |
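Several of these checks can be combined on a single turn. As a sketch (the tool name and patterns below are illustrative, not fixed values):

```yaml
- role: user
  content: "Get the weather in NYC"
  assertions:
    - type: tools_called
      params:
        tools: ["get_weather"]   # illustrative tool name
      message: "Should call the weather tool"
    - type: content_includes
      params:
        patterns: ["NYC"]
      message: "Should mention the requested city"
    - type: content_excludes
      params:
        patterns: ["I cannot"]
      message: "Should not refuse the request"
```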
## Session vs Turn-Level Assertions

Assertions can be scoped to a single turn or to the entire conversation session.

**Turn-level assertions** are declared on individual turns and check only that turn's response:

```yaml
turns:
  - role: user
    content: "What is the capital of France?"
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]
        message: "Should mention Paris"
```

**Session-level assertions** are declared at the scenario level using `conversation_assertions` and evaluate across the full conversation:

```yaml
conversation_assertions:
  - type: llm_judge_session
    params:
      criteria: "The assistant maintained a helpful tone throughout"
      min_score: 0.8
    message: "Overall tone check"
```

## Representative Examples
### Content check

```yaml
- role: user
  content: "What is the capital of France?"
  assertions:
    - type: content_includes
      params:
        patterns: ["Paris"]
      message: "Should mention Paris"
```

### Tool usage check

```yaml
- role: user
  content: "Get the weather in NYC"
  assertions:
    - type: tools_called
      params:
        tools: ["get_weather"]
      message: "Should call the weather tool"
```

### LLM judge

```yaml
- role: user
  content: "Explain quantum computing to a child"
  assertions:
    - type: llm_judge
      params:
        criteria: "The explanation is age-appropriate, avoids jargon, and uses analogies"
        min_score: 0.7
      message: "Should be understandable by a child"
```

### JSON Schema validation

```yaml
- role: user
  content: "Return the order details as JSON"
  assertions:
    - type: json_schema
      params:
        schema:
          type: object
          required: ["order_id", "status"]
          properties:
            order_id:
              type: string
            status:
              type: string
              enum: ["pending", "shipped", "delivered"]
      message: "Response should be valid order JSON"
```

## Conditional Filtering (`when`)

The optional `when` field on any assertion specifies preconditions that must be met for the assertion to run. If any condition is not met, the assertion is skipped (recorded as passed with `skipped: true`), not failed. This is particularly useful for cost control with expensive assertions like LLM judges.
### `when` Fields

| Field | Type | Description |
|---|---|---|
| `tool_called` | string | Assertion runs only if this exact tool was called in the turn |
| `tool_called_pattern` | string | Assertion runs only if a tool matching this regex was called |
| `any_tool_called` | boolean | Assertion runs only if at least one tool was called |
| `min_tool_calls` | integer | Assertion runs only if at least N tool calls were made |
All conditions are AND-ed: every specified field must be satisfied.
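A minimal sketch of this AND-ed evaluation, using the field names from the table above (the function itself is illustrative, not Arena's implementation):

```python
# Illustrative sketch of AND-ed `when` evaluation; not Arena's actual code.
from __future__ import annotations
import re

def when_allows(when: dict, tool_calls: list[str]) -> tuple[bool, str | None]:
    """Return (run, skip_reason). Every specified condition must hold (AND semantics)."""
    if "tool_called" in when and when["tool_called"] not in tool_calls:
        return False, f'tool "{when["tool_called"]}" not called'
    if "tool_called_pattern" in when and not any(
        re.search(when["tool_called_pattern"], t) for t in tool_calls
    ):
        return False, f'no tool matching "{when["tool_called_pattern"]}" called'
    if when.get("any_tool_called") and not tool_calls:
        return False, "no tools called"
    if "min_tool_calls" in when and len(tool_calls) < when["min_tool_calls"]:
        return False, f'fewer than {when["min_tool_calls"]} tool calls'
    return True, None

run, reason = when_allows({"tool_called": "search_papers"}, ["fetch_page"])
# run is False: the assertion would be recorded as passed with skipped: true
```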
### Example — Only judge when a specific tool was called

```yaml
- role: user
  content: "Search for recent papers on AI safety"
  assertions:
    - type: llm_judge_tool_calls
      when:
        tool_called: search_papers
      params:
        criteria: "Search queries should be well-formed and specific"
        min_score: 0.7
      message: "Search quality check"
```

If `search_papers` was not called in this turn, the assertion is skipped entirely and no judge LLM call is made.
### Example — Multiple conditions (AND)

```yaml
assertions:
  - type: tool_call_chain
    when:
      any_tool_called: true
      min_tool_calls: 2
    params:
      steps:
        - tool: lookup_customer
        - tool: process_order
    message: "Multi-step flow check (only when 2+ tools called)"
```

### Skip behavior

When a `when` condition is not met, the assertion result appears in reports as:

```json
{
  "type": "llm_judge_tool_calls",
  "passed": true,
  "skipped": true,
  "message": "Search quality check",
  "details": {
    "skip_reason": "tool \"search_papers\" not called"
  }
}
```

Note: In duplex/streaming paths where tool trace data is unavailable, `when` conditions pass unconditionally: the assertion runs, and the validator itself decides how to handle the missing trace (typically by skipping).
## Pass Threshold (Trial-Based Testing)

When running scenarios with multiple trials, `pass_threshold` controls how many trials must pass for the assertion to be considered successful overall:

```yaml
assertions:
  - type: content_includes
    params:
      patterns: ["recommendation"]
    pass_threshold: 0.8   # 80% of trials must pass
    message: "Should usually include a recommendation"
```

- Default: `1.0` (all trials must pass)
- Range: `0.0` to `1.0`
- Useful for non-deterministic LLM outputs where some variance is acceptable
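The threshold logic amounts to a simple pass-rate comparison. A sketch (illustrative, not Arena's code):

```python
# Illustrative sketch of trial aggregation against pass_threshold.
def assertion_passes(trial_results: list, pass_threshold: float = 1.0) -> bool:
    """Succeed overall when the fraction of passing trials meets the threshold."""
    if not trial_results:
        return False
    pass_rate = sum(trial_results) / len(trial_results)  # True counts as 1
    return pass_rate >= pass_threshold

# 4 of 5 trials passed: a 0.8 pass rate meets a 0.8 threshold but not the default 1.0
lenient = assertion_passes([True, True, True, True, False], pass_threshold=0.8)  # True
strict = assertion_passes([True, True, True, True, False])                       # False
```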
## Best Practices

- **Be specific** — Avoid vague patterns like `"help"`. Use meaningful text that signals correct behavior.
- **Always add messages** — The `message` field appears in reports and makes failures easy to diagnose.
- **Test both positive and negative cases** — Verify that the LLM calls the right tools and avoids calling the wrong ones.
- **Use the right assertion type** — Prefer `content_includes` for simple text checks; use `content_matches` only when you need regex.
- **Gate expensive assertions with `when`** — Place `when` conditions on LLM judge assertions to avoid unnecessary API calls.
- **Keep assertions per turn to 3-5** — Too many assertions make failures hard to diagnose and slow down execution.
- **Place cheap assertions before expensive ones** — Content checks and tool call checks should come before LLM judges.
## See Also

- Checks Reference — All check types and parameters
- Unified Check Model — How assertions, guardrails, and evals relate
- Guardrails Reference — Runtime policy enforcement
- Eval Framework — Production eval architecture