
# Assertions Reference

Assertions are checks that verify LLM behavior during Arena test scenarios. They run after each turn (or across the full session) and determine whether the response meets expectations.

```mermaid
sequenceDiagram
    participant Scenario
    participant LLM
    participant Assertion
    participant Result
    Scenario->>LLM: User Turn
    LLM-->>Scenario: Assistant Response
    loop For Each Assertion
        Scenario->>Assertion: Check Response
        Assertion->>Assertion: Evaluate
        Assertion-->>Result: Pass/Fail + Details
    end
    Result->>Scenario: Aggregate Results
```

## Assertion Structure

All assertions follow this structure:

```yaml
assertions:
  - type: assertion_name        # Required: Assertion type
    params:                     # Required: Type-specific parameters
      param1: value1
      param2: value2
    message: "Description"      # Optional: Human-readable description
    when:                       # Optional: Conditional filtering
      tool_called: "tool_name"
    pass_threshold: 0.8         # Optional: Required pass rate for trial runs (0.0-1.0)
```

Fields:

- `type`: The check type to use (see the Checks Reference for all available types)
- `params`: Parameters specific to the check type
- `message`: Optional description shown in reports
- `when`: Optional conditions that must be met for the assertion to run (see Conditional Filtering)
- `pass_threshold`: Optional pass rate threshold when using trials (default: 1.0 = all must pass)

## Common Assertion Types

The table below lists the most commonly used assertion types. For full details and parameters, see the Checks Reference.

| Type | Description |
| --- | --- |
| `content_includes` | Response contains specific text patterns (case-insensitive) |
| `content_excludes` | Response does not contain forbidden text |
| `content_matches` | Response matches a regular expression |
| `tools_called` | Specific tools were invoked during the turn |
| `tools_not_called` | Specific tools were not invoked |
| `llm_judge` | LLM evaluates response quality against criteria |
| `json_schema` | Response conforms to a JSON Schema |
| `no_tool_errors` | All tool calls completed without errors |
| `tool_call_chain` | Tools were called in a specific order |
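As an illustration, the semantics of the simple content checks can be sketched in Python. The function names here are hypothetical, not part of the framework's API:

```python
import re

def content_includes(response: str, patterns: list[str]) -> bool:
    """Pass if every pattern appears in the response (case-insensitive)."""
    text = response.lower()
    return all(p.lower() in text for p in patterns)

def content_excludes(response: str, patterns: list[str]) -> bool:
    """Pass if no forbidden pattern appears in the response."""
    text = response.lower()
    return not any(p.lower() in text for p in patterns)

def content_matches(response: str, pattern: str) -> bool:
    """Pass if the regex matches anywhere in the response."""
    return re.search(pattern, response) is not None
```

For example, `content_includes("The capital of France is Paris.", ["paris"])` passes because matching is case-insensitive, whereas `content_matches` applies the regex as written.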

## Assertion Scope

Assertions can be scoped to a single turn or to the entire conversation session.

Turn-level assertions are declared on individual turns and check only that turn’s response:

```yaml
turns:
  - role: user
    content: "What is the capital of France?"
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]
        message: "Should mention Paris"
```

Session-level assertions are declared at the scenario level using conversation_assertions and evaluate across the full conversation:

```yaml
conversation_assertions:
  - type: llm_judge_session
    params:
      criteria: "The assistant maintained a helpful tone throughout"
      min_score: 0.8
    message: "Overall tone check"

turns:
  - role: user
    content: "What is the capital of France?"
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]
        message: "Should mention Paris"
  - role: user
    content: "Get the weather in NYC"
    assertions:
      - type: tools_called
        params:
          tools: ["get_weather"]
        message: "Should call the weather tool"
  - role: user
    content: "Explain quantum computing to a child"
    assertions:
      - type: llm_judge
        params:
          criteria: "The explanation is age-appropriate, avoids jargon, and uses analogies"
          min_score: 0.7
        message: "Should be understandable by a child"
  - role: user
    content: "Return the order details as JSON"
    assertions:
      - type: json_schema
        params:
          schema:
            type: object
            required: ["order_id", "status"]
            properties:
              order_id:
                type: string
              status:
                type: string
                enum: ["pending", "shipped", "delivered"]
        message: "Response should be valid order JSON"
```
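To illustrate what the `json_schema` check enforces in the order example, here is a hand-rolled validation of the required fields and the status enum. A real check would use a JSON Schema validator; this helper is purely illustrative:

```python
import json

def check_order_json(payload: dict) -> list[str]:
    """Collect violations of the order schema: required keys and status enum."""
    errors = []
    for key in ("order_id", "status"):
        if key not in payload:
            errors.append(f"missing required field: {key}")
    if "status" in payload and payload["status"] not in ("pending", "shipped", "delivered"):
        errors.append(f"invalid status: {payload['status']}")
    return errors

# A conforming response produces no violations.
print(check_order_json(json.loads('{"order_id": "A-1", "status": "shipped"}')))  # []
```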

## Conditional Filtering

The optional `when` field on any assertion specifies preconditions that must be met for the assertion to run. If any condition is not met, the assertion is skipped (recorded as passed with `skipped: true`) — not failed. This is particularly useful for cost control with expensive assertions like LLM judges.

| Field | Type | Description |
| --- | --- | --- |
| `tool_called` | string | Assertion runs only if this exact tool was called in the turn |
| `tool_called_pattern` | string | Assertion runs only if a tool matching this regex was called |
| `any_tool_called` | boolean | Assertion runs only if at least one tool was called |
| `min_tool_calls` | integer | Assertion runs only if at least N tool calls were made |

All conditions are AND-ed: every specified field must be satisfied.
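A minimal sketch of this evaluation logic in Python (the helper name and signature are assumptions, not the framework's internals):

```python
import re

def when_is_met(when: dict, tool_calls: list[str]) -> bool:
    """Every condition present in `when` must hold; absent fields are ignored."""
    if "tool_called" in when and when["tool_called"] not in tool_calls:
        return False
    if "tool_called_pattern" in when and not any(
        re.search(when["tool_called_pattern"], name) for name in tool_calls
    ):
        return False
    if when.get("any_tool_called") and not tool_calls:
        return False
    if "min_tool_calls" in when and len(tool_calls) < when["min_tool_calls"]:
        return False
    return True
```

For instance, `when_is_met({"tool_called": "search_papers"}, [])` is false, so the gated assertion is skipped.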

### Example — Only judge when a specific tool was called

```yaml
- role: user
  content: "Search for recent papers on AI safety"
  assertions:
    - type: llm_judge_tool_calls
      when:
        tool_called: search_papers
      params:
        criteria: "Search queries should be well-formed and specific"
        min_score: 0.7
      message: "Search quality check"
```

If search_papers was not called in this turn, the assertion is skipped entirely — no judge LLM call is made.

Conditions can also be combined; this assertion runs only when at least two tools were called:

```yaml
assertions:
  - type: tool_call_chain
    when:
      any_tool_called: true
      min_tool_calls: 2
    params:
      steps:
        - tool: lookup_customer
        - tool: process_order
    message: "Multi-step flow check (only when 2+ tools called)"
```
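A rough sketch of what an ordered-chain check might look like, assuming unrelated calls may interleave between steps (the framework's actual matching rules may be stricter):

```python
def chain_in_order(steps: list[str], calls: list[str]) -> bool:
    """True if every step appears in `calls` in order; other calls may interleave."""
    remaining = iter(calls)
    return all(step in remaining for step in steps)

# The chain holds even with an unrelated call in between.
print(chain_in_order(["lookup_customer", "process_order"],
                     ["lookup_customer", "get_inventory", "process_order"]))  # True
```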

When a when condition is not met, the assertion result appears in reports as:

```json
{
  "type": "llm_judge_tool_calls",
  "passed": true,
  "skipped": true,
  "message": "Search quality check",
  "details": {
    "skip_reason": "tool \"search_papers\" not called"
  }
}
```

Note: In duplex/streaming paths where tool trace data is unavailable, when conditions pass unconditionally — the assertion runs and the validator itself decides how to handle the missing trace (typically by skipping).

## Pass Thresholds with Trials

When running scenarios with multiple trials, `pass_threshold` controls how many trials must pass for the assertion to be considered successful overall:

```yaml
assertions:
  - type: content_includes
    params:
      patterns: ["recommendation"]
    pass_threshold: 0.8   # 80% of trials must pass
    message: "Should usually include a recommendation"
```

- Default: 1.0 (all trials must pass)
- Range: 0.0 to 1.0
- Useful for non-deterministic LLM outputs where some variance is acceptable
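The threshold arithmetic is simple; a minimal sketch (the helper name is hypothetical, not the framework's API):

```python
def trials_pass(results: list[bool], pass_threshold: float = 1.0) -> bool:
    """Overall pass when the fraction of passing trials meets the threshold."""
    return sum(results) / len(results) >= pass_threshold

print(trials_pass([True, True, True, True, False], pass_threshold=0.8))   # True  (4/5 = 0.8)
print(trials_pass([True, True, True, False, False], pass_threshold=0.8))  # False (3/5 = 0.6)
```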
## Best Practices

1. **Be specific** — Avoid vague patterns like "help". Use meaningful text that signals correct behavior.

2. **Always add messages** — The `message` field appears in reports and makes failures easy to diagnose.

3. **Test both positive and negative cases** — Verify that the LLM calls the right tools and avoids calling wrong ones.

4. **Use the right assertion type** — Prefer `content_includes` for simple text checks; use `content_matches` only when you need regex.

5. **Gate expensive assertions with `when`** — Place `when` conditions on LLM judge assertions to avoid unnecessary API calls.

6. **Keep assertions per turn to 3-5** — Too many assertions make failures hard to diagnose and slow down execution.

7. **Place cheap assertions before expensive ones** — Content checks and tool call checks should come before LLM judges.