Checks Reference

PromptKit has a unified check system: one set of check types usable across three surfaces.

Assertions: validate LLM behavior in Arena test scenarios (`assertions:` field in scenario YAML).
Guardrails: enforce runtime policy in production (`validators:` field in pack YAML).
Evals: monitor quality in production (`evals:` field in pack YAML).

All check types are implemented as `EvalTypeHandler` instances registered in the `runtime/evals/` package. See the Unified Check Model for the conceptual overview.

Surface legend used in the tables below:

Symbol	Surface
A	Assertion
G	Guardrail
E	Eval

“Streaming” indicates whether the check supports incremental evaluation during streaming responses (relevant for guardrails).

Content Checks

Type	Aliases	Params	Surfaces	Streaming
`contains`	`content_includes`	`patterns` (string[])	A G E	No
`regex`	`content_matches`	`pattern` (string)	A G E	No
`content_excludes`	`banned_words`, `content_not_includes`	`patterns` (string[])	A G E	Yes
`contains_any`	`content_includes_any`	`patterns` (string[])	A E	No
`min_length`	—	`min` or `min_characters` (int)	A E	No
`max_length`	`length`	`max` or `max_characters` (int), `max_tokens` (int)	A G E	Yes
`sentence_count`	`max_sentences`	`max` or `max_sentences` (int)	A G E	No
`field_presence`	`required_fields`	`fields` or `required_fields` (string[])	A G E	No
`cosine_similarity`	—	`reference` (string), `min_similarity` (float)	A E	No

When `content_excludes` is invoked via the `banned_words` alias, `match_mode` defaults to `word_boundary`.
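To get the same whole-word matching without the alias, the mode can presumably be set explicitly. This is a minimal sketch, assuming `match_mode` is passed under `params` like the other options; that placement is an assumption, and the patterns are placeholders.

```yaml
validators:
  - type: content_excludes
    params:
      patterns: ["competitor-name", "internal-only"]
      # Assumed placement: match_mode under params. word_boundary is the
      # value the banned_words alias applies by default.
      match_mode: word_boundary
```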

Example — assertion (scenario YAML):

assertions:
  - type: contains
    params:
      patterns: ["thank you", "welcome"]

Example — guardrail (pack YAML):

validators:
  - type: banned_words
    params:
      patterns: ["competitor-name", "internal-only"]

Example — eval (pack YAML):

evals:
  - id: response_length
    type: max_length
    trigger: every_turn
    params:
      max: 500

JSON & Structure Checks

Type	Aliases	Params	Surfaces
`json_valid`	`is_valid_json`, `valid_json`	—	A E
`json_schema`	—	`schema` (object)	A E
`json_path`	—	`expression` (string), `expected`, `contains`, `min_results`, `max_results`	A E

Example:

assertions:
  - type: json_path
    params:
      expression: "$.order.status"
      expected: "confirmed"
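
A `json_schema` assertion might look like the sketch below. It assumes `schema` accepts a standard JSON Schema document written inline as YAML; the field names are placeholders chosen to mirror the `json_path` example.

```yaml
assertions:
  - type: json_schema
    params:
      # Illustrative schema: requires an "order" object with a string "status".
      schema:
        type: object
        required: ["order"]
        properties:
          order:
            type: object
            required: ["status"]
            properties:
              status:
                type: string
```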

Tool Checks (Turn-Level)

These checks evaluate tool usage within a single assistant turn.

Type	Aliases	Params	Surfaces
`tools_called`	`tool_called`	`tool_names` (string[]), `min_calls` (int, default 1), `ignore_validation` (bool), `require_args` (bool)	A G E
`tools_not_called`	—	`tool_names` (string[])	A G E
`tool_args`	—	`tool_name` (string), `expected_args` (object)	A E
`tool_calls_with_args`	—	`tool_name`, `expected_args`, `result_includes`	A E
`tool_call_count`	—	`tool` (string), `min` (int), `max` (int)	A E
`tool_call_sequence`	—	`sequence` (string[])	A E
`tool_call_chain`	—	`chain` (string[])	A E
`tool_anti_pattern`	—	`patterns` (array of `{sequence, message}`)	A E
`tool_no_repeat`	—	`tools` (string[]), `max_repeats` (int)	E
`tool_efficiency`	—	`max_calls`, `max_errors`, `max_error_rate`	E
`no_tool_errors`	—	—	A E
`tool_result_includes`	—	`tool_name`, `patterns` (string[])	A E
`tool_result_matches`	—	`tool_name`, `pattern` (string)	A E
`tool_result_has_media`	—	`tool_name`	E
`tool_result_media_type`	—	`tool_name`, `media_type`	E

Example:

assertions:
  - type: tool_call_sequence
    params:
      sequence: ["lookup_customer", "create_ticket"]
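
The `tool_anti_pattern` check takes a list of `{sequence, message}` entries. A hedged sketch, assuming each `sequence` names tools in an order that should not occur and `message` is reported when it does (the tool names are reused from the example above for illustration):

```yaml
assertions:
  - type: tool_anti_pattern
    params:
      patterns:
        # Assumed semantics: flag a ticket being created before the lookup.
        - sequence: ["create_ticket", "lookup_customer"]
          message: "Ticket created before the customer was looked up"
```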

Tool Checks (Session-Level)

These checks evaluate tool usage across the entire session.

Type	Aliases	Params	Surfaces
`tools_called` (session)	`tool_called`	`tool_names` (string[])	A G E
`tools_not_called` (session)	—	`tool_names` (string[])	A G E
`tool_args` (session)	—	`tool_name`, `expected_args`	A E
`tool_args_excluded_session`	`tools_not_called_with_args`	`tool_name`, `excluded_args`	A E

Session-level tool checks use the `on_session_complete` or `on_conversation_complete` trigger.
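
As a rough illustration, a session-level eval could be declared as follows. This sketch assumes the session variant is selected by the trigger while the type name stays the same as the turn-level check; the `id` and the tool name are hypothetical.

```yaml
evals:
  - id: no_destructive_tools          # hypothetical eval id
    type: tools_not_called
    trigger: on_session_complete      # evaluate once, over the whole session
    params:
      tool_names: ["delete_account"]  # hypothetical tool name
```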

Tool Invocation Checks

Unlike the tool checks above (which evaluate tools the agent already called), this check invokes a tool itself and asserts on the result. A typical use is to run a verification tool (a sandbox’s `run_tests`, a render-and-diff utility, a custom HTTP probe) as a hard gate after the conversation completes.

`tool_exec`

Invokes a registered tool by name through the runtime tool registry and passes if the call succeeds. Both of the following conditions must hold:

`tools.Registry.Execute` returns no error, and
the resulting `ToolResult.Error` field is empty.

This makes `tool_exec` a generic “is this tool happy” gate that works with any registered tool: MCP-discovered (e.g. a sandbox’s `run_tests`), HTTP, local executors, or custom client tools. The handler doesn’t know or care about the transport.

Param	Type	Required	Description
`tool`	string	Yes	Registry name of the tool to invoke.
`args`	object	No	Arguments passed verbatim to the tool. Defaults to `{}`.
`timeout_seconds`	int	No	Per-call timeout. Default `120`. Generous because the typical use case is a long-running test suite inside a sandbox.

Surfaces: A E (conversation-level / session-level — invoke at the end of the session, not per turn)

Example — gating on a sandbox’s hidden test suite:

conversation_assertions:
  - type: tool_exec
    params:
      tool: run_tests
    message: "Hidden tests must pass"

Pair this with a source-backed MCP entry that supplies the run_tests tool — the sandbox lives for the session, runs the agent’s edits, and the gate checks them at the end.

Example — pack-shipped validation tool:

conversation_assertions:
  - type: tool_exec
    params:
      tool: validate_invoice
      args:
        strict: true
      timeout_seconds: 30
    message: "Final invoice must validate"

Notes

The host (arena, SDK, …) must inject a `*tools.Registry` into `EvalContext.Metadata["tool_registry"]`. Arena does this automatically; SDK consumers using the runtime evals API directly need to populate it themselves.
Because the gate calls a tool, it counts toward whatever cost / side-effect budget the tool implies (e.g. running the test suite costs CPU time, an HTTP probe costs a request).
Errors returned by the tool appear in the assertion’s `Explanation`, so failures can be debugged from the report without re-running.

Agent & Skill Checks

Type	Params	Surfaces
`agent_invoked`	`agent_names` (string[])	A E
`agent_not_invoked`	`agent_names` (string[])	A E
`agent_response_contains`	`agent_name`, `patterns`	E
`skill_activated`	`skill_names` (string[])	A E
`skill_not_activated`	`skill_names` (string[])	A E
`skill_activation_order`	`sequence` (string[])	A E

Example:

assertions:
  - type: agent_invoked
    params:
      agent_names: ["billing-agent"]

Workflow Checks

Type	Params	Surfaces
`workflow_complete`	—	A E
`workflow_state_is`	`state` (string)	A E
`workflow_transitioned_to`	`state` (string)	A E
`workflow_transition_order`	`sequence` (string[])	A E
`workflow_tool_access`	`rules` (array of `{state, allowed}`)	A E

Example:

assertions:
  - type: workflow_transition_order
    params:
      sequence: ["triage", "investigation", "resolution"]
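
The `workflow_tool_access` check takes a list of `{state, allowed}` rules. A minimal sketch, assuming `allowed` is the list of tool names permitted while the workflow is in `state` (state and tool names are reused from earlier examples for illustration):

```yaml
assertions:
  - type: workflow_tool_access
    params:
      rules:
        # Assumed semantics: only these tools may be called in each state.
        - state: triage
          allowed: ["lookup_customer"]
        - state: resolution
          allowed: ["create_ticket"]
```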

Media Checks

Type	Params	Surfaces
`image_format`	`formats` (string[])	A E
`image_dimensions`	`min_width`, `max_width`, `min_height`, `max_height`	A E
`audio_format`	`formats` (string[])	A E
`audio_duration`	`min_seconds`, `max_seconds`	A E
`video_duration`	`min_seconds`, `max_seconds`	A E
`video_resolution`	`min_width`, `max_width`, `presets` (string[])	A E
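
Example, as a hedged sketch: the format strings and the assumption that dimensions are in pixels are illustrative and not confirmed by the table above.

```yaml
assertions:
  - type: image_format
    params:
      formats: ["png", "jpeg"]   # assumed format identifiers
  - type: image_dimensions
    params:
      # Assumed to be pixel bounds.
      min_width: 512
      min_height: 512
      max_width: 2048
      max_height: 2048
```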

LLM Judge Checks

LLM judge checks send the assistant output (or full session) to a language model for evaluation. The judge returns a score (0.0 to 1.0) and reasoning.

`llm_judge`

Turn-level LLM evaluation. The judge sees the current assistant response and evaluates it against the provided criteria.

Param	Type	Required	Description
`criteria`	string	Yes	What the judge should evaluate
`rubric`	string	No	Detailed scoring guidance
`model`	string	No	Model to use for judging
`system_prompt`	string	No	Override the default judge system prompt
`min_score`	float	No	Minimum score threshold for passing
`extra`	object	No	Additional provider-specific parameters

Surfaces: A E

`llm_judge_session`

Session-level LLM evaluation. The judge sees the full conversation. Alias: `llm_judge_conversation`.

Same params as `llm_judge`. Surfaces: A E

`llm_judge_tool_calls`

Evaluates tool usage patterns via an LLM judge.

Param	Type	Required	Description
`criteria`	string	Yes	What the judge should evaluate about tool usage
`tools`	string[]	No	Filter to specific tools

Also accepts all standard judge params (`rubric`, `model`, `system_prompt`, `min_score`, `extra`). Surfaces: A E

Example:

assertions:
  - type: llm_judge
    params:
      criteria: "Response is empathetic and addresses the customer's concern"
      min_score: 0.7
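
The tool-usage judge follows the same shape, with the optional `tools` filter added. A minimal sketch; the criteria text, tool names, and threshold are illustrative only.

```yaml
assertions:
  - type: llm_judge_tool_calls
    params:
      criteria: "Tools are called only when needed, with arguments that match the customer's request"
      tools: ["lookup_customer", "create_ticket"]   # restrict judging to these tools
      min_score: 0.7
```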

RAG Checks

RAG checks are named eval primitives for retrieval-augmented generation. They score the answer against the retrieved context (`faithfulness`, `hallucination`), the answer against the question (`answer_relevancy`), or the retrieved chunks against the question / ground truth (`contextual_precision`, `contextual_recall`, `contextual_relevancy`).

Each handler is a thin wrapper over `llm_judge` with a hardened default prompt drawn from public DeepEval / Ragas reference implementations (Apache 2.0). The standard judge params (`rubric`, `model`, `system_prompt`, `min_score`, `extra`) all apply; supplying `system_prompt` or `criteria` overrides the default.

Context sources — every handler that needs retrieved chunks accepts them in three forms:

Form	Example
`contexts: ["chunk-1", "chunk-2"]`	Canonical list form
`context: "single chunk"`	Convenience form for one chunk
`context_field: retrieved_chunks`	Looks up the named key in `evalCtx.Metadata` — use this when a retrieval tool writes chunks to metadata at runtime
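
The sketch below shows two of these forms with the `faithfulness` check. It assumes the context params sit under `params` alongside the standard judge params; the chunk text and threshold are made up for illustration.

```yaml
assertions:
  # Canonical list form: chunks supplied inline.
  - type: faithfulness
    params:
      contexts:
        - "Refunds are available within 30 days of purchase."
        - "Refunds are issued to the original payment method."
      min_score: 0.7

  # Metadata lookup form: chunks are read at runtime from the named key.
  - type: faithfulness
    params:
      context_field: retrieved_chunks
      min_score: 0.7
```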

`faithfulness`

Scores how well the answer is supported by the supplied context. It shares its name with the DeepEval / Ragas faithfulness metric.

Param	Type	Required	Description
`contexts` | `context` | `context_field`	string[] / string / string	Yes (one of)	Retrieved context the answer should be grounded in