Checks Reference
PromptKit has a unified check system: one set of check types usable across three surfaces.
- Assertions: validate LLM behavior in Arena test scenarios (assertions: field in scenario YAML).
- Guardrails: enforce runtime policy in production (validators: field in pack YAML).
- Evals: monitor quality in production (evals: field in pack YAML).
All check types are implemented as EvalTypeHandler instances registered in the runtime/evals/ package. See the Unified Check Model for the conceptual overview.
Surface legend used in the tables below:
| Symbol | Surface |
|---|---|
| A | Assertion |
| G | Guardrail |
| E | Eval |
“Streaming” indicates whether the check supports incremental evaluation during streaming responses (relevant for guardrails).
Content Checks
| Type | Aliases | Params | Surfaces | Streaming |
|---|---|---|---|---|
| contains | content_includes | patterns (string[]) | A G E | No |
| regex | content_matches | pattern (string) | A G E | No |
| content_excludes | banned_words, content_not_includes | patterns (string[]) | A G E | Yes |
| contains_any | content_includes_any | patterns (string[]) | A E | No |
| min_length | - | min or min_characters (int) | A E | No |
| max_length | length | max or max_characters (int), max_tokens (int) | A G E | Yes |
| sentence_count | max_sentences | max or max_sentences (int) | A G E | No |
| field_presence | required_fields | fields or required_fields (string[]) | A G E | No |
| cosine_similarity | - | reference (string), min_similarity (float) | A E | No |
When content_excludes is invoked via the banned_words alias, match_mode defaults to word_boundary.
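The difference between plain substring matching and a word-boundary mode can be sketched in Go as follows. This is a minimal illustration, not PromptKit's implementation: the function names are hypothetical, and it assumes word_boundary means the pattern must be delimited by non-word characters and that matching is case-sensitive.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchSubstring reports whether pattern occurs anywhere in content.
func matchSubstring(content, pattern string) bool {
	return strings.Contains(content, pattern)
}

// matchWordBoundary reports whether pattern occurs as a whole word,
// i.e. with a \b word boundary on both sides. QuoteMeta escapes regex
// metacharacters so the pattern is matched literally.
// (Hypothetical helper; the real match_mode logic may differ.)
func matchWordBoundary(content, pattern string) bool {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(pattern) + `\b`)
	return re.MatchString(content)
}

func main() {
	content := "Please concatenate the files."
	fmt.Println(matchSubstring(content, "cat"))           // true: "cat" occurs inside "concatenate"
	fmt.Println(matchWordBoundary(content, "cat"))        // false: no standalone word "cat"
	fmt.Println(matchWordBoundary("The cat sat.", "cat")) // true: whole-word match
}
```

Under these assumptions, a word-boundary default keeps a banned-word list from firing on innocent words that merely contain a banned term.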
Example (assertion, scenario YAML):

```yaml
assertions:
  - type: contains
    params:
      patterns: ["thank you", "welcome"]
```

Example (guardrail, pack YAML):

```yaml
validators:
  - type: banned_words
    params:
      patterns: ["competitor-name", "internal-only"]
```

Example (eval, pack YAML):

```yaml
evals:
  - id: response_length
    type: max_length
    trigger: every_turn
    params:
      max: 500
```

JSON & Structure Checks
| Type | Aliases | Params | Surfaces |
|---|---|---|---|
| json_valid | is_valid_json, valid_json | - | A E |
| json_schema | - | schema (object) | A E |
| json_path | - | expression (string), expected, contains, min_results, max_results | A E |
Example:

```yaml
assertions:
  - type: json_path
    params:
      expression: "$.order.status"
      expected: "confirmed"
```

Tool Checks (Turn-Level)
These checks evaluate tool usage within a single assistant turn.
| Type | Aliases | Params | Surfaces |
|---|---|---|---|
| tools_called | tool_called | tool_names (string[]), min_calls (int, default 1), ignore_validation (bool), require_args (bool) | A G E |
| tools_not_called | - | tool_names (string[]) | A G E |
| tool_args | - | tool_name (string), expected_args (object) | A E |
| tool_calls_with_args | - | tool_name, expected_args, result_includes | A E |
| tool_call_count | - | tool (string), min (int), max (int) | A E |
| tool_call_sequence | - | sequence (string[]) | A E |
| tool_call_chain | - | chain (string[]) | A E |
| tool_anti_pattern | - | patterns (array of {sequence, message}) | A E |
| tool_no_repeat | - | tools (string[]), max_repeats (int) | E |
| tool_efficiency | - | max_calls, max_errors, max_error_rate | E |
| no_tool_errors | - | - | A E |
| tool_result_includes | - | tool_name, patterns (string[]) | A E |
| tool_result_matches | - | tool_name, pattern (string) | A E |
| tool_result_has_media | - | tool_name | E |
| tool_result_media_type | - | tool_name, media_type | E |
Example:

```yaml
assertions:
  - type: tool_call_sequence
    params:
      sequence: ["lookup_customer", "create_ticket"]
```

Tool Checks (Session-Level)
These checks evaluate tool usage across the entire session.
| Type | Aliases | Params | Surfaces |
|---|---|---|---|
| tools_called (session) | tool_called | tool_names (string[]) | A G E |
| tools_not_called (session) | - | tool_names (string[]) | A G E |
| tool_args (session) | - | tool_name, expected_args | A E |
| tool_args_excluded_session | tools_not_called_with_args | tool_name, excluded_args | A E |
Session-level tool checks use the on_session_complete or on_conversation_complete trigger.
Tool Invocation Checks
Unlike the tool checks above (which evaluate tools the agent already
called), this check invokes a tool itself and asserts on the
result. Typical use is to run a verification tool (a sandbox’s
run_tests, a render-and-diff utility, a custom HTTP probe) as the
hard gate after the conversation completes.
tool_exec
Invokes a registered tool by name through the runtime tool registry and passes if the call succeeded. The pass condition is:
- tools.Registry.Execute returns no error, and
- the resulting ToolResult.Error field is empty.
This makes tool_exec a generic “is this tool happy” gate that works
with any registered tool — MCP-discovered (e.g. a sandbox’s
run_tests), HTTP, local executors, custom client tools. The handler
doesn’t know or care about the transport.
| Param | Type | Required | Description |
|---|---|---|---|
| tool | string | Yes | Registry name of the tool to invoke. |
| args | object | No | Arguments passed verbatim to the tool. Defaults to {}. |
| timeout_seconds | int | No | Per-call timeout. Default 120. Generous because the typical use case is a long-running test suite inside a sandbox. |
Surfaces: A E (conversation-level / session-level — invoke at the end of the session, not per turn)
Example — gating on a sandbox’s hidden test suite:

```yaml
conversation_assertions:
  - type: tool_exec
    params:
      tool: run_tests
    message: "Hidden tests must pass"
```

Pair this with a source-backed MCP entry that supplies the run_tests tool: the sandbox lives for the session, runs the agent’s edits, and the gate checks them at the end.
Example — pack-shipped validation tool:

```yaml
conversation_assertions:
  - type: tool_exec
    params:
      tool: validate_invoice
      args:
        strict: true
      timeout_seconds: 30
    message: "Final invoice must validate"
```

Notes
- The host (arena, SDK, …) must inject a `*tools.Registry` into EvalContext.Metadata["tool_registry"]. Arena does this automatically; SDK consumers using the runtime evals API directly need to populate it themselves.
- Because the gate calls a tool, it counts toward whatever cost / side-effect budget the tool implies (e.g. running the test suite costs CPU time, an HTTP probe costs a request).
- Errors from the tool surface in the assertion’s Explanation so failures are debuggable from the report without re-running.
Agent & Skill Checks
| Type | Params | Surfaces |
|---|---|---|
| agent_invoked | agent_names (string[]) | A E |
| agent_not_invoked | agent_names (string[]) | A E |
| agent_response_contains | agent_name, patterns | E |
| skill_activated | skill_names (string[]) | A E |
| skill_not_activated | skill_names (string[]) | A E |
| skill_activation_order | sequence (string[]) | A E |
Example:

```yaml
assertions:
  - type: agent_invoked
    params:
      agent_names: ["billing-agent"]
```

Workflow Checks
| Type | Params | Surfaces |
|---|---|---|
| workflow_complete | - | A E |
| workflow_state_is | state (string) | A E |
| workflow_transitioned_to | state (string) | A E |
| workflow_transition_order | sequence (string[]) | A E |
| workflow_tool_access | rules (array of {state, allowed}) | A E |
Example:

```yaml
assertions:
  - type: workflow_transition_order
    params:
      sequence: ["triage", "investigation", "resolution"]
```

Media Checks
| Type | Params | Surfaces |
|---|---|---|
| image_format | formats (string[]) | A E |
| image_dimensions | min_width, max_width, min_height, max_height | A E |
| audio_format | formats (string[]) | A E |
| audio_duration | min_seconds, max_seconds | A E |
| video_duration | min_seconds, max_seconds | A E |
| video_resolution | min_width, max_width, presets (string[]) | A E |
LLM Judge Checks
LLM judge checks send the assistant output (or full session) to a language model for evaluation. The judge returns a score (0.0 to 1.0) and reasoning.
llm_judge
Turn-level LLM evaluation. The judge sees the current assistant response and evaluates it against the provided criteria.
| Param | Type | Required | Description |
|---|---|---|---|
| criteria | string | Yes | What the judge should evaluate |
| rubric | string | No | Detailed scoring guidance |
| model | string | No | Model to use for judging |
| system_prompt | string | No | Override the default judge system prompt |
| min_score | float | No | Minimum score threshold for passing |
| extra | object | No | Additional provider-specific parameters |
Surfaces: A E
llm_judge_session
Session-level LLM evaluation. The judge sees the full conversation. Alias: llm_judge_conversation.
Same params as llm_judge. Surfaces: A E
llm_judge_tool_calls
Evaluates tool usage patterns via an LLM judge.
| Param | Type | Required | Description |
|---|---|---|---|
| criteria | string | Yes | What the judge should evaluate about tool usage |
| tools | string[] | No | Filter to specific tools |
Plus all standard judge params (rubric, model, system_prompt, min_score, extra). Surfaces: A E
Example:

```yaml
assertions:
  - type: llm_judge
    params:
      criteria: "Response is empathetic and addresses the customer's concern"
      min_score: 0.7
```

RAG Checks
RAG checks are named eval primitives for retrieval-augmented generation: they score the answer against retrieved context (faithfulness, hallucination), the answer against the question (answer_relevancy), or the retrieved chunks against the question / ground truth (contextual_precision, contextual_recall, contextual_relevancy).
Each handler is a thin wrapper over llm_judge with a hardened default prompt drawn from public DeepEval / Ragas reference implementations (Apache 2.0). The standard judge params (rubric, model, system_prompt, min_score, extra) all apply; supplying system_prompt or criteria overrides the default.
Context sources — every handler that needs retrieved chunks accepts them in three forms:
| Form | Example |
|---|---|
| contexts: ["chunk-1", "chunk-2"] | Canonical list form |
| context: "single chunk" | Convenience form for one chunk |
| context_field: retrieved_chunks | Looks up the named key in evalCtx.Metadata. Use this when a retrieval tool writes chunks to metadata at runtime |
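As a rough illustration of how the three forms could collapse into one list of chunks, here is a small Go sketch. The helper names are hypothetical, and the precedence (contexts, then context, then context_field) is an assumption made for the example; this reference does not state which form wins when several are supplied.

```go
package main

import "fmt"

// toStrings converts a decoded YAML/JSON value into a string slice.
// It accepts []string, []any with string elements, or a single string.
func toStrings(v any) []string {
	switch t := v.(type) {
	case []string:
		return t
	case []any:
		out := make([]string, 0, len(t))
		for _, e := range t {
			if s, ok := e.(string); ok {
				out = append(out, s)
			}
		}
		return out
	case string:
		return []string{t}
	}
	return nil
}

// resolveContexts returns the retrieved chunks from whichever form is
// present. The lookup order here is an assumption, not documented behavior.
func resolveContexts(params, metadata map[string]any) []string {
	if v, ok := params["contexts"]; ok {
		return toStrings(v)
	}
	if v, ok := params["context"]; ok {
		return toStrings(v)
	}
	if key, ok := params["context_field"].(string); ok {
		// Chunks written to metadata at runtime, e.g. by a retrieval tool.
		return toStrings(metadata[key])
	}
	return nil
}

func main() {
	metadata := map[string]any{"retrieved_chunks": []string{"doc A", "doc B"}}

	fmt.Println(resolveContexts(map[string]any{"contexts": []any{"chunk-1", "chunk-2"}}, metadata))
	fmt.Println(resolveContexts(map[string]any{"context": "single chunk"}, metadata))
	fmt.Println(resolveContexts(map[string]any{"context_field": "retrieved_chunks"}, metadata))
}
```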
faithfulness
Scores how directly the answer is supported by the supplied context. Equivalent in name to DeepEval / Ragas faithfulness.
| Param | Type | Required | Description |
|---|---|---|---|
| contexts, context, or context_field | string[] / string / string | Yes (one of) | Retrieved context the answer should be grounded in |
Plus standard judge params. Surfaces: A E

```yaml
assertions:
  - type: faithfulness
    params:
      context_field: retrieved_chunks
      min_score: 0.8
```

answer_relevancy
Scores how directly the answer addresses the user’s question. Equivalent in name to DeepEval / Ragas answer_relevancy.
| Param | Type | Required | Description |
|---|---|---|---|
| question | string | No | Defaults to the last user turn in the session |
Plus standard judge params. Surfaces: A E
contextual_precision
Scores the fraction of retrieved chunks that are relevant to the question (relevant chunks / total chunks). Equivalent in name to DeepEval contextual_precision.
| Param | Type | Required | Description |
|---|---|---|---|
| contexts, context, or context_field | - | Yes | Retrieved chunks |
| question | string | No | Defaults to the last user turn |
Plus standard judge params. Surfaces: A E
contextual_recall
Scores how completely the retrieved chunks cover the information the ground-truth answer relies on. Equivalent in name to DeepEval / Ragas contextual_recall.
| Param | Type | Required | Description |
|---|---|---|---|
| contexts, context, or context_field | - | Yes | Retrieved chunks |
| reference or expected_output | string | Yes | Ground-truth answer |
Plus standard judge params. Surfaces: A E
contextual_relevancy
Scores the mean per-chunk relevance of retrieved chunks to the question (distinct from contextual_precision: precision is binary relevant/not; relevancy is the mean of graded scores). Equivalent in name to DeepEval contextual_relevancy.
| Param | Type | Required | Description |
|---|---|---|---|
| contexts, context, or context_field | - | Yes | Retrieved chunks |
| question | string | No | Defaults to the last user turn |
Plus standard judge params. Surfaces: A E
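The distinction between contextual_precision (a fraction of binary verdicts) and contextual_relevancy (a mean of graded scores) comes down to simple arithmetic, sketched below. In practice the per-chunk verdicts and grades come from the LLM judge; the function names and sample values here are hypothetical and only illustrate the aggregation step.

```go
package main

import "fmt"

// contextualPrecision returns relevant chunks / total chunks,
// given one binary relevant-or-not verdict per chunk.
func contextualPrecision(relevant []bool) float64 {
	if len(relevant) == 0 {
		return 0
	}
	n := 0
	for _, r := range relevant {
		if r {
			n++
		}
	}
	return float64(n) / float64(len(relevant))
}

// contextualRelevancy returns the mean of graded per-chunk scores,
// each assumed to lie in the range 0.0 to 1.0.
func contextualRelevancy(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

func main() {
	// Four chunks, three judged relevant: 3 / 4.
	fmt.Println(contextualPrecision([]bool{true, false, true, true})) // 0.75

	// The same four chunks with graded scores: (1.0 + 0.5 + 0.0 + 0.5) / 4.
	fmt.Println(contextualRelevancy([]float64{1.0, 0.5, 0.0, 0.5})) // 0.5
}
```

A chunk that is only partly on topic counts fully toward precision once it is judged relevant, but contributes only its partial grade to relevancy, so the two scores can diverge on the same retrieval.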
hallucination
Scores how free the answer is of unsupported / contradicting claims relative to the context. This is the inverse framing of faithfulness, kept as a distinct handler so users coming from DeepEval find the vocabulary they expect. 1.0 = no hallucination; 0.0 = entirely hallucinated. Equivalent in name to DeepEval hallucination.
| Param | Type | Required | Description |
|---|---|---|---|
| contexts, context, or context_field | - | Yes | Retrieved context the answer should be grounded in |
Plus standard judge params. Surfaces: A E

```yaml
assertions:
  - type: hallucination
    params:
      contexts:
        - "Paris is the capital of France."
      min_score: 0.9
```

Safety Checks
Safety checks score the assistant output for a specific concern: bias, toxicity, PII leakage, role violation. Each is an eval primitive, but the demo-default wiring is as a guardrail, with scenario tests observing the firing via guardrail_triggered. This pairs production enforcement (the guardrail mutates / blocks unsafe content) with test observation (the assertion confirms the guardrail fired on the expected input), from a single primitive.
The shape:

```yaml
# In the pack's prompt config: runtime enforcement
validators:
  - type: pii_leakage
    params:
      direction: output

# In a scenario turn: test predicate
assertions:
  - type: guardrail_triggered
    params:
      validator: pii_leakage
      should_trigger: true
```

Direct scenario invocation (type: pii_leakage in the assertions: block with min_score) is also supported by the generic plumbing, but bypasses the production-side guardrail and is not the documented default for safety primitives.
LLM-judged safety checks (bias, toxicity, role_violation, and the LLM-judged path of pii_leakage) carry a known false-positive rate. Tune min_score for your scenarios and prefer the regex pre-pass for high-confidence patterns.
bias
Scores the answer for demographic, stereotype, gender, racial, or religious bias. Equivalent in name to DeepEval bias.
| Param | Type | Required | Description |
|---|---|---|---|
| min_score | float | No | Pass threshold |
Plus standard judge params (rubric, model, system_prompt, criteria, extra). Surfaces: A G E
toxicity
Scores the answer for toxic content: insults, harassment, threats, hate speech. Equivalent in name to DeepEval toxicity.
Same params as bias. Surfaces: A G E
pii_leakage
Scores the answer for personally-identifiable information leakage. Equivalent in name to DeepEval pii_leakage.
The implementation runs a regex pre-pass for high-confidence patterns (emails, US-style SSN, 16-digit card-shape numbers) before the LLM-judged path. A regex hit returns score 0 immediately without an LLM call, which keeps the obvious cases cheap and deterministic. Ambiguous patterns fall through to the LLM judge.
Same params as bias. Surfaces: A G E
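The pre-pass-then-judge flow described for pii_leakage might look something like the following Go sketch. The regular expressions, the function names, and the judge callback are all assumptions made for illustration; the actual patterns used by the handler are not listed in this reference.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative high-confidence patterns (assumed, not the real ones).
var piiPatterns = []struct {
	name string
	re   *regexp.Regexp
}{
	{"email", regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)},
	{"us_ssn", regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`)},
	{"card_shape", regexp.MustCompile(`\b(?:\d{4}[ -]?){3}\d{4}\b`)},
}

// prePass returns the name of the first matching pattern, if any.
func prePass(content string) (string, bool) {
	for _, p := range piiPatterns {
		if p.re.MatchString(content) {
			return p.name, true
		}
	}
	return "", false
}

// scorePII returns 0 on a regex hit without calling the judge;
// otherwise it defers to the (hypothetical) judge callback.
func scorePII(content string, judge func(string) float64) float64 {
	if _, hit := prePass(content); hit {
		return 0
	}
	return judge(content)
}

func main() {
	// Stub judge standing in for the LLM-judged path.
	judge := func(string) float64 { return 1.0 }

	fmt.Println(scorePII("Reach me at jane@example.com", judge)) // regex hit, judge skipped
	fmt.Println(scorePII("Your order ships tomorrow.", judge))   // falls through to judge
}
```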
role_violation
Scores the answer for adherence to the assigned role / persona / instruction set. Equivalent in name to DeepEval role_violation.
The judge sees the active agent role (sourced in priority order from params["agent_role"], then evalCtx.Metadata["system_prompt"]) so it can decide whether the answer deviates. If no role is available, the judge falls back to generic role-consistency scoring.
| Param | Type | Required | Description |
|---|---|---|---|
| agent_role | string | No | The persona / system prompt the answer should follow. Distinct from the standard system_prompt param, which controls the judge’s prompt. |
| min_score | float | No | Pass threshold |
Plus standard judge params. Surfaces: A G E
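The priority order for sourcing the agent role can be pictured with a short Go sketch. The function name is hypothetical, and treating an empty string as "not set" is an assumption; only the two lookup keys and their order come from the description above.

```go
package main

import "fmt"

// resolveAgentRole looks for the role first in params["agent_role"],
// then in metadata["system_prompt"]. The boolean is false when neither
// is available, in which case the judge would use generic scoring.
func resolveAgentRole(params, metadata map[string]any) (string, bool) {
	if role, ok := params["agent_role"].(string); ok && role != "" {
		return role, true
	}
	if prompt, ok := metadata["system_prompt"].(string); ok && prompt != "" {
		return prompt, true
	}
	return "", false
}

func main() {
	metadata := map[string]any{"system_prompt": "You are a billing assistant."}

	role, ok := resolveAgentRole(map[string]any{"agent_role": "You are a pirate."}, metadata)
	fmt.Println(role, ok) // explicit param wins

	role, ok = resolveAgentRole(map[string]any{}, metadata)
	fmt.Println(role, ok) // falls back to the system prompt in metadata

	role, ok = resolveAgentRole(nil, nil)
	fmt.Println(role, ok) // nothing available
}
```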
External Checks
External checks delegate evaluation to HTTP endpoints or A2A agents. These are the no-code extensibility points for teams that want custom evaluation logic without writing Go.
rest_eval
POSTs turn data to an HTTP endpoint. The endpoint must return {"score": float, "reasoning": string}. The passed field is accepted for backward compatibility but ignored; pass/fail is determined by the assertion or guardrail wrapper based on score thresholds.
| Param | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Endpoint URL |
| method | string | No | HTTP method (default: POST) |
| headers | object | No | Request headers; values support ${ENV_VAR} expansion |
| timeout | string | No | Request timeout |
| include_messages | bool | No | Include conversation messages in payload |
| include_tool_calls | bool | No | Include tool call records in payload |
| criteria | string | No | Evaluation criteria passed to the endpoint |
| min_score | float | No | Minimum score threshold |
| extra | object | No | Additional fields merged into the request body |
Surfaces: A E
rest_eval_session
POSTs full session data to an HTTP endpoint. Same params as rest_eval. Surfaces: A E
a2a_eval
Sends evaluation data to an A2A-protocol eval agent.
| Param | Type | Required | Description |
|---|---|---|---|
| agent_url | string | Yes | URL of the A2A eval agent |
| auth_token | string | No | Auth token; supports ${ENV_VAR} expansion |
| criteria | string | No | Evaluation criteria |
| min_score | float | No | Minimum score threshold |
Surfaces: A E
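Both the rest_eval headers param and the a2a_eval auth_token param mention ${ENV_VAR} expansion. The idea can be sketched with Go's standard os.Expand, as below. This is only an illustration: the helper name is hypothetical, and os.Expand also expands the bare $VAR form, which may or may not match how PromptKit treats such values.

```go
package main

import (
	"fmt"
	"os"
)

// expandHeaders returns a copy of headers with ${VAR} references in the
// values replaced using lookup (typically os.Getenv).
func expandHeaders(headers map[string]string, lookup func(string) string) map[string]string {
	out := make(map[string]string, len(headers))
	for k, v := range headers {
		out[k] = os.Expand(v, lookup)
	}
	return out
}

func main() {
	os.Setenv("SAFETY_API_KEY", "s3cret") // demo value only

	headers := map[string]string{"Authorization": "Bearer ${SAFETY_API_KEY}"}
	expanded := expandHeaders(headers, os.Getenv)
	fmt.Println(expanded["Authorization"]) // Bearer s3cret
}
```

Keeping secrets in the environment and referencing them by name means the pack or scenario YAML can be committed without embedding credentials.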
a2a_eval_session
Session-level A2A evaluation. Same params as a2a_eval. Surfaces: A E
Example:

```yaml
evals:
  - id: safety_check
    type: rest_eval
    trigger: every_turn
    params:
      url: "https://safety.internal/evaluate"
      headers:
        Authorization: "Bearer ${SAFETY_API_KEY}"
      criteria: "Content is safe for all audiences"
      min_score: 0.9
```

Budget & Performance Checks
| Type | Params | Surfaces |
|---|---|---|
| latency_budget | max_ms (int) | A E |
| cost_budget | max_cost_usd, max_total_tokens | E |
cost_budget is session-level and fires on on_session_complete.
Meta Checks
| Type | Params | Surfaces |
|---|---|---|
| guardrail_triggered | guardrail (string), should_trigger (bool) | A E |
| invariant_fields_preserved | tool (string), fields (string[]) | E |
guardrail_triggered inspects prior eval results in the same batch, verifying that a specific guardrail did (or did not) fire.
Behavioral Testing Checks
These checks compare behavior across prompt variants or input perturbations.
| Type | Params | Surfaces |
|---|---|---|
| outcome_equivalent | metric (one of "tool_calls", "final_state", "content_hash") | E |
| directional | check (one of "same_tool_calls", "same_outcome", "similar_content") | E |
Param Aliases
For backward compatibility, some parameter names are aliased. When you use an aliased param name, it is automatically mapped to the canonical name before the handler runs.
| Check Type(s) | Alias Param | Canonical Param |
|---|---|---|
| content_excludes, banned_words | words | patterns |
| max_length, length | max_characters, max_chars | max |
| min_length | min_characters, min_chars | min |
| sentence_count, max_sentences | max_sentences | max |
| field_presence, required_fields | required_fields | fields |
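The alias table can be read as a per-check-type rename step applied to params before the handler sees them. The Go sketch below shows one way such a step could work; the map layout and function name are hypothetical, and letting an explicitly supplied canonical param win over its alias is an assumption, since this reference does not say how conflicts are resolved.

```go
package main

import "fmt"

// paramAliases maps check type -> alias param -> canonical param,
// transcribed from the table above.
var paramAliases = map[string]map[string]string{
	"content_excludes": {"words": "patterns"},
	"banned_words":     {"words": "patterns"},
	"max_length":       {"max_characters": "max", "max_chars": "max"},
	"length":           {"max_characters": "max", "max_chars": "max"},
	"min_length":       {"min_characters": "min", "min_chars": "min"},
	"sentence_count":   {"max_sentences": "max"},
	"max_sentences":    {"max_sentences": "max"},
	"field_presence":   {"required_fields": "fields"},
	"required_fields":  {"required_fields": "fields"},
}

// canonicalize returns a copy of params with aliased names replaced by
// their canonical names for the given check type.
func canonicalize(checkType string, params map[string]any) map[string]any {
	out := make(map[string]any, len(params))
	aliases := paramAliases[checkType]
	for k, v := range params {
		canon, isAlias := aliases[k]
		if !isAlias {
			out[k] = v
			continue
		}
		// Assumption: an explicit canonical value takes precedence.
		if _, exists := params[canon]; !exists {
			out[canon] = v
		}
	}
	return out
}

func main() {
	fmt.Println(canonicalize("banned_words", map[string]any{"words": []string{"internal-only"}}))
	fmt.Println(canonicalize("max_length", map[string]any{"max_chars": 500}))
}
```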
Extending the Check System
PromptKit provides several extensibility points for adding custom check logic.
Custom EvalTypeHandler (Go)
Implement the EvalTypeHandler interface and register it:

```go
type EvalTypeHandler interface {
	Type() string
	Eval(ctx context.Context, evalCtx *EvalContext, params map[string]any) (*EvalResult, error)
}
```

Register globally (available to all registries):

```go
evals.RegisterDefault(handler)
```

Or register on a specific registry instance:

```go
registry.Register(handler)
```

StreamableEvalHandler (Go)
For checks that need streaming support in guardrails, implement StreamableEvalHandler. This enables incremental evaluation on each streaming chunk, allowing early abort.

```go
type StreamableEvalHandler interface {
	EvalTypeHandler
	EvalPartial(ctx context.Context, content string, params map[string]any) (*EvalResult, error)
}
```

Exec Eval Handlers (Any Language)
Define eval handlers as external subprocesses in RuntimeConfig YAML. The subprocess receives JSON on stdin and writes JSON to stdout, so you can use any language.

```yaml
spec:
  evals:
    my_python_eval:
      command: python3
      args: ["./evaluators/my_eval.py"]
      env: ["EVAL_TYPE=my_python_eval"]
      timeoutMs: 5000
```

Stdin receives:

```json
{"type": "my_python_eval", "params": {...}, "content": "...", "context": {...}}
```

Stdout must return:

```json
{"score": 0.85, "detail": "Explanation text", "data": {}}
```

The score value (0.0 to 1.0) is the eval’s output. Pass/fail is not determined by the handler; assertion and guardrail wrappers apply score thresholds to determine pass/fail.
Custom JudgeProvider (Go)
Customize how LLM judge checks call language models:

```go
type JudgeProvider interface {
	Judge(ctx context.Context, opts JudgeOpts) (*JudgeResult, error)
}
```

Register via sdk.WithJudgeProvider(provider) when opening a conversation.
Custom ProviderHook (Go, for guardrails)
For custom runtime guardrails beyond the built-in check types, implement ProviderHook to intercept LLM calls:

```go
type ProviderHook interface {
	Name() string
	BeforeCall(ctx context.Context, req *ProviderRequest) Decision
	AfterCall(ctx context.Context, req *ProviderRequest, resp *ProviderResponse) Decision
}
```

Optionally implement ChunkInterceptor for streaming interception:

```go
type ChunkInterceptor interface {
	OnChunk(ctx context.Context, chunk *providers.StreamChunk) Decision
}
```

Register via sdk.WithProviderHook(hook).
REST and A2A External Checks
For no-code extensibility, use the rest_eval and a2a_eval check types. These let you delegate evaluation to any HTTP endpoint or A2A-compatible agent without writing Go code.
See Also
- Unified Check Model: How checks, assertions, guardrails, and evals relate
- Write Assertions: Using checks in Arena test scenarios
- Add Guardrails: Using checks as runtime policy enforcers
- Eval Framework: Production eval architecture
- Run Evals: Programmatic eval execution