Checks Reference

PromptKit has a unified check system: one set of check types usable across three surfaces.

  • Assertions — validate LLM behavior in Arena test scenarios (assertions: field in scenario YAML).
  • Guardrails — enforce runtime policy in production (validators: field in pack YAML).
  • Evals — monitor quality in production (evals: field in pack YAML).

All check types are implemented as EvalTypeHandler instances registered in the runtime/evals/ package. See the Unified Check Model for the conceptual overview.

Surface legend used in the tables below:

| Symbol | Surface |
| --- | --- |
| A | Assertion |
| G | Guardrail |
| E | Eval |

“Streaming” indicates whether the check supports incremental evaluation during streaming responses (relevant for guardrails).


| Type | Aliases | Params | Surfaces | Streaming |
| --- | --- | --- | --- | --- |
| contains | content_includes | patterns (string[]) | A G E | No |
| regex | content_matches | pattern (string) | A G E | No |
| content_excludes | banned_words, content_not_includes | patterns (string[]) | A G E | Yes |
| contains_any | content_includes_any | patterns (string[]) | A E | No |
| min_length | | min or min_characters (int) | A E | No |
| max_length | length | max or max_characters (int), max_tokens (int) | A G E | Yes |
| sentence_count | max_sentences | max or max_sentences (int) | A G E | No |
| field_presence | required_fields | fields or required_fields (string[]) | A G E | No |
| cosine_similarity | | reference (string), min_similarity (float) | A E | No |

When content_excludes is invoked via the banned_words alias, match_mode defaults to word_boundary.
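
To illustrate why that default matters, here is a minimal Go sketch contrasting plain substring matching with word-boundary matching. It assumes word_boundary behaves like a case-insensitive regex `\b` match; this page does not confirm that detail, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// containsSubstring reports whether any pattern occurs anywhere in content.
func containsSubstring(content string, patterns []string) bool {
	lower := strings.ToLower(content)
	for _, p := range patterns {
		if strings.Contains(lower, strings.ToLower(p)) {
			return true
		}
	}
	return false
}

// containsWord reports whether any pattern occurs as a whole word.
// Assumption: word_boundary is modeled here with the regex \b anchor.
func containsWord(content string, patterns []string) bool {
	for _, p := range patterns {
		re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(p) + `\b`)
		if re.MatchString(content) {
			return true
		}
	}
	return false
}

func main() {
	banned := []string{"cat"}
	text := "concatenate the strings"
	fmt.Println(containsSubstring(text, banned)) // true: "cat" is inside "concatenate"
	fmt.Println(containsWord(text, banned))      // false: no standalone word "cat"
}
```

With word boundaries, a banned term embedded in a longer, harmless word no longer trips the check.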

Example — assertion (scenario YAML):

assertions:
  - type: contains
    params:
      patterns: ["thank you", "welcome"]

Example — guardrail (pack YAML):

validators:
  - type: banned_words
    params:
      patterns: ["competitor-name", "internal-only"]

Example — eval (pack YAML):

evals:
  - id: response_length
    type: max_length
    trigger: every_turn
    params:
      max: 500

| Type | Aliases | Params | Surfaces |
| --- | --- | --- | --- |
| json_valid | is_valid_json, valid_json | | A E |
| json_schema | | schema (object) | A E |
| json_path | | expression (string), expected, contains, min_results, max_results | A E |

Example:

assertions:
  - type: json_path
    params:
      expression: "$.order.status"
      expected: "confirmed"

These checks evaluate tool usage within a single assistant turn.

| Type | Aliases | Params | Surfaces |
| --- | --- | --- | --- |
| tools_called | tool_called | tool_names (string[]), min_calls (int, default 1), ignore_validation (bool), require_args (bool) | A G E |
| tools_not_called | | tool_names (string[]) | A G E |
| tool_args | | tool_name (string), expected_args (object) | A E |
| tool_calls_with_args | | tool_name, expected_args, result_includes | A E |
| tool_call_count | | tool (string), min (int), max (int) | A E |
| tool_call_sequence | | sequence (string[]) | A E |
| tool_call_chain | | chain (string[]) | A E |
| tool_anti_pattern | | patterns (array of {sequence, message}) | A E |
| tool_no_repeat | | tools (string[]), max_repeats (int) | E |
| tool_efficiency | | max_calls, max_errors, max_error_rate | E |
| no_tool_errors | | | A E |
| tool_result_includes | | tool_name, patterns (string[]) | A E |
| tool_result_matches | | tool_name, pattern (string) | A E |
| tool_result_has_media | | tool_name | E |
| tool_result_media_type | | tool_name, media_type | E |

Example:

assertions:
  - type: tool_call_sequence
    params:
      sequence: ["lookup_customer", "create_ticket"]

These checks evaluate tool usage across the entire session.

| Type | Aliases | Params | Surfaces |
| --- | --- | --- | --- |
| tools_called (session) | tool_called | tool_names (string[]) | A G E |
| tools_not_called (session) | | tool_names (string[]) | A G E |
| tool_args (session) | | tool_name, expected_args | A E |
| tool_args_excluded_session | tools_not_called_with_args | tool_name, excluded_args | A E |

Session-level tool checks use the on_session_complete or on_conversation_complete trigger.


Unlike the tool checks above (which evaluate tools the agent already called), this check invokes a tool itself and asserts on the result. Typical use is to run a verification tool — a sandbox’s run_tests, a render-and-diff utility, a custom HTTP probe — as the hard gate after the conversation completes.

tool_exec invokes a registered tool by name through the runtime tool registry and passes if the call succeeds. The pass condition is:

  • tools.Registry.Execute returns no error, and
  • the resulting ToolResult.Error field is empty.

This makes tool_exec a generic “is this tool happy” gate that works with any registered tool — MCP-discovered (e.g. a sandbox’s run_tests), HTTP, local executors, custom client tools. The handler doesn’t know or care about the transport.
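
The two-part pass condition can be sketched in Go as follows. This is a minimal illustration, not the handler's implementation: `ToolResult`, `Executor`, and `fakeRegistry` are hypothetical stand-ins for the runtime types, and only the `Error` field and an Execute-style call are taken from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ToolResult is a hypothetical stand-in; only the Error field is from the text.
type ToolResult struct {
	Error string
}

// Executor is a hypothetical stand-in for the registry's Execute call.
type Executor interface {
	Execute(ctx context.Context, tool string, args map[string]any) (*ToolResult, error)
}

// toolExecPasses applies both conditions: Execute returns no error, and the
// result's Error field is empty. The string explains a failure.
func toolExecPasses(reg Executor, tool string, args map[string]any, timeoutSeconds int) (bool, string) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutSeconds)*time.Second)
	defer cancel()
	res, err := reg.Execute(ctx, tool, args)
	if err != nil {
		return false, "execute failed: " + err.Error()
	}
	if res != nil && res.Error != "" {
		return false, "tool reported: " + res.Error
	}
	return true, ""
}

// fakeRegistry maps tool names to canned results for demonstration.
type fakeRegistry map[string]*ToolResult

func (f fakeRegistry) Execute(_ context.Context, tool string, _ map[string]any) (*ToolResult, error) {
	res, ok := f[tool]
	if !ok {
		return nil, errors.New("unknown tool: " + tool)
	}
	return res, nil
}

func main() {
	reg := fakeRegistry{
		"run_tests":        {},
		"validate_invoice": {Error: "missing tax id"},
	}
	for _, tool := range []string{"run_tests", "validate_invoice", "not_registered"} {
		ok, why := toolExecPasses(reg, tool, map[string]any{}, 120)
		fmt.Println(tool, ok, why)
	}
}
```

Both failure paths return an explanation string, mirroring the idea that a failed gate should be debuggable from the report.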

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| tool | string | Yes | Registry name of the tool to invoke. |
| args | object | No | Arguments passed verbatim to the tool. Defaults to {}. |
| timeout_seconds | int | No | Per-call timeout. Default 120. Generous because the typical use case is a long-running test suite inside a sandbox. |

Surfaces: A E (conversation-level / session-level — invoke at the end of the session, not per turn)

Example — gating on a sandbox’s hidden test suite:

conversation_assertions:
  - type: tool_exec
    params:
      tool: run_tests
    message: "Hidden tests must pass"

Pair this with a source-backed MCP entry that supplies the run_tests tool — the sandbox lives for the session, runs the agent’s edits, and the gate checks them at the end.

Example — pack-shipped validation tool:

conversation_assertions:
  - type: tool_exec
    params:
      tool: validate_invoice
      args:
        strict: true
      timeout_seconds: 30
    message: "Final invoice must validate"

Notes

  • The host (arena, SDK, …) must inject a *tools.Registry into EvalContext.Metadata["tool_registry"]. Arena does this automatically; SDK consumers using the runtime evals API directly need to populate it themselves.
  • Because the gate calls a tool, it counts toward whatever cost / side-effect budget the tool implies (e.g. running the test suite costs CPU time, an HTTP probe costs a request).
  • Errors from the tool surface in the assertion’s Explanation so failures are debuggable from the report without re-running.

| Type | Params | Surfaces |
| --- | --- | --- |
| agent_invoked | agent_names (string[]) | A E |
| agent_not_invoked | agent_names (string[]) | A E |
| agent_response_contains | agent_name, patterns | E |
| skill_activated | skill_names (string[]) | A E |
| skill_not_activated | skill_names (string[]) | A E |
| skill_activation_order | sequence (string[]) | A E |

Example:

assertions:
  - type: agent_invoked
    params:
      agent_names: ["billing-agent"]

| Type | Params | Surfaces |
| --- | --- | --- |
| workflow_complete | | A E |
| workflow_state_is | state (string) | A E |
| workflow_transitioned_to | state (string) | A E |
| workflow_transition_order | sequence (string[]) | A E |
| workflow_tool_access | rules (array of {state, allowed}) | A E |

Example:

assertions:
  - type: workflow_transition_order
    params:
      sequence: ["triage", "investigation", "resolution"]

| Type | Params | Surfaces |
| --- | --- | --- |
| image_format | formats (string[]) | A E |
| image_dimensions | min_width, max_width, min_height, max_height | A E |
| audio_format | formats (string[]) | A E |
| audio_duration | min_seconds, max_seconds | A E |
| video_duration | min_seconds, max_seconds | A E |
| video_resolution | min_width, max_width, presets (string[]) | A E |

LLM judge checks send the assistant output (or full session) to a language model for evaluation. The judge returns a score (0.0 to 1.0) and reasoning.

llm_judge performs turn-level LLM evaluation. The judge sees the current assistant response and evaluates it against the provided criteria.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| criteria | string | Yes | What the judge should evaluate |
| rubric | string | No | Detailed scoring guidance |
| model | string | No | Model to use for judging |
| system_prompt | string | No | Override the default judge system prompt |
| min_score | float | No | Minimum score threshold for passing |
| extra | object | No | Additional provider-specific parameters |

Surfaces: A E

Session-level LLM evaluation. The judge sees the full conversation. Alias: llm_judge_conversation.

Same params as llm_judge. Surfaces: A E

Evaluates tool usage patterns via an LLM judge.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| criteria | string | Yes | What the judge should evaluate about tool usage |
| tools | string[] | No | Filter to specific tools |

Plus all standard judge params (rubric, model, system_prompt, min_score, extra). Surfaces: A E

Example:

assertions:
  - type: llm_judge
    params:
      criteria: "Response is empathetic and addresses the customer's concern"
      min_score: 0.7

RAG checks are named eval primitives for retrieval-augmented generation: they score the answer against retrieved context (faithfulness, hallucination), the answer against the question (answer_relevancy), or the retrieved chunks against the question / ground truth (contextual_precision, contextual_recall, contextual_relevancy).

Each handler is a thin wrapper over llm_judge with a hardened default prompt drawn from public DeepEval / Ragas reference implementations (Apache 2.0). The standard judge params (rubric, model, system_prompt, min_score, extra) all apply; supplying system_prompt or criteria overrides the default.

Context sources — every handler that needs retrieved chunks accepts them in three forms:

| Form | Description |
| --- | --- |
| contexts: ["chunk-1", "chunk-2"] | Canonical list form |
| context: "single chunk" | Convenience form for one chunk |
| context_field: retrieved_chunks | Looks up the named key in evalCtx.Metadata. Use this when a retrieval tool writes chunks to metadata at runtime |
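
A handler could resolve the three forms into a single list of chunks roughly as sketched below in Go. The function names and the lookup order (contexts, then context, then context_field) are assumptions made for illustration; the page does not state which form wins when several are supplied.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveContexts returns retrieved chunks from whichever form is present.
// Assumption: the precedence contexts > context > context_field is illustrative.
func resolveContexts(params, metadata map[string]any) ([]string, error) {
	if v, ok := params["contexts"]; ok {
		return toStrings(v)
	}
	if s, ok := params["context"].(string); ok {
		return []string{s}, nil
	}
	if key, ok := params["context_field"].(string); ok {
		v, found := metadata[key]
		if !found {
			return nil, fmt.Errorf("metadata key %q not found", key)
		}
		return toStrings(v)
	}
	return nil, errors.New("one of contexts, context, or context_field is required")
}

// toStrings accepts []string or the []any that YAML/JSON decoding produces.
func toStrings(v any) ([]string, error) {
	switch t := v.(type) {
	case []string:
		return t, nil
	case []any:
		out := make([]string, 0, len(t))
		for _, item := range t {
			s, ok := item.(string)
			if !ok {
				return nil, fmt.Errorf("expected string chunk, got %T", item)
			}
			out = append(out, s)
		}
		return out, nil
	default:
		return nil, fmt.Errorf("expected a list of strings, got %T", v)
	}
}

func main() {
	meta := map[string]any{"retrieved_chunks": []any{"chunk-1", "chunk-2"}}
	chunks, err := resolveContexts(map[string]any{"context_field": "retrieved_chunks"}, meta)
	fmt.Println(chunks, err)
}
```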

The faithfulness check scores how directly the answer is supported by the supplied context. Equivalent in name to DeepEval / Ragas faithfulness.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| contexts / context / context_field | string[] / string / string | Yes (one of) | Retrieved context the answer should be grounded in |

Plus standard judge params. Surfaces: A E

assertions:
  - type: faithfulness
    params:
      context_field: retrieved_chunks
      min_score: 0.8

The answer_relevancy check scores how directly the answer addresses the user’s question. Equivalent in name to DeepEval / Ragas answer_relevancy.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| question | string | No | Defaults to the last user turn in the session |

Plus standard judge params. Surfaces: A E

The contextual_precision check scores the fraction of retrieved chunks that are relevant to the question (relevant chunks / total chunks). Equivalent in name to DeepEval contextual_precision.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| contexts / context / context_field | | Yes | Retrieved chunks |
| question | string | No | Defaults to the last user turn |

Plus standard judge params. Surfaces: A E

The contextual_recall check scores how completely the retrieved chunks cover the information the ground-truth answer relies on. Equivalent in name to DeepEval / Ragas contextual_recall.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| contexts / context / context_field | | Yes | Retrieved chunks |
| reference / expected_output | string | Yes | Ground-truth answer |

Plus standard judge params. Surfaces: A E

The contextual_relevancy check scores the mean per-chunk relevance of retrieved chunks to the question. It differs from contextual_precision: precision counts binary relevant / not-relevant verdicts, while relevancy is the mean of graded scores. Equivalent in name to DeepEval contextual_relevancy.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| contexts / context / context_field | | Yes | Retrieved chunks |
| question | string | No | Defaults to the last user turn |

Plus standard judge params. Surfaces: A E
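
The difference between contextual_precision and contextual_relevancy is easiest to see as arithmetic. The Go sketch below assumes the judge has already produced per-chunk verdicts or grades; the function names and the choice to return 0 for an empty chunk list are illustrative assumptions.

```go
package main

import "fmt"

// contextualPrecision is relevant chunks divided by total chunks, where each
// chunk carries a binary relevant / not-relevant verdict.
func contextualPrecision(verdicts []bool) float64 {
	if len(verdicts) == 0 {
		return 0 // assumption: no chunks scores 0
	}
	relevant := 0
	for _, v := range verdicts {
		if v {
			relevant++
		}
	}
	return float64(relevant) / float64(len(verdicts))
}

// contextualRelevancy is the mean of graded per-chunk scores in [0, 1].
func contextualRelevancy(grades []float64) float64 {
	if len(grades) == 0 {
		return 0 // assumption: no chunks scores 0
	}
	sum := 0.0
	for _, g := range grades {
		sum += g
	}
	return sum / float64(len(grades))
}

func main() {
	// Three of four chunks judged relevant.
	fmt.Println(contextualPrecision([]bool{true, true, false, true})) // 0.75
	// The same four chunks, graded instead of judged yes/no.
	fmt.Println(contextualRelevancy([]float64{1.0, 0.5, 0.0, 0.5})) // 0.5
}
```

Two partially relevant chunks count fully toward precision but only half toward relevancy, which is why the two scores can diverge on the same retrieval.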

The hallucination check scores how free the answer is of unsupported or contradicting claims relative to the context. It is the inverse framing of faithfulness, kept as a distinct handler so users coming from DeepEval find the vocabulary they expect. 1.0 = no hallucination; 0.0 = entirely hallucinated. Equivalent in name to DeepEval hallucination.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| contexts / context / context_field | | Yes | Retrieved context the answer should be grounded in |

Plus standard judge params. Surfaces: A E

assertions:
  - type: hallucination
    params:
      contexts:
        - "Paris is the capital of France."
      min_score: 0.9

Safety checks score the assistant output for a specific concern: bias, toxicity, PII leakage, role violation. Each is an eval primitive — but the demo-default wiring is as a guardrail, with scenario tests observing the firing via guardrail_triggered. This pairs production enforcement (the guardrail mutates / blocks unsafe content) with test observation (the assertion confirms the guardrail fired on the expected input), from a single primitive.

The shape:

# In the pack's prompt config: runtime enforcement
validators:
  - type: pii_leakage
    params:
      direction: output

# In a scenario turn: test predicate
assertions:
  - type: guardrail_triggered
    params:
      validator: pii_leakage
      should_trigger: true

Direct scenario invocation (type: pii_leakage in the assertions: block with min_score) is also supported by the generic plumbing, but bypasses the production-side guardrail and is not the documented default for safety primitives.

LLM-judged safety checks (bias, toxicity, role_violation, and the LLM-judged path of pii_leakage) carry a known false-positive rate. Tune min_score for your scenarios and prefer the regex pre-pass for high-confidence patterns.

The bias check scores the answer for demographic, stereotype, gender, racial, or religious bias. Equivalent in name to DeepEval bias.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| min_score | float | No | Pass threshold |

Plus standard judge params (rubric, model, system_prompt, criteria, extra). Surfaces: A G E

The toxicity check scores the answer for toxic content: insults, harassment, threats, hate speech. Equivalent in name to DeepEval toxicity.

Same params as bias. Surfaces: A G E

The pii_leakage check scores the answer for leakage of personally identifiable information. Equivalent in name to DeepEval pii_leakage.

The implementation runs a regex pre-pass for high-confidence patterns (emails, US-style SSNs, 16-digit card-shaped numbers) before the LLM-judged path. A regex hit returns score 0 immediately without an LLM call, which keeps the obvious cases cheap and deterministic. Ambiguous patterns fall through to the LLM judge.

Same params as bias. Surfaces: A G E
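
The pre-pass idea for pii_leakage can be sketched in Go as below. The regular expressions are illustrative guesses at "high-confidence patterns", not the patterns the handler actually ships, and `prePass` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; the real handler's patterns are not shown here.
var piiPatterns = map[string]*regexp.Regexp{
	"email": regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`),
	"ssn":   regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
	"card":  regexp.MustCompile(`\b(?:\d{4}[ -]?){3}\d{4}\b`),
}

// Fixed order keeps the reported kind deterministic.
var piiOrder = []string{"email", "ssn", "card"}

// prePass returns hit=true with score 0 on a high-confidence match.
// hit=false means the content should fall through to the LLM judge.
func prePass(content string) (score float64, kind string, hit bool) {
	for _, name := range piiOrder {
		if piiPatterns[name].MatchString(content) {
			return 0, name, true
		}
	}
	return 0, "", false
}

func main() {
	for _, s := range []string{
		"Reach me at jane@example.com",
		"My SSN is 123-45-6789",
		"Card: 4111 1111 1111 1111",
		"The order shipped yesterday",
	} {
		_, kind, hit := prePass(s)
		fmt.Printf("%q hit=%v kind=%q\n", s, hit, kind)
	}
}
```

Only the last input reaches the judge in this sketch; the first three are rejected without an LLM call.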

The role_violation check scores the answer for adherence to the assigned role / persona / instruction set. Equivalent in name to DeepEval role_violation.

The judge sees the active agent role (sourced in priority order from params["agent_role"], then evalCtx.Metadata["system_prompt"]) so it can decide whether the answer deviates. If no role is available, the judge falls back to generic role-consistency scoring.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| agent_role | string | No | The persona / system prompt the answer should follow. Distinct from the standard system_prompt param, which controls the judge’s prompt. |
| min_score | float | No | Pass threshold |

Plus standard judge params. Surfaces: A G E


External checks delegate evaluation to HTTP endpoints or A2A agents. These are the no-code extensibility points for teams that want custom evaluation logic without writing Go.

rest_eval POSTs turn data to an HTTP endpoint. The endpoint must return {"score": float, "reasoning": string}. A passed field is accepted for backward compatibility but ignored: pass/fail is determined by the assertion or guardrail wrapper based on score thresholds.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Endpoint URL |
| method | string | No | HTTP method (default: POST) |
| headers | object | No | Request headers; values support ${ENV_VAR} expansion |
| timeout | string | No | Request timeout |
| include_messages | bool | No | Include conversation messages in payload |
| include_tool_calls | bool | No | Include tool call records in payload |
| criteria | string | No | Evaluation criteria passed to the endpoint |
| min_score | float | No | Minimum score threshold |
| extra | object | No | Additional fields merged into the request body |

Surfaces: A E
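
On the receiving side, an endpoint that satisfies the response contract could look like the Go sketch below. The response shape ({"score", "reasoning"}) is taken from the description above; the request shape is not documented here, so reading the assistant text from a "content" field is an assumption, as is the toy scoring rule.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// evalResponse mirrors the response contract: {"score": float, "reasoning": string}.
type evalResponse struct {
	Score     float64 `json:"score"`
	Reasoning string  `json:"reasoning"`
}

// evalHandler scores one request.
// Assumption: the payload carries the assistant text in a "content" field.
func evalHandler(w http.ResponseWriter, r *http.Request) {
	var payload map[string]any
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	content, _ := payload["content"].(string)
	resp := evalResponse{Score: 1.0, Reasoning: "no sensitive terms found"}
	if strings.Contains(strings.ToLower(content), "password") {
		resp = evalResponse{Score: 0.0, Reasoning: "mentions a password"}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}

// probe exercises the handler in-process, without opening a network port.
func probe(content string) (evalResponse, error) {
	body, err := json.Marshal(map[string]any{"content": content})
	if err != nil {
		return evalResponse{}, err
	}
	req := httptest.NewRequest(http.MethodPost, "/evaluate", bytes.NewReader(body))
	rec := httptest.NewRecorder()
	evalHandler(rec, req)
	var out evalResponse
	err = json.NewDecoder(rec.Body).Decode(&out)
	return out, err
}

func main() {
	out, err := probe("Your password is hunter2")
	fmt.Println(out.Score, out.Reasoning, err)
}
```

To serve it for real, the same handler could be mounted with http.HandleFunc("/evaluate", evalHandler) and http.ListenAndServe. The endpoint returns only a score and reasoning; the threshold is applied by the caller.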

POSTs full session data to an HTTP endpoint. Same params as rest_eval. Surfaces: A E

a2a_eval sends evaluation data to an A2A-protocol eval agent.

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| agent_url | string | Yes | URL of the A2A eval agent |
| auth_token | string | No | Auth token; supports ${ENV_VAR} expansion |
| criteria | string | No | Evaluation criteria |
| min_score | float | No | Minimum score threshold |

Surfaces: A E

Session-level A2A evaluation. Same params as a2a_eval. Surfaces: A E

Example:

evals:
  - id: safety_check
    type: rest_eval
    trigger: every_turn
    params:
      url: "https://safety.internal/evaluate"
      headers:
        Authorization: "Bearer ${SAFETY_API_KEY}"
      criteria: "Content is safe for all audiences"
      min_score: 0.9
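
The ${ENV_VAR} expansion used in the Authorization header can be approximated with Go's standard os.Expand, as in this minimal sketch. How PromptKit itself performs the expansion is not shown on this page; `expandHeaders` is a hypothetical helper, and treating unset variables as empty strings is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// expandHeaders returns a copy of headers with ${VAR} references replaced
// via lookup. Note that os.Expand also expands the bare $VAR form.
// Assumption: unset variables expand to the empty string.
func expandHeaders(headers map[string]string, lookup func(string) string) map[string]string {
	out := make(map[string]string, len(headers))
	for k, v := range headers {
		out[k] = os.Expand(v, lookup)
	}
	return out
}

func main() {
	os.Setenv("SAFETY_API_KEY", "example-key")
	h := expandHeaders(map[string]string{
		"Authorization": "Bearer ${SAFETY_API_KEY}",
	}, os.Getenv)
	fmt.Println(h["Authorization"]) // Bearer example-key
}
```

Keeping the secret in the environment means the pack YAML can be committed without embedding credentials.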

| Type | Params | Surfaces |
| --- | --- | --- |
| latency_budget | max_ms (int) | A E |
| cost_budget | max_cost_usd, max_total_tokens | E |

cost_budget is session-level and fires on on_session_complete.


| Type | Params | Surfaces |
| --- | --- | --- |
| guardrail_triggered | guardrail (string), should_trigger (bool) | A E |
| invariant_fields_preserved | tool (string), fields (string[]) | E |

guardrail_triggered inspects prior eval results in the same batch, verifying that a specific guardrail did (or did not) fire.


These checks compare behavior across prompt variants or input perturbations.

| Type | Params | Surfaces |
| --- | --- | --- |
| outcome_equivalent | metric: one of "tool_calls", "final_state", "content_hash" | E |
| directional | check: one of "same_tool_calls", "same_outcome", "similar_content" | E |

For backward compatibility, some parameter names are aliased. When you use an aliased param name, it is automatically mapped to the canonical name before the handler runs.

| Check Type(s) | Alias Param | Canonical Param |
| --- | --- | --- |
| content_excludes, banned_words | words | patterns |
| max_length, length | max_characters, max_chars | max |
| min_length | min_characters, min_chars | min |
| sentence_count, max_sentences | max_sentences | max |
| field_presence, required_fields | required_fields | fields |
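
The mapping step can be pictured as a small lookup applied before the handler runs. The Go sketch below encodes the alias table from this section; the function name and the rule that an explicitly set canonical param wins over its alias are assumptions for illustration.

```go
package main

import "fmt"

// paramAliases maps check type -> alias param -> canonical param,
// transcribed from the alias table in this section.
var paramAliases = map[string]map[string]string{
	"content_excludes": {"words": "patterns"},
	"banned_words":     {"words": "patterns"},
	"max_length":       {"max_characters": "max", "max_chars": "max"},
	"length":           {"max_characters": "max", "max_chars": "max"},
	"min_length":       {"min_characters": "min", "min_chars": "min"},
	"sentence_count":   {"max_sentences": "max"},
	"max_sentences":    {"max_sentences": "max"},
	"field_presence":   {"required_fields": "fields"},
	"required_fields":  {"required_fields": "fields"},
}

// canonicalize returns a copy of params with alias names rewritten.
// Assumption: if both alias and canonical name are set, the canonical wins.
func canonicalize(checkType string, params map[string]any) map[string]any {
	aliases := paramAliases[checkType]
	out := make(map[string]any, len(params))
	// Copy non-alias keys first so canonical values take precedence.
	for k, v := range params {
		if _, isAlias := aliases[k]; !isAlias {
			out[k] = v
		}
	}
	// Then map aliases onto canonical names that are still unset.
	for k, v := range params {
		if canon, isAlias := aliases[k]; isAlias {
			if _, exists := out[canon]; !exists {
				out[canon] = v
			}
		}
	}
	return out
}

func main() {
	fmt.Println(canonicalize("max_length", map[string]any{"max_chars": 500}))
	fmt.Println(canonicalize("banned_words", map[string]any{"words": []string{"internal-only"}}))
}
```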

PromptKit provides several extensibility points for adding custom check logic.

Implement the EvalTypeHandler interface and register it:

type EvalTypeHandler interface {
	Type() string
	Eval(ctx context.Context, evalCtx *EvalContext, params map[string]any) (*EvalResult, error)
}

Register globally (available to all registries):

evals.RegisterDefault(handler)

Or register on a specific registry instance:

registry.Register(handler)
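
As a rough illustration, a custom handler might look like the Go sketch below. The interface is copied from this section, but `EvalContext` and `EvalResult` are local stand-ins whose fields are hypothetical, and the `politeness` check type and its `phrase` param are invented for the example. Registration is left out because it needs the real evals package.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Stand-in types: the real EvalContext and EvalResult live in the runtime;
// the fields below are hypothetical and chosen only for this sketch.
type EvalContext struct {
	CurrentOutput string
}

type EvalResult struct {
	Score       float64
	Explanation string
}

// EvalTypeHandler is copied from the interface shown in this section.
type EvalTypeHandler interface {
	Type() string
	Eval(ctx context.Context, evalCtx *EvalContext, params map[string]any) (*EvalResult, error)
}

// politenessHandler scores 1.0 when the output contains the configured phrase.
type politenessHandler struct{}

func (politenessHandler) Type() string { return "politeness" }

func (politenessHandler) Eval(_ context.Context, evalCtx *EvalContext, params map[string]any) (*EvalResult, error) {
	phrase, ok := params["phrase"].(string)
	if !ok || phrase == "" {
		return nil, fmt.Errorf("politeness: param %q must be a non-empty string", "phrase")
	}
	if strings.Contains(strings.ToLower(evalCtx.CurrentOutput), strings.ToLower(phrase)) {
		return &EvalResult{Score: 1.0, Explanation: "phrase found"}, nil
	}
	return &EvalResult{Score: 0.0, Explanation: "phrase missing"}, nil
}

// Compile-time check that the handler satisfies the interface.
var _ EvalTypeHandler = politenessHandler{}

// score is a small helper for trying the handler; it returns -1 on error.
func score(output, phrase string) float64 {
	res, err := politenessHandler{}.Eval(context.Background(),
		&EvalContext{CurrentOutput: output}, map[string]any{"phrase": phrase})
	if err != nil {
		return -1
	}
	return res.Score
}

func main() {
	fmt.Println(score("Thank you for waiting.", "thank you"))
	fmt.Println(score("No.", "thank you"))
}
```

The handler returns only a score and an explanation; consistent with the rest of this reference, thresholds are left to the assertion or guardrail wrapper.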

For checks that need streaming support in guardrails, implement StreamableEvalHandler. This enables incremental evaluation on each streaming chunk, allowing early abort.

type StreamableEvalHandler interface {
	EvalTypeHandler
	EvalPartial(ctx context.Context, content string, params map[string]any) (*EvalResult, error)
}
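
Early abort during streaming can be sketched in Go as follows. Only the EvalPartial signature comes from the interface above; `EvalResult` is a local stand-in, and the sketch assumes `content` is the response accumulated so far rather than just the newest chunk, which this page does not specify.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// EvalResult is a hypothetical stand-in for the runtime type.
type EvalResult struct {
	Score float64
}

// bannedStream provides the EvalPartial method from the interface above.
type bannedStream struct{}

// EvalPartial scores the text seen so far: 0.0 once any pattern appears.
// Assumption: content is the accumulated response, not a single chunk.
func (bannedStream) EvalPartial(_ context.Context, content string, params map[string]any) (*EvalResult, error) {
	patterns, _ := params["patterns"].([]string)
	lower := strings.ToLower(content)
	for _, p := range patterns {
		if strings.Contains(lower, strings.ToLower(p)) {
			return &EvalResult{Score: 0.0}, nil
		}
	}
	return &EvalResult{Score: 1.0}, nil
}

// streamUntilViolation feeds chunks in order. It returns how many chunks were
// consumed and whether the stream was aborted early.
func streamUntilViolation(chunks, patterns []string) (consumed int, aborted bool) {
	var sb strings.Builder
	h := bannedStream{}
	params := map[string]any{"patterns": patterns}
	for i, c := range chunks {
		sb.WriteString(c)
		res, err := h.EvalPartial(context.Background(), sb.String(), params)
		if err != nil || res.Score == 0 {
			return i + 1, true // later chunks are never inspected
		}
	}
	return len(chunks), false
}

func main() {
	chunks := []string{"Our product beats ", "competitor-name", " on price."}
	fmt.Println(streamUntilViolation(chunks, []string{"competitor-name"}))
}
```

Evaluating the accumulated text also catches a banned term that arrives split across two chunks, at the cost of rescanning earlier content on every call.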

Define eval handlers as external subprocesses in RuntimeConfig YAML. The subprocess receives JSON on stdin and writes JSON to stdout, so you can use any language.

spec:
  evals:
    my_python_eval:
      command: python3
      args: ["./evaluators/my_eval.py"]
      env: ["EVAL_TYPE=my_python_eval"]
      timeoutMs: 5000

Stdin receives:

{"type": "my_python_eval", "params": {...}, "content": "...", "context": {...}}

Stdout must return:

{"score": 0.85, "detail": "Explanation text", "data": {}}

The score value (0.0 to 1.0) is the eval’s output. The handler does not determine pass/fail; assertion and guardrail wrappers apply score thresholds to decide it.
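
Because the protocol is plain JSON over stdin and stdout, the evaluator can be written in any language. Below is a minimal Go sketch of the subprocess side. The field names follow the stdin and stdout examples above; the `keywords` param and the scoring rule (share of keywords present) are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// request mirrors the stdin JSON shown above.
type request struct {
	Type    string         `json:"type"`
	Params  map[string]any `json:"params"`
	Content string         `json:"content"`
	Context map[string]any `json:"context"`
}

// response mirrors the stdout JSON shown above.
type response struct {
	Score  float64        `json:"score"`
	Detail string         `json:"detail"`
	Data   map[string]any `json:"data"`
}

// run reads one request from in and writes one response to out.
// Hypothetical rule: score is the share of params.keywords found in content.
func run(in io.Reader, out io.Writer) error {
	var req request
	if err := json.NewDecoder(in).Decode(&req); err != nil {
		return fmt.Errorf("decode request: %w", err)
	}
	keywords, _ := req.Params["keywords"].([]any)
	lower := strings.ToLower(req.Content)
	found := 0
	for _, k := range keywords {
		if s, ok := k.(string); ok && strings.Contains(lower, strings.ToLower(s)) {
			found++
		}
	}
	score := 1.0
	if len(keywords) > 0 {
		score = float64(found) / float64(len(keywords))
	}
	return json.NewEncoder(out).Encode(response{
		Score:  score,
		Detail: fmt.Sprintf("%d of %d keywords present", found, len(keywords)),
		Data:   map[string]any{},
	})
}

// scoreOf runs one request given as a JSON string and returns its score.
func scoreOf(input string) (float64, error) {
	var buf bytes.Buffer
	if err := run(strings.NewReader(input), &buf); err != nil {
		return 0, err
	}
	var resp response
	err := json.Unmarshal(buf.Bytes(), &resp)
	return resp.Score, err
}

func main() {
	// A deployed evaluator would call run(os.Stdin, os.Stdout); a fixed
	// sample is used here so the sketch runs without piped input.
	sample := `{"type":"my_eval","params":{"keywords":["refund","sorry"]},"content":"Sorry about that","context":{}}`
	if err := run(strings.NewReader(sample), os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```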

Customize how LLM judge checks call language models:

type JudgeProvider interface {
	Judge(ctx context.Context, opts JudgeOpts) (*JudgeResult, error)
}

Register via sdk.WithJudgeProvider(provider) when opening a conversation.

For custom runtime guardrails beyond the built-in check types, implement ProviderHook to intercept LLM calls:

type ProviderHook interface {
	Name() string
	BeforeCall(ctx context.Context, req *ProviderRequest) Decision
	AfterCall(ctx context.Context, req *ProviderRequest, resp *ProviderResponse) Decision
}

Optionally implement ChunkInterceptor for streaming interception:

type ChunkInterceptor interface {
	OnChunk(ctx context.Context, chunk *providers.StreamChunk) Decision
}

Register via sdk.WithProviderHook(hook).

For no-code extensibility, use the rest_eval and a2a_eval check types. These let you delegate evaluation to any HTTP endpoint or A2A-compatible agent without writing Go code.