
Checks Reference

PromptKit has a unified check system: one set of check types usable across three surfaces.

  • Assertions — validate LLM behavior in Arena test scenarios (assertions: field in scenario YAML).
  • Guardrails — enforce runtime policy in production (validators: field in pack YAML).
  • Evals — monitor quality in production (evals: field in pack YAML).

All check types are implemented as EvalTypeHandler instances registered in the runtime/evals/ package. See the Unified Check Model for the conceptual overview.

Surface legend used in the tables below:

| Symbol | Surface |
|--------|---------|
| A | Assertion |
| G | Guardrail |
| E | Eval |

“Streaming” indicates whether the check supports incremental evaluation during streaming responses (relevant for guardrails).


| Type | Aliases | Params | Surfaces | Streaming |
|------|---------|--------|----------|-----------|
| contains | content_includes | patterns (string[]) | A G E | No |
| regex | content_matches | pattern (string) | A G E | No |
| content_excludes | banned_words, content_not_includes | patterns (string[]) | A G E | Yes |
| contains_any | content_includes_any | patterns (string[]) | A E | No |
| min_length | | min or min_characters (int) | A E | No |
| max_length | length | max or max_characters (int), max_tokens (int) | A G E | Yes |
| sentence_count | max_sentences | max or max_sentences (int) | A G E | No |
| field_presence | required_fields | fields or required_fields (string[]) | A G E | No |
| cosine_similarity | | reference (string), min_similarity (float) | A E | No |

When content_excludes is invoked via the banned_words alias, match_mode defaults to word_boundary.
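To illustrate why the match mode matters, here is a minimal Go sketch that contrasts plain substring matching with whole-word matching. It assumes that word_boundary means "match only complete words"; the containsWholeWord helper and its case-insensitive behavior are hypothetical and are not PromptKit's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// containsWholeWord reports whether word occurs in content as a whole word.
// Hypothetical helper: case-insensitive matching is an assumption of this sketch.
func containsWholeWord(content, word string) bool {
	// QuoteMeta escapes any regex metacharacters in the banned word.
	re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(word) + `\b`)
	return re.MatchString(content)
}

func main() {
	text := "Our competitors ship weekly."

	// Substring matching flags "competitor" inside "competitors".
	fmt.Println(strings.Contains(strings.ToLower(text), "competitor"))

	// Whole-word matching does not, because "s" continues the word.
	fmt.Println(containsWholeWord(text, "competitor"))

	// The exact word still matches.
	fmt.Println(containsWholeWord(text, "competitors"))
}
```

Under this reading, a word-boundary default for banned_words avoids false positives on longer words that merely contain a banned term.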

Example — assertion (scenario YAML):

assertions:
  - type: contains
    params:
      patterns: ["thank you", "welcome"]

Example — guardrail (pack YAML):

validators:
  - type: banned_words
    params:
      patterns: ["competitor-name", "internal-only"]

Example — eval (pack YAML):

evals:
  - id: response_length
    type: max_length
    trigger: every_turn
    params:
      max: 500

| Type | Aliases | Params | Surfaces |
|------|---------|--------|----------|
| json_valid | is_valid_json, valid_json | | A E |
| json_schema | | schema (object) | A E |
| json_path | | expression (string), expected, contains, min_results, max_results | A E |

Example:

assertions:
  - type: json_path
    params:
      expression: "$.order.status"
      expected: "confirmed"

These checks evaluate tool usage within a single assistant turn.

| Type | Aliases | Params | Surfaces |
|------|---------|--------|----------|
| tools_called | tool_called | tool_names (string[]) | A G E |
| tools_not_called | | tool_names (string[]) | A G E |
| tool_args | | tool_name (string), expected_args (object) | A E |
| tool_calls_with_args | | tool_name, expected_args, result_includes | A E |
| tool_call_count | | tool (string), min (int), max (int) | A E |
| tool_call_sequence | | sequence (string[]) | A E |
| tool_call_chain | | chain (string[]) | A E |
| tool_anti_pattern | | patterns (array of {sequence, message}) | A E |
| tool_no_repeat | | tools (string[]), max_repeats (int) | E |
| tool_efficiency | | max_calls, max_errors, max_error_rate | E |
| no_tool_errors | | | A E |
| tool_result_includes | | tool_name, patterns (string[]) | A E |
| tool_result_matches | | tool_name, pattern (string) | A E |
| tool_result_has_media | | tool_name | E |
| tool_result_media_type | | tool_name, media_type | E |

Example:

assertions:
  - type: tool_call_sequence
    params:
      sequence: ["lookup_customer", "create_ticket"]

These checks evaluate tool usage across the entire session.

| Type | Aliases | Params | Surfaces |
|------|---------|--------|----------|
| tools_called (session) | tool_called | tool_names (string[]) | A G E |
| tools_not_called (session) | | tool_names (string[]) | A G E |
| tool_args (session) | | tool_name, expected_args | A E |
| tool_args_excluded_session | tools_not_called_with_args | tool_name, excluded_args | A E |

Session-level tool checks use the on_session_complete or on_conversation_complete trigger.


| Type | Params | Surfaces |
|------|--------|----------|
| agent_invoked | agent_names (string[]) | A E |
| agent_not_invoked | agent_names (string[]) | A E |
| agent_response_contains | agent_name, patterns | E |
| skill_activated | skill_names (string[]) | A E |
| skill_not_activated | skill_names (string[]) | A E |
| skill_activation_order | sequence (string[]) | A E |

Example:

assertions:
  - type: agent_invoked
    params:
      agent_names: ["billing-agent"]

| Type | Params | Surfaces |
|------|--------|----------|
| workflow_complete | | A E |
| workflow_state_is | state (string) | A E |
| workflow_transitioned_to | state (string) | A E |
| workflow_transition_order | sequence (string[]) | A E |
| workflow_tool_access | rules (array of {state, allowed}) | A E |

Example:

assertions:
  - type: workflow_transition_order
    params:
      sequence: ["triage", "investigation", "resolution"]

| Type | Params | Surfaces |
|------|--------|----------|
| image_format | formats (string[]) | A E |
| image_dimensions | min_width, max_width, min_height, max_height | A E |
| audio_format | formats (string[]) | A E |
| audio_duration | min_seconds, max_seconds | A E |
| video_duration | min_seconds, max_seconds | A E |
| video_resolution | min_width, max_width, presets (string[]) | A E |

LLM judge checks send the assistant output (or full session) to a language model for evaluation. The judge returns a score (0.0 to 1.0) and reasoning.

llm_judge performs turn-level LLM evaluation. The judge sees the current assistant response and evaluates it against the provided criteria.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| criteria | string | Yes | What the judge should evaluate |
| rubric | string | No | Detailed scoring guidance |
| model | string | No | Model to use for judging |
| system_prompt | string | No | Override the default judge system prompt |
| min_score | float | No | Minimum score threshold for passing |
| extra | object | No | Additional provider-specific parameters |

Surfaces: A E

Session-level LLM evaluation. The judge sees the full conversation. Alias: llm_judge_conversation.

Same params as llm_judge. Surfaces: A E

Evaluates tool usage patterns via an LLM judge.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| criteria | string | Yes | What the judge should evaluate about tool usage |
| tools | string[] | No | Filter to specific tools |

Plus all standard judge params (rubric, model, system_prompt, min_score, extra). Surfaces: A E

Example:

assertions:
  - type: llm_judge
    params:
      criteria: "Response is empathetic and addresses the customer's concern"
      min_score: 0.7

External checks delegate evaluation to HTTP endpoints or A2A agents. These are the no-code extensibility points for teams that want custom evaluation logic without writing Go.

rest_eval POSTs turn data to an HTTP endpoint. The endpoint must return {"score": float, "reasoning": string}. The passed field is accepted for backward compatibility but ignored; pass/fail is determined by the assertion or guardrail wrapper based on score thresholds.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| url | string | Yes | Endpoint URL |
| method | string | No | HTTP method (default: POST) |
| headers | object | No | Request headers; values support ${ENV_VAR} expansion |
| timeout | string | No | Request timeout |
| include_messages | bool | No | Include conversation messages in payload |
| include_tool_calls | bool | No | Include tool call records in payload |
| criteria | string | No | Evaluation criteria passed to the endpoint |
| min_score | float | No | Minimum score threshold |
| extra | object | No | Additional fields merged into the request body |

Surfaces: A E
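As a rough illustration of the endpoint side of this contract, the Go sketch below returns the required score and reasoning fields. The request payload schema is not specified in this reference, so the handler decodes the body generically; the "content" field name, the /evaluate path, and the toy scoring rule are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// evalResponse mirrors the response shape required by the check:
// {"score": float, "reasoning": string}.
type evalResponse struct {
	Score     float64 `json:"score"`
	Reasoning string  `json:"reasoning"`
}

// scoreText is a toy rule (hypothetical): penalize any mention of "password".
func scoreText(text string) evalResponse {
	if strings.Contains(strings.ToLower(text), "password") {
		return evalResponse{Score: 0.0, Reasoning: "response mentions a password"}
	}
	return evalResponse{Score: 1.0, Reasoning: "no sensitive terms found"}
}

// handleEvaluate decodes the POSTed JSON generically and replies with a score.
func handleEvaluate(w http.ResponseWriter, r *http.Request) {
	var payload map[string]any
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	// Assumption: the assistant output arrives under a "content" key.
	text, _ := payload["content"].(string)
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(scoreText(text)); err != nil {
		log.Printf("encode response: %v", err)
	}
}

func main() {
	// Exercise the handler in-process; a real service would instead call
	// http.ListenAndServe(":8080", http.HandlerFunc(handleEvaluate)).
	req := httptest.NewRequest(http.MethodPost, "/evaluate",
		strings.NewReader(`{"content": "Your order has shipped."}`))
	rec := httptest.NewRecorder()
	handleEvaluate(rec, req)
	fmt.Print(rec.Body.String())
}
```

The handler returns only a score and reasoning, leaving the pass/fail decision to the min_score threshold on the calling side.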

POSTs full session data to an HTTP endpoint. Same params as rest_eval. Surfaces: A E

a2a_eval sends evaluation data to an A2A-protocol eval agent.

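| Param | Type | Required | Description |
|-------|------|----------|-------------|
| agent_url | string | Yes | URL of the A2A eval agent |
| auth_token | string | No | Auth token; supports ${ENV_VAR} expansion |
| criteria | string | No | Evaluation criteria |
| min_score | float | No | Minimum score threshold |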

Surfaces: A E

Session-level A2A evaluation. Same params as a2a_eval. Surfaces: A E

Example:

evals:
  - id: safety_check
    type: rest_eval
    trigger: every_turn
    params:
      url: "https://safety.internal/evaluate"
      headers:
        Authorization: "Bearer ${SAFETY_API_KEY}"
      criteria: "Content is safe for all audiences"
      min_score: 0.9

| Type | Params | Surfaces |
|------|--------|----------|
| latency_budget | max_ms (int) | A E |
| cost_budget | max_cost_usd, max_total_tokens | E |

cost_budget is session-level and fires on the on_session_complete trigger.


| Type | Params | Surfaces |
|------|--------|----------|
| guardrail_triggered | guardrail (string), should_trigger (bool) | A E |
| invariant_fields_preserved | tool (string), fields (string[]) | E |

guardrail_triggered inspects prior eval results in the same batch, verifying that a specific guardrail did (or did not) fire.


These checks compare behavior across prompt variants or input perturbations.

| Type | Params | Surfaces |
|------|--------|----------|
| outcome_equivalent | metric: one of "tool_calls", "final_state", "content_hash" | E |
| directional | check: one of "same_tool_calls", "same_outcome", "similar_content" | E |

For backward compatibility, some parameter names are aliased. When you use an aliased param name, it is automatically mapped to the canonical name before the handler runs.

| Check Type(s) | Alias Param | Canonical Param |
|---------------|-------------|-----------------|
| content_excludes, banned_words | words | patterns |
| max_length, length | max_characters, max_chars | max |
| min_length | min_characters, min_chars | min |
| sentence_count, max_sentences | max_sentences | max |
| field_presence, required_fields | required_fields | fields |
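The alias mapping can be pictured as a small rename pass that runs before the handler sees its params. The Go sketch below encodes the table as a lookup map; the normalizeParams function and the rule that an explicitly set canonical name wins over an alias are assumptions for illustration, not PromptKit's actual code.

```go
package main

import "fmt"

// paramAliases maps check type -> alias param -> canonical param,
// transcribed from the alias table.
var paramAliases = map[string]map[string]string{
	"content_excludes": {"words": "patterns"},
	"banned_words":     {"words": "patterns"},
	"max_length":       {"max_characters": "max", "max_chars": "max"},
	"length":           {"max_characters": "max", "max_chars": "max"},
	"min_length":       {"min_characters": "min", "min_chars": "min"},
	"sentence_count":   {"max_sentences": "max"},
	"max_sentences":    {"max_sentences": "max"},
	"field_presence":   {"required_fields": "fields"},
	"required_fields":  {"required_fields": "fields"},
}

// normalizeParams returns a copy of params with alias names replaced by
// canonical names. Assumption: a canonical name that is set explicitly
// takes precedence over any alias for it.
func normalizeParams(checkType string, params map[string]any) map[string]any {
	aliases := paramAliases[checkType] // nil map is safe to read from
	out := make(map[string]any, len(params))

	// First pass: copy every param that is not an alias.
	for name, value := range params {
		if _, isAlias := aliases[name]; !isAlias {
			out[name] = value
		}
	}
	// Second pass: fill in canonical names from aliases where still unset.
	for name, value := range params {
		if canonical, isAlias := aliases[name]; isAlias {
			if _, alreadySet := out[canonical]; !alreadySet {
				out[canonical] = value
			}
		}
	}
	return out
}

func main() {
	fmt.Println(normalizeParams("max_length", map[string]any{"max_chars": 500}))
	// prints: map[max:500]
}
```

Keeping the mapping per check type matters because the same alias name can be canonical elsewhere; for example, max_sentences is both a check type and an alias param.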

PromptKit provides several extensibility points for adding custom check logic.

Implement the EvalTypeHandler interface and register it:

type EvalTypeHandler interface {
    Type() string
    Eval(ctx context.Context, evalCtx *EvalContext, params map[string]any) (*EvalResult, error)
}

Register globally (available to all registries):

evals.RegisterDefault(handler)

Or register on a specific registry instance:

registry.Register(handler)

For checks that need streaming support in guardrails, implement StreamableEvalHandler. This enables incremental evaluation on each streaming chunk, allowing early abort.

type StreamableEvalHandler interface {
    EvalTypeHandler
    EvalPartial(ctx context.Context, content string, params map[string]any) (*EvalResult, error)
}

Define eval handlers as external subprocesses in RuntimeConfig YAML. The subprocess receives JSON on stdin and writes JSON to stdout, so you can use any language.

spec:
  evals:
    my_python_eval:
      command: python3
      args: ["./evaluators/my_eval.py"]
      env: ["EVAL_TYPE=my_python_eval"]
      timeoutMs: 5000

Stdin receives:

{"type": "my_python_eval", "params": {...}, "content": "...", "context": {...}}

Stdout must return:

{"score": 0.85, "detail": "Explanation text", "data": {}}

The score value (0.0 to 1.0) is the eval's output. Pass/fail is not determined by the handler; assertion and guardrail wrappers apply score thresholds to determine pass/fail.
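Because the protocol is plain JSON over stdin and stdout, the evaluator can be written in any language. The Go sketch below reads one request and writes one response using the field names shown above; the keywords param and the scoring rule (fraction of keywords found in the content) are hypothetical, and how the runtime treats a non-zero exit code is not covered here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// request mirrors the JSON object received on stdin.
type request struct {
	Type    string         `json:"type"`
	Params  map[string]any `json:"params"`
	Content string         `json:"content"`
	Context map[string]any `json:"context"`
}

// response mirrors the JSON object expected on stdout.
type response struct {
	Score  float64        `json:"score"`
	Detail string         `json:"detail"`
	Data   map[string]any `json:"data"`
}

// evaluate scores the fraction of configured keywords present in the content.
// The "keywords" param is a made-up example, not a PromptKit parameter.
func evaluate(req request) response {
	raw, _ := req.Params["keywords"].([]any)
	if len(raw) == 0 {
		return response{Score: 1.0, Detail: "no keywords configured", Data: map[string]any{}}
	}
	lower := strings.ToLower(req.Content)
	found := 0
	for _, k := range raw {
		if s, ok := k.(string); ok && strings.Contains(lower, strings.ToLower(s)) {
			found++
		}
	}
	return response{
		Score:  float64(found) / float64(len(raw)),
		Detail: fmt.Sprintf("%d of %d keywords found", found, len(raw)),
		Data:   map[string]any{"found": found},
	}
}

func main() {
	var req request
	if err := json.NewDecoder(os.Stdin).Decode(&req); err != nil {
		fmt.Fprintln(os.Stderr, "decode request:", err)
		os.Exit(1)
	}
	if err := json.NewEncoder(os.Stdout).Encode(evaluate(req)); err != nil {
		fmt.Fprintln(os.Stderr, "encode response:", err)
		os.Exit(1)
	}
}
```

Diagnostics go to stderr so that stdout carries only the JSON result.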

Customize how LLM judge checks call language models:

type JudgeProvider interface {
    Judge(ctx context.Context, opts JudgeOpts) (*JudgeResult, error)
}

Register via sdk.WithJudgeProvider(provider) when opening a conversation.

For custom runtime guardrails beyond the built-in check types, implement ProviderHook to intercept LLM calls:

type ProviderHook interface {
    Name() string
    BeforeCall(ctx context.Context, req *ProviderRequest) Decision
    AfterCall(ctx context.Context, req *ProviderRequest, resp *ProviderResponse) Decision
}

Optionally implement ChunkInterceptor for streaming interception:

type ChunkInterceptor interface {
    OnChunk(ctx context.Context, chunk *providers.StreamChunk) Decision
}

Register via sdk.WithProviderHook(hook).

For no-code extensibility, use the rest_eval and a2a_eval check types. These let you delegate evaluation to any HTTP endpoint or A2A-compatible agent without writing Go code.