Eval Framework
Understanding PromptKit’s automated evaluation system for LLM outputs.
Overview
Evals are automated quality checks that run against LLM outputs. They answer questions like “Did the assistant stay on topic?”, “Was the JSON valid?”, or “Did it call the right tools?”. Evals are defined in pack files and execute automatically during conversations or against recorded sessions.
Evals use the same check types as assertions and guardrails. The difference is when and where they run: evals can fire in production on every turn, on a sampled subset, or at session close, whereas assertions only run during Arena tests and guardrails run inline before the response is delivered.
Eval handlers produce scores only (0.0–1.0). They never determine pass/fail — that responsibility belongs to assertion and guardrail wrappers. When used as standalone evals, the score is recorded as a metric and emitted as an event.
```
Pack File (evals) ──► EvalRunner ──► ResultWriter ──► Metrics / Metadata
```

Pack Evals vs Scenario Assertions
PromptKit offers two complementary evaluation mechanisms that share the same underlying check types:
| | Pack Evals | Scenario Assertions |
|---|---|---|
| Defined in | Pack file (evals array) | Arena scenario YAML |
| Scope | Any conversation using the pack | Specific test scenarios |
| When | Production + testing | Testing only |
| Check types | Any check from the unified catalog | Any check from the unified catalog |
| Trigger | Configurable (every turn, sampling, session close) | Every turn / conversation end |
Pack evals travel with your pack — they run in production, in Arena tests, and anywhere the pack is used. Think of them as built-in quality monitors.
Scenario assertions are Arena-specific test expectations. They validate specific conversation flows defined in your test scenarios.
Both can coexist: pack evals provide baseline quality monitoring while scenario assertions verify specific behaviors. See Unified Check Model for how evals, assertions, and guardrails relate.
Eval Definition Structure
Each eval is an EvalDef object in the pack’s evals array. The structure combines a check type with trigger, sampling, threshold, and metric configuration:
```json
{
  "id": "quality_check",
  "type": "contains",
  "trigger": "every_turn",
  "params": { "patterns": ["thank you"] },
  "threshold": { "min_score": 0.8 },
  "enabled": true,
  "sample_percentage": 10,
  "metric": {
    "name": "response_quality",
    "type": "gauge",
    "labels": { "category": "tone" }
  }
}
```

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the eval within the pack |
| `type` | Yes | Check type from the Checks Reference |
| `trigger` | Yes | When the eval fires (see Triggers) |
| `params` | Varies | Parameters specific to the check type |
| `threshold` | No | Pass/fail threshold (e.g. `min_score`) |
| `enabled` | No | Whether the eval is active (default: `true`) |
| `sample_percentage` | No | Percentage of turns/sessions to evaluate (for sampling triggers) |
| `groups` | No | Eval groups for filtering (see Eval Groups) |
| `metric` | No | Prometheus metric configuration (see Metrics & Prometheus) |
Triggers
Each eval specifies when it should fire:
| Trigger | Description | Use Case |
|---|---|---|
| `every_turn` | After each assistant response | Real-time quality checks |
| `on_session_complete` | When session closes | Summary evaluations |
| `sample_turns` | Percentage of turns (hash-based) | Production sampling |
| `sample_sessions` | Percentage of sessions (hash-based) | Production sampling |
| `on_conversation_complete` | When multi-session conversation closes | Final evaluation |
| `on_workflow_step` | After a workflow state transition | Workflow validation |
Sampling is deterministic — the same session ID and turn index always produce the same sampling decision (FNV-1a hash). This ensures reproducible behavior across runs.
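The deterministic sampling described above can be sketched with Go's standard `hash/fnv` package. The key format (`sessionID:turnIndex`) and the modulo bucketing are assumptions for illustration; PromptKit's internal implementation may differ in those details.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample is a minimal sketch of deterministic hash-based sampling.
// The hash key format and bucketing here are illustrative assumptions,
// not PromptKit's exact internals.
func shouldSample(sessionID string, turnIndex int, samplePercentage uint64) bool {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s:%d", sessionID, turnIndex)
	// Map the hash into 0..99 and compare against the percentage.
	return h.Sum64()%100 < samplePercentage
}

func main() {
	// The same inputs always yield the same decision.
	a := shouldSample("session-123", 4, 10)
	b := shouldSample("session-123", 4, 10)
	fmt.Println(a == b) // always true: sampling is deterministic
}
```

Because the decision depends only on the inputs, re-running an eval suite over the same recorded sessions evaluates exactly the same turns.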
```json
{
  "id": "toxicity_check",
  "type": "contains",
  "trigger": "sample_turns",
  "sample_percentage": 10,
  "params": { "patterns": ["harmful", "offensive"] }
}
```

Eval Groups
Evals can belong to one or more groups, enabling selective execution. When no explicit groups are set, evals are automatically classified based on their handler type:
| Group | Value | Assigned To |
|---|---|---|
| Default | default | All evals with no explicit groups |
| Fast-running | fast-running | Deterministic checks: contains, regex, json_valid, tools_called, workflow checks, etc. |
| Long-running | long-running | Compute/network-intensive: llm_judge, cosine_similarity, outcome_equivalent, a2a_eval, rest_eval, exec handlers |
| External | external | External system calls: llm_judge, a2a_eval, rest_eval, exec handlers |
Automatic classification
Evals with no explicit groups field receive default plus one or more well-known groups. For example, a contains eval gets ["default", "fast-running"], while an llm_judge eval gets ["default", "long-running", "external"].
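The automatic classification can be sketched as a lookup over check types. The lists below mirror the table above; treat them as illustrative rather than an exhaustive copy of PromptKit's internal classification.

```go
package main

import "fmt"

// classifyGroups sketches the automatic group assignment: every eval gets
// "default", deterministic checks get "fast-running", expensive checks get
// "long-running", and checks that call out of process get "external".
func classifyGroups(checkType string) []string {
	longRunning := map[string]bool{
		"llm_judge": true, "cosine_similarity": true,
		"outcome_equivalent": true, "a2a_eval": true,
		"rest_eval": true, "exec": true,
	}
	external := map[string]bool{
		"llm_judge": true, "a2a_eval": true,
		"rest_eval": true, "exec": true,
	}
	groups := []string{"default"}
	if longRunning[checkType] {
		groups = append(groups, "long-running")
	} else {
		groups = append(groups, "fast-running")
	}
	if external[checkType] {
		groups = append(groups, "external")
	}
	return groups
}

func main() {
	fmt.Println(classifyGroups("contains"))  // [default fast-running]
	fmt.Println(classifyGroups("llm_judge")) // [default long-running external]
}
```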
Explicit groups
Setting groups on an eval definition overrides the automatic classification entirely:
```json
{
  "id": "compliance_check",
  "type": "llm_judge",
  "trigger": "every_turn",
  "groups": ["compliance", "safety"],
  "params": { "criteria": "Check regulatory compliance" }
}
```

This eval will only match when filtering for compliance or safety — it will no longer match default, long-running, or external.
Filtering by group
In the SDK, use EvalGroups to select which groups to run:
```go
// Only run fast evals in the hot path
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
	PackPath:   "./app.pack.json",
	Messages:   messages,
	EvalGroups: []string{"fast-running"},
})
```

When EvalGroups is nil or empty, all evals run regardless of group.
Dispatch Patterns
The eval system supports three dispatch patterns for different deployment scenarios:
Pattern A: InProcDispatcher
Runs evals synchronously in the same process. Used by Arena and simple SDK deployments.
```
Conversation ──► InProcDispatcher ──► EvalRunner ──► Handlers ──► ResultWriter
```

Pattern B: EventDispatcher
Publishes eval requests to an event bus for async processing by workers. Used in production SDK deployments.
```
Conversation ──► EventDispatcher ──► Event Bus ──► EvalWorker ──► EvalRunner ──► ResultWriter
```

Pattern C: EventBusEvalListener
Subscribes to EventBus message.created events and triggers evals automatically. No explicit middleware needed.
```
RecordingStage ──► EventBus ──► EventBusEvalListener ──► SessionAccumulator ──► Dispatcher ──► Runner
```

The EventBusEvalListener uses a SessionAccumulator that accumulates messages per session and builds EvalContext on demand. Sessions expire after a configurable TTL (default: 30 minutes).
Eval Executor (Arena)
The EvalConversationExecutor evaluates saved conversations from recordings:
- Load recording via adapter registry
- Build conversation context from recorded messages
- Apply turn-level assertions to each assistant message
- Evaluate conversation-level assertions
- Run pack session evals (if configured)
- Return aggregated results
This enables offline evaluation of historical conversations without re-running them against a live LLM.
Metrics & Prometheus
Eval results can be recorded as Prometheus metrics using the unified metrics.Collector. The same collector records both pipeline operational metrics and eval metrics into a standard prometheus.Registry.
```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/AltairaLabs/PromptKit/runtime/metrics"
	"github.com/AltairaLabs/PromptKit/sdk"
)

reg := prometheus.NewRegistry()
collector := metrics.NewCollector(metrics.CollectorOpts{
	Registerer:  reg,
	Namespace:   "myapp",
	ConstLabels: prometheus.Labels{"env": "prod"},
})

conv, _ := sdk.Open("./app.pack.json", "chat",
	sdk.WithMetrics(collector, nil),
)
```

When WithMetrics() is configured, all eval results are automatically recorded as Prometheus metrics alongside pipeline metrics. Evals with an explicit metric definition use that configuration; evals without one get an auto-generated gauge metric named after the eval ID. Eval metrics are namespaced under {namespace}_eval_ to distinguish them from pipeline metrics. For example, a metric named response_quality_score with namespace myapp becomes myapp_eval_response_quality_score, and an eval with ID check-tone without an explicit metric becomes myapp_eval_check-tone. See Metrics Reference for the full catalog.
Metric Types
| Type | Behavior |
|---|---|
| `gauge` | Set to the eval’s score value |
| `counter` | Increment count on each execution |
| `histogram` | Observe value with configurable buckets, track sum/count |
| `boolean` | 1.0 if score ≥ 1.0, 0.0 otherwise |
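The table above can be condensed into a small sketch of how a score maps onto each metric type. The function name and shape are illustrative assumptions, not PromptKit's collector API.

```go
package main

import "fmt"

// recordedValue sketches what value each metric type receives for a given
// eval score (illustrative only; the real collector writes to Prometheus).
func recordedValue(metricType string, score float64) float64 {
	switch metricType {
	case "gauge", "histogram":
		return score // gauge is set to the score; histogram observes it
	case "counter":
		return 1 // each execution increments the counter by one
	case "boolean":
		if score >= 1.0 {
			return 1.0
		}
		return 0.0
	}
	return 0
}

func main() {
	fmt.Println(recordedValue("gauge", 0.8))   // 0.8
	fmt.Println(recordedValue("boolean", 0.8)) // 0
	fmt.Println(recordedValue("boolean", 1.0)) // 1
}
```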
Label Sources
Pack-author labels are declared in the metric.labels field of each eval definition:
```json
{
  "id": "response_quality",
  "type": "llm_judge",
  "trigger": "every_turn",
  "metric": {
    "name": "response_quality_score",
    "type": "histogram",
    "range": { "min": 0, "max": 1 },
    "labels": { "eval_type": "llm_judge", "category": "quality" }
  },
  "params": { "criteria": "Rate the quality of the response" }
}
```

Const labels are set via CollectorOpts.ConstLabels — process-level dimensions (env, region) baked into the metric descriptor.
Instance labels are set via CollectorOpts.InstanceLabels and bound per-conversation — conversation-level dimensions (tenant, prompt_name).
Label names must match Prometheus naming rules (^[a-zA-Z_][a-zA-Z0-9_]*$) and must not start with __ (reserved by Prometheus). Invalid label names are caught during pack validation.
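The validation rule quoted above is easy to reproduce with Go's `regexp` package; a sketch of the check that pack validation performs:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelNameRe is the Prometheus label-name pattern quoted above.
var labelNameRe = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// validLabelName applies both rules: the naming pattern and the
// reserved "__" prefix check.
func validLabelName(name string) bool {
	return labelNameRe.MatchString(name) && !strings.HasPrefix(name, "__")
}

func main() {
	fmt.Println(validLabelName("eval_type")) // true
	fmt.Println(validLabelName("__meta"))    // false: reserved prefix
	fmt.Println(validLabelName("1bad"))      // false: must not start with a digit
}
```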
Events
Eval results emit events through the EventBus:
| Event | Constant | When |
|---|---|---|
| `eval.completed` | `EventEvalCompleted` | Eval finished successfully (regardless of score) |
| `eval.failed` | `EventEvalFailed` | Eval handler returned an error |
The eval.completed event carries an EvalCompletedData payload with the eval ID, type, score, and derived Passed field (IsPassed() — true when score is nil or ≥ 1.0). The eval.failed event indicates an infrastructure error (the handler itself errored), not a low score.
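The IsPassed() derivation described above amounts to a one-line rule, sketched here with an optional score modeled as a pointer:

```go
package main

import "fmt"

// isPassed mirrors the semantics described above: a nil score (nothing to
// compare) or a score of 1.0 or more counts as passed.
func isPassed(score *float64) bool {
	return score == nil || *score >= 1.0
}

func main() {
	full, partial := 1.0, 0.6
	fmt.Println(isPassed(nil))      // true: no score means passed
	fmt.Println(isPassed(&full))    // true
	fmt.Println(isPassed(&partial)) // false
}
```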
Pack Eval Resolution
When both pack-level and prompt-level evals are defined, they are merged:
- Prompt evals override pack evals where IDs match
- Pack-only evals are preserved
- Prompt-only evals are appended
This allows packs to define baseline evals while individual prompts customize or extend them.
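The three merge rules can be sketched as a simple ID-keyed merge. The `evalDef` struct here is a pared-down stand-in for PromptKit's EvalDef, and the function name is an assumption:

```go
package main

import "fmt"

// evalDef is a minimal stand-in for PromptKit's EvalDef.
type evalDef struct {
	ID   string
	Type string
}

// mergeEvals applies the resolution rules above: prompt evals override
// pack evals with matching IDs, pack-only evals are preserved, and
// prompt-only evals are appended.
func mergeEvals(pack, prompt []evalDef) []evalDef {
	overrides := map[string]evalDef{}
	for _, e := range prompt {
		overrides[e.ID] = e
	}
	var merged []evalDef
	seen := map[string]bool{}
	for _, e := range pack {
		if o, ok := overrides[e.ID]; ok {
			merged = append(merged, o) // prompt wins on an ID match
		} else {
			merged = append(merged, e) // pack-only eval preserved
		}
		seen[e.ID] = true
	}
	for _, e := range prompt {
		if !seen[e.ID] {
			merged = append(merged, e) // prompt-only eval appended
		}
	}
	return merged
}

func main() {
	pack := []evalDef{{ID: "tone", Type: "contains"}, {ID: "json", Type: "json_valid"}}
	prompt := []evalDef{{ID: "tone", Type: "llm_judge"}, {ID: "safety", Type: "regex"}}
	for _, e := range mergeEvals(pack, prompt) {
		fmt.Println(e.ID, e.Type)
	}
}
```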
Example
See the eval-test example for a working Arena configuration that evaluates saved conversations with both deterministic assertions and LLM judge evals.
See Also
- Checks Reference — All check types and parameters
- Unified Check Model — How evals, assertions, and guardrails relate
- Run Evals — Programmatic eval execution via SDK
- Assertions Reference — Test-time checks
- Observability — EventBus architecture
- Session Recording — How recordings feed into evals
- Metrics Reference — Complete catalog of all emitted metrics
- Monitor Events — Event hooks and Prometheus metrics in SDK