Eval Framework

Understanding PromptKit’s automated evaluation system for LLM outputs.

Evals are automated quality checks that run against LLM outputs. They answer questions like “Did the assistant stay on topic?”, “Was the JSON valid?”, or “Did it call the right tools?”. Evals are defined in pack files and execute automatically during conversations or against recorded sessions.

Evals use the same check types as assertions and guardrails. The difference is when and where they run: evals can fire in production on every turn, on a sampled subset, or at session close, whereas assertions only run during Arena tests and guardrails run inline before the response is delivered.

Eval handlers produce scores only (0.0–1.0). They never determine pass/fail — that responsibility belongs to assertion and guardrail wrappers. When used as standalone evals, the score is recorded as a metric and emitted as an event.

Pack File (evals) ──► EvalRunner ──► ResultWriter ──► Metrics / Metadata

PromptKit offers two complementary evaluation mechanisms that share the same underlying check types:

|                 | Pack Evals                                         | Scenario Assertions               |
|-----------------|----------------------------------------------------|-----------------------------------|
| **Defined in**  | Pack file (`evals` array)                          | Arena scenario YAML               |
| **Scope**       | Any conversation using the pack                    | Specific test scenarios           |
| **When**        | Production + testing                               | Testing only                      |
| **Check types** | Any check from the unified catalog                 | Any check from the unified catalog |
| **Trigger**     | Configurable (every turn, sampling, session close) | Every turn / conversation end     |

Pack evals travel with your pack — they run in production, in Arena tests, and anywhere the pack is used. Think of them as built-in quality monitors.

Scenario assertions are Arena-specific test expectations. They validate specific conversation flows defined in your test scenarios.

Both can coexist: pack evals provide baseline quality monitoring while scenario assertions verify specific behaviors. See Unified Check Model for how evals, assertions, and guardrails relate.

Each eval is an EvalDef object in the pack’s evals array. The structure combines a check type with trigger, sampling, threshold, and metric configuration:

```json
{
  "id": "quality_check",
  "type": "contains",
  "trigger": "every_turn",
  "params": { "patterns": ["thank you"] },
  "threshold": { "min_score": 0.8 },
  "enabled": true,
  "sample_percentage": 10,
  "metric": {
    "name": "response_quality",
    "type": "gauge",
    "labels": { "category": "tone" }
  }
}
```
| Field               | Required | Description                                                    |
|---------------------|----------|----------------------------------------------------------------|
| `id`                | Yes      | Unique identifier for the eval within the pack                 |
| `type`              | Yes      | Check type from the Checks Reference                           |
| `trigger`           | Yes      | When the eval fires (see Triggers)                             |
| `params`            | Varies   | Parameters specific to the check type                          |
| `threshold`         | No       | Pass/fail threshold (e.g. `min_score`)                         |
| `enabled`           | No       | Whether the eval is active (default: `true`)                   |
| `sample_percentage` | No       | Percentage of turns/sessions to evaluate (for sampling triggers) |
| `groups`            | No       | Eval groups for filtering (see Eval Groups)                    |
| `metric`            | No       | Prometheus metric configuration (see Metrics & Prometheus)     |

Each eval specifies when it should fire:

| Trigger                    | Description                            | Use Case                 |
|----------------------------|----------------------------------------|--------------------------|
| `every_turn`               | After each assistant response          | Real-time quality checks |
| `on_session_complete`      | When session closes                    | Summary evaluations      |
| `sample_turns`             | Percentage of turns (hash-based)       | Production sampling      |
| `sample_sessions`          | Percentage of sessions (hash-based)    | Production sampling      |
| `on_conversation_complete` | When multi-session conversation closes | Final evaluation         |
| `on_workflow_step`         | After a workflow state transition      | Workflow validation      |

Sampling is deterministic — the same session ID and turn index always produce the same sampling decision (FNV-1a hash). This ensures reproducible behavior across runs.
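The deterministic decision can be sketched in a few lines. This is an illustrative reconstruction, not the SDK's actual code — in particular, the exact hash-key layout (`sessionID:turnIndex`) is an assumption; only the FNV-1a hash and the reproducibility property come from the text above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic sampling decision: the same
// sessionID and turnIndex always hash to the same bucket, so repeated
// runs reach identical decisions. The key format is an assumption.
func shouldSample(sessionID string, turnIndex int, samplePercentage uint64) bool {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s:%d", sessionID, turnIndex)
	return h.Sum64()%100 < samplePercentage
}

func main() {
	a := shouldSample("sess-42", 3, 10)
	b := shouldSample("sess-42", 3, 10)
	fmt.Println(a == b) // always true: the decision is reproducible
}
```

Because the decision depends only on the inputs, replaying a recorded session produces exactly the same set of sampled turns as the live run did.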

```json
{
  "id": "toxicity_check",
  "type": "contains",
  "trigger": "sample_turns",
  "sample_percentage": 10,
  "params": {
    "patterns": ["harmful", "offensive"]
  }
}
```

Evals can belong to one or more groups, enabling selective execution. When no explicit groups are set, evals are automatically classified based on their handler type:

| Group        | Value          | Assigned To                                                                                                        |
|--------------|----------------|--------------------------------------------------------------------------------------------------------------------|
| Default      | `default`      | All evals with no explicit groups                                                                                  |
| Fast-running | `fast-running` | Deterministic checks: `contains`, `regex`, `json_valid`, `tools_called`, workflow checks, etc.                     |
| Long-running | `long-running` | Compute/network-intensive: `llm_judge`, `cosine_similarity`, `outcome_equivalent`, `a2a_eval`, `rest_eval`, exec handlers |
| External     | `external`     | External system calls: `llm_judge`, `a2a_eval`, `rest_eval`, exec handlers                                         |

Evals with no explicit groups field receive default plus one or more well-known groups. For example, a contains eval gets ["default", "fast-running"], while an llm_judge eval gets ["default", "long-running", "external"].
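The classification rule can be sketched as a lookup, with explicit groups short-circuiting it. The membership sets below are taken from the table above; treating `exec` as a single handler type is an assumption.

```go
package main

import "fmt"

// classifyGroups assigns well-known groups based on the check type,
// unless explicit groups are given, which override auto-classification.
func classifyGroups(evalType string, explicit []string) []string {
	if len(explicit) > 0 {
		return explicit // explicit groups replace the automatic set
	}
	longRunning := map[string]bool{
		"llm_judge": true, "cosine_similarity": true,
		"outcome_equivalent": true, "a2a_eval": true,
		"rest_eval": true, "exec": true,
	}
	external := map[string]bool{
		"llm_judge": true, "a2a_eval": true, "rest_eval": true, "exec": true,
	}
	groups := []string{"default"}
	if longRunning[evalType] {
		groups = append(groups, "long-running")
	} else {
		groups = append(groups, "fast-running")
	}
	if external[evalType] {
		groups = append(groups, "external")
	}
	return groups
}

func main() {
	fmt.Println(classifyGroups("contains", nil))  // [default fast-running]
	fmt.Println(classifyGroups("llm_judge", nil)) // [default long-running external]
}
```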

Setting groups on an eval definition overrides the automatic classification entirely:

```json
{
  "id": "compliance_check",
  "type": "llm_judge",
  "trigger": "every_turn",
  "groups": ["compliance", "safety"],
  "params": { "criteria": "Check regulatory compliance" }
}
```

This eval will only match when filtering for compliance or safety — it will no longer match default, long-running, or external.

In the SDK, use EvalGroups to select which groups to run:

```go
// Only run fast evals in the hot path
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
	PackPath:   "./app.pack.json",
	Messages:   messages,
	EvalGroups: []string{"fast-running"},
})
```

When EvalGroups is nil or empty, all evals run regardless of group.
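The matching semantics that behavior implies can be sketched as a set-overlap check. This is an assumed reconstruction ("any overlap matches"), not the SDK's actual implementation.

```go
package main

import "fmt"

// matchesGroups reports whether an eval should run for a given filter:
// a nil/empty filter matches everything; otherwise any overlap between
// the eval's groups and the filter counts as a match.
func matchesGroups(evalGroups, filter []string) bool {
	if len(filter) == 0 {
		return true // nil or empty filter runs all evals
	}
	want := make(map[string]bool, len(filter))
	for _, g := range filter {
		want[g] = true
	}
	for _, g := range evalGroups {
		if want[g] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchesGroups([]string{"default", "fast-running"}, []string{"fast-running"})) // true
	fmt.Println(matchesGroups([]string{"compliance", "safety"}, []string{"fast-running"}))    // false
}
```

Note how the second call reflects the override example above: once explicit groups are set, the eval no longer matches `fast-running` or `default`.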

The eval system supports three dispatch patterns for different deployment scenarios:

In-process dispatch (`InProcDispatcher`) runs evals synchronously in the same process. Used by Arena and simple SDK deployments.

Conversation ──► InProcDispatcher ──► EvalRunner ──► Handlers ──► ResultWriter

Event-based dispatch (`EventDispatcher`) publishes eval requests to an event bus for asynchronous processing by workers. Used in production SDK deployments.

Conversation ──► EventDispatcher ──► Event Bus ──► EvalWorker ──► EvalRunner ──► ResultWriter

The event-bus listener (`EventBusEvalListener`) subscribes to EventBus `message.created` events and triggers evals automatically; no explicit middleware is needed.

RecordingStage ──► EventBus ──► EventBusEvalListener ──► SessionAccumulator ──► Dispatcher ──► Runner

The EventBusEvalListener uses a SessionAccumulator that accumulates messages per session and builds EvalContext on demand. Sessions expire after a configurable TTL (default: 30 minutes).

The EvalConversationExecutor evaluates saved conversations from recordings:

  1. Load recording via adapter registry
  2. Build conversation context from recorded messages
  3. Apply turn-level assertions to each assistant message
  4. Evaluate conversation-level assertions
  5. Run pack session evals (if configured)
  6. Return aggregated results

This enables offline evaluation of historical conversations without re-running them against a live LLM.
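Steps 2–3 of the flow above can be illustrated with a toy replay: walk the recorded messages, score each assistant turn, and collect the results. The types and the simplified `contains` scoring here are stand-ins, not the real `EvalConversationExecutor` API.

```go
package main

import (
	"fmt"
	"strings"
)

// recordedMsg is a minimal stand-in for a message loaded from a recording.
type recordedMsg struct {
	Role, Content string
}

// containsScore returns the fraction of patterns found in the content,
// a simplified version of a deterministic "contains" check.
func containsScore(content string, patterns []string) float64 {
	if len(patterns) == 0 {
		return 1.0
	}
	hits := 0
	for _, p := range patterns {
		if strings.Contains(content, p) {
			hits++
		}
	}
	return float64(hits) / float64(len(patterns))
}

// evaluateRecording applies a turn-level check to every recorded
// assistant message, without calling any live LLM.
func evaluateRecording(msgs []recordedMsg, patterns []string) []float64 {
	var scores []float64
	for _, m := range msgs {
		if m.Role != "assistant" {
			continue
		}
		scores = append(scores, containsScore(m.Content, patterns))
	}
	return scores
}

func main() {
	rec := []recordedMsg{
		{"user", "hi"},
		{"assistant", "hello, thank you for asking"},
	}
	fmt.Println(evaluateRecording(rec, []string{"thank you"})) // [1]
}
```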

Eval results can be recorded as Prometheus metrics using the unified metrics.Collector. The same collector records both pipeline operational metrics and eval metrics into a standard prometheus.Registry.

```go
import (
	"github.com/prometheus/client_golang/prometheus"

	"github.com/AltairaLabs/PromptKit/runtime/metrics"
	"github.com/AltairaLabs/PromptKit/sdk"
)

reg := prometheus.NewRegistry()
collector := metrics.NewCollector(metrics.CollectorOpts{
	Registerer:  reg,
	Namespace:   "myapp",
	ConstLabels: prometheus.Labels{"env": "prod"},
})

conv, _ := sdk.Open("./app.pack.json", "chat",
	sdk.WithMetrics(collector, nil),
)
```

When WithMetrics() is configured, all eval results are automatically recorded as Prometheus metrics alongside pipeline metrics. Evals with an explicit metric definition use that configuration; evals without one get an auto-generated gauge metric named after the eval ID. Eval metrics are namespaced under {namespace}_eval_ to distinguish them from pipeline metrics. For example, a metric named response_quality_score with namespace myapp becomes myapp_eval_response_quality_score, and an eval with ID check-tone without an explicit metric becomes myapp_eval_check-tone. See Metrics Reference for the full catalog.
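The naming rule reduces to a small derivation, sketched below. The function name and signature are illustrative; the behavior itself (fall back to the eval ID, then prefix `{namespace}_eval_`) matches the examples above.

```go
package main

import "fmt"

// evalMetricName applies the naming rule: an explicit metric name is
// kept, an eval without one falls back to its eval ID, and either way
// the result is prefixed with {namespace}_eval_.
func evalMetricName(namespace, evalID, metricName string) string {
	name := metricName
	if name == "" {
		name = evalID // auto-generated gauge named after the eval ID
	}
	return namespace + "_eval_" + name
}

func main() {
	fmt.Println(evalMetricName("myapp", "quality_check", "response_quality_score"))
	// myapp_eval_response_quality_score
	fmt.Println(evalMetricName("myapp", "check-tone", ""))
	// myapp_eval_check-tone
}
```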

| Type        | Behavior                                                  |
|-------------|-----------------------------------------------------------|
| `gauge`     | Set to the eval's score value                             |
| `counter`   | Increment count on each execution                         |
| `histogram` | Observe value with configurable buckets, track sum/count  |
| `boolean`   | 1.0 if score ≥ 1.0, 0.0 otherwise                         |
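How each type consumes a score can be sketched without the Prometheus client, using plain maps as stand-ins for the real collectors. This is a behavioral illustration of the table, not the `metrics.Collector` implementation.

```go
package main

import "fmt"

// fakeMetrics substitutes maps for prometheus collectors so the
// per-type behavior is visible in isolation.
type fakeMetrics struct {
	gauges   map[string]float64
	counters map[string]float64
	histSum  map[string]float64
	histCnt  map[string]int
}

func (m *fakeMetrics) recordScore(name, metricType string, score float64) {
	switch metricType {
	case "gauge":
		m.gauges[name] = score // set to the score value
	case "counter":
		m.counters[name]++ // increment once per execution
	case "histogram":
		m.histSum[name] += score // observe: track sum and count
		m.histCnt[name]++
	case "boolean":
		v := 0.0
		if score >= 1.0 {
			v = 1.0
		}
		m.gauges[name] = v // collapse the score to 0/1
	}
}

func main() {
	m := &fakeMetrics{map[string]float64{}, map[string]float64{}, map[string]float64{}, map[string]int{}}
	m.recordScore("q", "boolean", 0.8)
	fmt.Println(m.gauges["q"]) // 0
}
```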

Pack-author labels are declared in the metric.labels field of each eval definition:

```json
{
  "id": "response_quality",
  "type": "llm_judge",
  "trigger": "every_turn",
  "metric": {
    "name": "response_quality_score",
    "type": "histogram",
    "range": { "min": 0, "max": 1 },
    "labels": {
      "eval_type": "llm_judge",
      "category": "quality"
    }
  },
  "params": {
    "criteria": "Rate the quality of the response"
  }
}
```

Const labels are set via CollectorOpts.ConstLabels — process-level dimensions (env, region) baked into the metric descriptor.

Instance labels are set via CollectorOpts.InstanceLabels and bound per-conversation — conversation-level dimensions (tenant, prompt_name).

Label names must match Prometheus naming rules (^[a-zA-Z_][a-zA-Z0-9_]*$) and must not start with __ (reserved by Prometheus). Invalid label names are caught during pack validation.
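That validation rule is simple enough to state as code. The function name is illustrative; the regex and the reserved-prefix rule come directly from the paragraph above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelNameRE is the Prometheus label-name syntax quoted above.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// validLabelName accepts Prometheus-syntax names but rejects the
// reserved "__" prefix.
func validLabelName(name string) bool {
	return labelNameRE.MatchString(name) && !strings.HasPrefix(name, "__")
}

func main() {
	fmt.Println(validLabelName("eval_type")) // true
	fmt.Println(validLabelName("__name__"))  // false: reserved prefix
	fmt.Println(validLabelName("9lives"))    // false: must not start with a digit
}
```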

Eval results emit events through the EventBus:

| Event            | Constant             | When                                              |
|------------------|----------------------|---------------------------------------------------|
| `eval.completed` | `EventEvalCompleted` | Eval finished successfully (regardless of score)  |
| `eval.failed`    | `EventEvalFailed`    | Eval handler returned an error                    |

The eval.completed event carries an EvalCompletedData payload with the eval ID, type, score, and derived Passed field (IsPassed() — true when score is nil or ≥ 1.0). The eval.failed event indicates an infrastructure error (the handler itself errored), not a low score.
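The derivation of `Passed` described above fits in one line; this sketch assumes a nilable score pointer, which is how "score is nil" is naturally modeled in Go.

```go
package main

import "fmt"

// isPassed mirrors the documented IsPassed() rule: a nil score counts
// as passed, as does any score >= 1.0.
func isPassed(score *float64) bool {
	return score == nil || *score >= 1.0
}

func main() {
	s := 0.7
	fmt.Println(isPassed(nil), isPassed(&s)) // true false
}
```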

When both pack-level and prompt-level evals are defined, they are merged:

  1. Prompt evals override pack evals where IDs match
  2. Pack-only evals are preserved
  3. Prompt-only evals are appended

This allows packs to define baseline evals while individual prompts customize or extend them.
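The three merge rules can be sketched as a single pass over both lists. The `evalDef` stub and function name are illustrative; the ordering (pack order first, prompt-only appended) follows the numbered rules above.

```go
package main

import "fmt"

// evalDef is a stub; only the ID matters for merging.
type evalDef struct {
	ID   string
	Type string
}

// mergeEvals applies the three rules: prompt evals replace pack evals
// with the same ID, unmatched pack evals are kept in place, and
// prompt-only evals are appended at the end.
func mergeEvals(pack, prompt []evalDef) []evalDef {
	overrides := make(map[string]evalDef, len(prompt))
	for _, e := range prompt {
		overrides[e.ID] = e
	}
	var merged []evalDef
	seen := map[string]bool{}
	for _, e := range pack {
		if o, ok := overrides[e.ID]; ok {
			merged = append(merged, o) // rule 1: prompt override wins
		} else {
			merged = append(merged, e) // rule 2: pack-only eval preserved
		}
		seen[e.ID] = true
	}
	for _, e := range prompt {
		if !seen[e.ID] {
			merged = append(merged, e) // rule 3: prompt-only eval appended
		}
	}
	return merged
}

func main() {
	pack := []evalDef{{"a", "contains"}, {"b", "regex"}}
	prompt := []evalDef{{"b", "llm_judge"}, {"c", "contains"}}
	fmt.Println(mergeEvals(pack, prompt)) // [{a contains} {b llm_judge} {c contains}]
}
```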

See the eval-test example for a working Arena configuration that evaluates saved conversations with both deterministic assertions and LLM judge evals.