Eval Framework

Understanding PromptKit’s automated evaluation system for LLM outputs.

Evals are automated quality checks that run against LLM outputs. They answer questions like “Did the assistant stay on topic?”, “Was the JSON valid?”, or “Did it call the right tools?”. Evals are defined in pack files and execute automatically during conversations or against recorded sessions.

Evals use the same check types as assertions and guardrails. The difference is when and where they run: evals can fire in production on every turn, on a sampled subset, or at session close, whereas assertions only run during Arena tests and guardrails run inline before the response is delivered.

Eval handlers produce scores only (0.0–1.0). They never determine pass/fail — that responsibility belongs to assertion and guardrail wrappers. When used as standalone evals, the score is recorded as a metric and emitted as an event.

Pack File (evals) ──► EvalRunner ──► ResultWriter ──► Metrics / Metadata

PromptKit offers two complementary evaluation mechanisms that share the same underlying check types:

|                 | Pack Evals                                         | Scenario Assertions               |
|-----------------|----------------------------------------------------|-----------------------------------|
| **Defined in**  | Pack file (`evals` array)                          | Arena scenario YAML               |
| **Scope**       | Any conversation using the pack                    | Specific test scenarios           |
| **When**        | Production + testing                               | Testing only                      |
| **Check types** | Any check from the unified catalog                 | Any check from the unified catalog |
| **Trigger**     | Configurable (every turn, sampling, session close) | Every turn / conversation end     |

Pack evals travel with your pack — they run in production, in Arena tests, and anywhere the pack is used. Think of them as built-in quality monitors.

Scenario assertions are Arena-specific test expectations. They validate specific conversation flows defined in your test scenarios.

Both can coexist: pack evals provide baseline quality monitoring while scenario assertions verify specific behaviors. See Unified Check Model for how evals, assertions, and guardrails relate.

Each eval is an EvalDef object in the pack’s evals array. The structure combines a check type with trigger, sampling, threshold, and metric configuration:

```json
{
  "id": "quality_check",
  "type": "contains",
  "trigger": "every_turn",
  "params": { "patterns": ["thank you"] },
  "threshold": { "min_score": 0.8 },
  "enabled": true,
  "sample_percentage": 10,
  "metric": {
    "name": "response_quality",
    "type": "gauge",
    "labels": { "category": "tone" }
  }
}
```
| Field               | Required | Description                                                    |
|---------------------|----------|----------------------------------------------------------------|
| `id`                | Yes      | Unique identifier for the eval within the pack                 |
| `type`              | Yes      | Check type from the Checks Reference                           |
| `trigger`           | Yes      | When the eval fires (see Triggers)                             |
| `params`            | Varies   | Parameters specific to the check type                          |
| `threshold`         | No       | Pass/fail threshold (e.g. `min_score`)                         |
| `enabled`           | No       | Whether the eval is active (default: `true`)                   |
| `sample_percentage` | No       | Percentage of turns/sessions to evaluate (for sampling triggers) |
| `groups`            | No       | Eval groups for filtering (see Eval Groups)                    |
| `metric`            | No       | Prometheus metric configuration (see Metrics & Prometheus)     |

Each eval specifies when it should fire:

| Trigger                    | Description                            | Use Case                 |
|----------------------------|----------------------------------------|--------------------------|
| `every_turn`               | After each assistant response          | Real-time quality checks |
| `on_session_complete`      | When session closes                    | Summary evaluations      |
| `sample_turns`             | Percentage of turns (hash-based)       | Production sampling      |
| `sample_sessions`          | Percentage of sessions (hash-based)    | Production sampling      |
| `on_conversation_complete` | When multi-session conversation closes | Final evaluation         |
| `on_workflow_step`         | After a workflow state transition      | Workflow validation      |

Sampling is deterministic — the same session ID and turn index always produce the same sampling decision (FNV-1a hash). This ensures reproducible behavior across runs.
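The deterministic decision can be sketched in a few lines. This is an illustrative reconstruction, not the SDK's actual code — in particular, the exact hash-key layout (`sessionID:turnIndex`) is an assumption; only the FNV-1a hash and the reproducibility property come from the text above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic sampling decision: the same
// sessionID and turnIndex always hash to the same bucket, so repeated
// runs reach identical decisions. The key format is an assumption.
func shouldSample(sessionID string, turnIndex int, samplePercentage uint64) bool {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s:%d", sessionID, turnIndex)
	return h.Sum64()%100 < samplePercentage
}

func main() {
	a := shouldSample("sess-42", 3, 10)
	b := shouldSample("sess-42", 3, 10)
	fmt.Println(a == b) // always true: the decision is reproducible
}
```

Because the decision depends only on the inputs, replaying a recorded session produces exactly the same set of sampled turns as the live run did.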

```json
{
  "id": "toxicity_check",
  "type": "contains",
  "trigger": "sample_turns",
  "sample_percentage": 10,
  "params": {
    "patterns": ["harmful", "offensive"]
  }
}
```

Evals can belong to one or more groups, enabling selective execution. When no explicit groups are set, evals are automatically classified based on their handler type:

| Group        | Value          | Assigned To                                                                                                        |
|--------------|----------------|--------------------------------------------------------------------------------------------------------------------|
| Default      | `default`      | All evals with no explicit groups                                                                                  |
| Fast-running | `fast-running` | Deterministic checks: `contains`, `regex`, `json_valid`, `tools_called`, workflow checks, etc.                     |
| Long-running | `long-running` | Compute/network-intensive: `llm_judge`, `cosine_similarity`, `outcome_equivalent`, `a2a_eval`, `rest_eval`, exec handlers |
| External     | `external`     | External system calls: `llm_judge`, `a2a_eval`, `rest_eval`, exec handlers                                         |

Evals with no explicit groups field receive default plus one or more well-known groups. For example, a contains eval gets ["default", "fast-running"], while an llm_judge eval gets ["default", "long-running", "external"].
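The classification rule can be sketched as a lookup, with explicit groups short-circuiting it. The membership sets below are taken from the table above; treating `exec` as a single handler type is an assumption.

```go
package main

import "fmt"

// classifyGroups assigns well-known groups based on the check type,
// unless explicit groups are given, which override auto-classification.
func classifyGroups(evalType string, explicit []string) []string {
	if len(explicit) > 0 {
		return explicit // explicit groups replace the automatic set
	}
	longRunning := map[string]bool{
		"llm_judge": true, "cosine_similarity": true,
		"outcome_equivalent": true, "a2a_eval": true,
		"rest_eval": true, "exec": true,
	}
	external := map[string]bool{
		"llm_judge": true, "a2a_eval": true, "rest_eval": true, "exec": true,
	}
	groups := []string{"default"}
	if longRunning[evalType] {
		groups = append(groups, "long-running")
	} else {
		groups = append(groups, "fast-running")
	}
	if external[evalType] {
		groups = append(groups, "external")
	}
	return groups
}

func main() {
	fmt.Println(classifyGroups("contains", nil))  // [default fast-running]
	fmt.Println(classifyGroups("llm_judge", nil)) // [default long-running external]
}
```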

Setting groups on an eval definition overrides the automatic classification entirely:

```json
{
  "id": "compliance_check",
  "type": "llm_judge",
  "trigger": "every_turn",
  "groups": ["compliance", "safety"],
  "params": { "criteria": "Check regulatory compliance" }
}
```

This eval will only match when filtering for compliance or safety — it will no longer match default, long-running, or external.

In the SDK, use EvalGroups to select which groups to run:

```go
// Only run fast evals in the hot path
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
	PackPath:   "./app.pack.json",
	Messages:   messages,
	EvalGroups: []string{"fast-running"},
})
```

When EvalGroups is nil or empty, all evals run regardless of group.
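The matching semantics that behavior implies can be sketched as a set-overlap check. This is an assumed reconstruction ("any overlap matches"), not the SDK's actual implementation.

```go
package main

import "fmt"

// matchesGroups reports whether an eval should run for a given filter:
// a nil/empty filter matches everything; otherwise any overlap between
// the eval's groups and the filter counts as a match.
func matchesGroups(evalGroups, filter []string) bool {
	if len(filter) == 0 {
		return true // nil or empty filter runs all evals
	}
	want := make(map[string]bool, len(filter))
	for _, g := range filter {
		want[g] = true
	}
	for _, g := range evalGroups {
		if want[g] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchesGroups([]string{"default", "fast-running"}, []string{"fast-running"})) // true
	fmt.Println(matchesGroups([]string{"compliance", "safety"}, []string{"fast-running"}))    // false
}
```

Note how the second call reflects the override example above: once explicit groups are set, the eval no longer matches `fast-running` or `default`.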

The eval system supports three dispatch patterns for different deployment scenarios:

In-process dispatch (`InProcDispatcher`) runs evals synchronously in the same process. Used by Arena and simple SDK deployments.

Conversation ──► InProcDispatcher ──► EvalRunner ──► Handlers ──► ResultWriter

Event-based dispatch (`EventDispatcher`) publishes eval requests to an event bus for asynchronous processing by workers. Used in production SDK deployments.

Conversation ──► EventDispatcher ──► Event Bus ──► EvalWorker ──► EvalRunner ──► ResultWriter

The event-bus listener (`EventBusEvalListener`) subscribes to EventBus `message.created` events and triggers evals automatically; no explicit middleware is needed.

RecordingStage ──► EventBus ──► EventBusEvalListener ──► SessionAccumulator ──► Dispatcher ──► Runner

The EventBusEvalListener uses a SessionAccumulator that accumulates messages per session and builds EvalContext on demand. Sessions expire after a configurable TTL (default: 30 minutes).

The EvalConversationExecutor evaluates saved conversations from recordings:

  1. Load recording via adapter registry
  2. Build conversation context from recorded messages
  3. Apply turn-level assertions to each assistant message
  4. Evaluate conversation-level assertions
  5. Run pack session evals (if configured)
  6. Return aggregated results

This enables offline evaluation of historical conversations without re-running them against a live LLM.
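Steps 2–3 of the flow above can be illustrated with a toy replay: walk the recorded messages, score each assistant turn, and collect the results. The types and the simplified `contains` scoring here are stand-ins, not the real `EvalConversationExecutor` API.

```go
package main

import (
	"fmt"
	"strings"
)

// recordedMsg is a minimal stand-in for a message loaded from a recording.
type recordedMsg struct {
	Role, Content string
}

// containsScore returns the fraction of patterns found in the content,
// a simplified version of a deterministic "contains" check.
func containsScore(content string, patterns []string) float64 {
	if len(patterns) == 0 {
		return 1.0
	}
	hits := 0
	for _, p := range patterns {
		if strings.Contains(content, p) {
			hits++
		}
	}
	return float64(hits) / float64(len(patterns))
}

// evaluateRecording applies a turn-level check to every recorded
// assistant message, without calling any live LLM.
func evaluateRecording(msgs []recordedMsg, patterns []string) []float64 {
	var scores []float64
	for _, m := range msgs {
		if m.Role != "assistant" {
			continue
		}
		scores = append(scores, containsScore(m.Content, patterns))
	}
	return scores
}

func main() {
	rec := []recordedMsg{
		{"user", "hi"},
		{"assistant", "hello, thank you for asking"},
	}
	fmt.Println(evaluateRecording(rec, []string{"thank you"})) // [1]
}
```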

Eval results can be recorded as Prometheus metrics using the unified metrics.Collector. The same collector records both pipeline operational metrics and eval metrics into a standard prometheus.Registry.

```go
import (
	"github.com/prometheus/client_golang/prometheus"

	"github.com/AltairaLabs/PromptKit/runtime/metrics"
	"github.com/AltairaLabs/PromptKit/sdk"
)

reg := prometheus.NewRegistry()
collector := metrics.NewCollector(metrics.CollectorOpts{
	Registerer:  reg,
	Namespace:   "myapp",
	ConstLabels: prometheus.Labels{"env": "prod"},
})

conv, _ := sdk.Open("./app.pack.json", "chat",
	sdk.WithMetrics(collector, nil),
)
```

When WithMetrics() is configured, all eval results are automatically recorded as Prometheus metrics alongside pipeline metrics. Evals with an explicit metric definition use that configuration; evals without one get an auto-generated gauge metric named after the eval ID. Eval metrics are namespaced under {namespace}_eval_ to distinguish them from pipeline metrics. For example, a metric named response_quality_score with namespace myapp becomes myapp_eval_response_quality_score, and an eval with ID check-tone without an explicit metric becomes myapp_eval_check-tone. See Metrics Reference for the full catalog.
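The naming rule reduces to a small derivation, sketched below. The function name and signature are illustrative; the behavior itself (fall back to the eval ID, then prefix `{namespace}_eval_`) matches the examples above.

```go
package main

import "fmt"

// evalMetricName applies the naming rule: an explicit metric name is
// kept, an eval without one falls back to its eval ID, and either way
// the result is prefixed with {namespace}_eval_.
func evalMetricName(namespace, evalID, metricName string) string {
	name := metricName
	if name == "" {
		name = evalID // auto-generated gauge named after the eval ID
	}
	return namespace + "_eval_" + name
}

func main() {
	fmt.Println(evalMetricName("myapp", "quality_check", "response_quality_score"))
	// myapp_eval_response_quality_score
	fmt.Println(evalMetricName("myapp", "check-tone", ""))
	// myapp_eval_check-tone
}
```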

| Type        | Behavior                                                  |
|-------------|-----------------------------------------------------------|
| `gauge`     | Set to the eval's score value                             |
| `counter`   | Increment count on each execution                         |
| `histogram` | Observe value with configurable buckets, track sum/count  |
| `boolean`   | 1.0 if score ≥ 1.0, 0.0 otherwise                         |
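How each type consumes a score can be sketched without the Prometheus client, using plain maps as stand-ins for the real collectors. This is a behavioral illustration of the table, not the `metrics.Collector` implementation.

```go
package main

import "fmt"

// fakeMetrics substitutes maps for prometheus collectors so the
// per-type behavior is visible in isolation.
type fakeMetrics struct {
	gauges   map[string]float64
	counters map[string]float64
	histSum  map[string]float64
	histCnt  map[string]int
}

func (m *fakeMetrics) recordScore(name, metricType string, score float64) {
	switch metricType {
	case "gauge":
		m.gauges[name] = score // set to the score value
	case "counter":
		m.counters[name]++ // increment once per execution
	case "histogram":
		m.histSum[name] += score // observe: track sum and count
		m.histCnt[name]++
	case "boolean":
		v := 0.0
		if score >= 1.0 {
			v = 1.0
		}
		m.gauges[name] = v // collapse the score to 0/1
	}
}

func main() {
	m := &fakeMetrics{map[string]float64{}, map[string]float64{}, map[string]float64{}, map[string]int{}}
	m.recordScore("q", "boolean", 0.8)
	fmt.Println(m.gauges["q"]) // 0
}
```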

Pack-author labels are declared in the metric.labels field of each eval definition:

```json
{
  "id": "response_quality",
  "type": "llm_judge",
  "trigger": "every_turn",
  "metric": {
    "name": "response_quality_score",
    "type": "histogram",
    "range": { "min": 0, "max": 1 },
    "labels": {
      "eval_type": "llm_judge",
      "category": "quality"
    }
  },
  "params": {
    "criteria": "Rate the quality of the response"
  }
}
```

Const labels are set via CollectorOpts.ConstLabels — process-level dimensions (env, region) baked into the metric descriptor.

Instance labels are set via CollectorOpts.InstanceLabels and bound per-conversation — conversation-level dimensions (tenant, prompt_name).

Label names must match Prometheus naming rules (^[a-zA-Z_][a-zA-Z0-9_]*$) and must not start with __ (reserved by Prometheus). Invalid label names are caught during pack validation.
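That validation rule is simple enough to state as code. The function name is illustrative; the regex and the reserved-prefix rule come directly from the paragraph above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelNameRE is the Prometheus label-name syntax quoted above.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// validLabelName accepts Prometheus-syntax names but rejects the
// reserved "__" prefix.
func validLabelName(name string) bool {
	return labelNameRE.MatchString(name) && !strings.HasPrefix(name, "__")
}

func main() {
	fmt.Println(validLabelName("eval_type")) // true
	fmt.Println(validLabelName("__name__"))  // false: reserved prefix
	fmt.Println(validLabelName("9lives"))    // false: must not start with a digit
}
```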

Eval results emit events through the EventBus:

| Event            | Constant             | When                                              |
|------------------|----------------------|---------------------------------------------------|
| `eval.completed` | `EventEvalCompleted` | Eval finished successfully (regardless of score)  |
| `eval.failed`    | `EventEvalFailed`    | Eval handler returned an error                    |

The eval.completed event carries an EvalCompletedData payload with the eval ID, type, score, and derived Passed field (IsPassed() — true when score is nil or ≥ 1.0). The eval.failed event indicates an infrastructure error (the handler itself errored), not a low score.
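The derivation of `Passed` described above fits in one line; this sketch assumes a nilable score pointer, which is how "score is nil" is naturally modeled in Go.

```go
package main

import "fmt"

// isPassed mirrors the documented IsPassed() rule: a nil score counts
// as passed, as does any score >= 1.0.
func isPassed(score *float64) bool {
	return score == nil || *score >= 1.0
}

func main() {
	s := 0.7
	fmt.Println(isPassed(nil), isPassed(&s)) // true false
}
```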

When both pack-level and prompt-level evals are defined, they are merged:

  1. Prompt evals override pack evals where IDs match
  2. Pack-only evals are preserved
  3. Prompt-only evals are appended

This allows packs to define baseline evals while individual prompts customize or extend them.
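The three merge rules can be sketched as a single pass over both lists. The `evalDef` stub and function name are illustrative; the ordering (pack order first, prompt-only appended) follows the numbered rules above.

```go
package main

import "fmt"

// evalDef is a stub; only the ID matters for merging.
type evalDef struct {
	ID   string
	Type string
}

// mergeEvals applies the three rules: prompt evals replace pack evals
// with the same ID, unmatched pack evals are kept in place, and
// prompt-only evals are appended at the end.
func mergeEvals(pack, prompt []evalDef) []evalDef {
	overrides := make(map[string]evalDef, len(prompt))
	for _, e := range prompt {
		overrides[e.ID] = e
	}
	var merged []evalDef
	seen := map[string]bool{}
	for _, e := range pack {
		if o, ok := overrides[e.ID]; ok {
			merged = append(merged, o) // rule 1: prompt override wins
		} else {
			merged = append(merged, e) // rule 2: pack-only eval preserved
		}
		seen[e.ID] = true
	}
	for _, e := range prompt {
		if !seen[e.ID] {
			merged = append(merged, e) // rule 3: prompt-only eval appended
		}
	}
	return merged
}

func main() {
	pack := []evalDef{{"a", "contains"}, {"b", "regex"}}
	prompt := []evalDef{{"b", "llm_judge"}, {"c", "contains"}}
	fmt.Println(mergeEvals(pack, prompt)) // [{a contains} {b llm_judge} {c contains}]
}
```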

See the eval-test example for a working Arena configuration that evaluates saved conversations with both deterministic assertions and LLM judge evals.