Run Evals

Run quality checks against conversation snapshots with sdk.Evaluate().

import (
    "context"
    "fmt"
    "log"

    "github.com/AltairaLabs/PromptKit/sdk"
)

results, err := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:   "./app.pack.json",
    PromptName: "assistant",
    Messages:   messages, // []types.Message
    SessionID:  "session-123",
    TurnIndex:  0,
})
if err != nil {
    log.Fatal(err)
}
for _, r := range results {
    fmt.Printf("%s: score=%v explanation=%s\n", r.EvalID, r.Score, r.Explanation)
}

No live provider connection is needed — just messages in, results out.

Provide eval definitions from one of three sources (checked in order):

From a pack file on disk (PackPath):

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:   "./app.pack.json",
    PromptName: "assistant", // merge prompt-level evals with pack-level
    Messages:   messages,
})

From pack JSON already in memory (PackData):

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackData:   packJSON, // []byte
    PromptName: "assistant",
    Messages:   messages,
})

import "github.com/AltairaLabs/PromptKit/runtime/evals"
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
EvalDefs: []evals.EvalDef{
{
ID: "no_profanity",
Type: "content_excludes",
Trigger: evals.TriggerEveryTurn,
Params: map[string]any{"patterns": []string{"damn", "hell"}},
},
{
ID: "valid_json",
Type: "json_valid",
Trigger: evals.TriggerEveryTurn,
},
},
Messages: messages,
})

Control when evals fire:

| Trigger                  | Constant                            | Description                             |
|--------------------------|-------------------------------------|-----------------------------------------|
| every_turn               | evals.TriggerEveryTurn              | After each assistant response (default) |
| on_session_complete      | evals.TriggerOnSessionComplete      | When the session ends                   |
| on_conversation_complete | evals.TriggerOnConversationComplete | When the conversation ends              |
| sample_turns             | evals.TriggerSampleTurns            | Hash-based turn sampling                |
| sample_sessions          | evals.TriggerSampleSessions         | Hash-based session sampling             |
| on_workflow_step         | evals.TriggerOnWorkflowStep         | After a workflow transition             |
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath: "./app.pack.json",
    Messages: messages,
    Trigger:  evals.TriggerOnSessionComplete,
})
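
For the hash-based sampling triggers, include SessionID (and TurnIndex for turn sampling) so the sampling decision is deterministic for a given session and turn. A minimal sketch:

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:  "./app.pack.json",
    Messages:  messages,
    SessionID: "session-123", // SessionID feeds the hash, keeping sampling deterministic
    TurnIndex: 2,             // 0-based turn index
    Trigger:   evals.TriggerSampleTurns,
})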

For llm_judge and llm_judge_session evals, provide a judge provider:

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:      "./app.pack.json",
    Messages:      messages,
    JudgeProvider: judgeProvider, // pre-built provider instance
})

Evals are automatically classified into well-known groups based on their handler type. When no explicit groups are configured, each eval belongs to default plus a classification group:

| Group        | Constant               | Description                                                     |
|--------------|------------------------|-----------------------------------------------------------------|
| default      | evals.DefaultEvalGroup | All evals with no explicit groups                               |
| fast-running | evals.GroupFastRunning | Deterministic checks (string matching, regex, JSON validation)  |
| long-running | evals.GroupLongRunning | LLM calls, embeddings, network requests                         |
| external     | evals.GroupExternal    | External systems (REST APIs, A2A agents, exec subprocesses)     |

Filter which groups to run with EvalGroups:

// Only run fast, deterministic evals
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:   "./app.pack.json",
    Messages:   messages,
    EvalGroups: []string{evals.GroupFastRunning},
})

Override automatic classification by setting explicit groups on an eval definition:

{
  "id": "custom_check",
  "type": "llm_judge",
  "trigger": "every_turn",
  "groups": ["safety", "compliance"],
  "params": { "criteria": "..." }
}

When explicit groups are set, they fully replace the automatic classification.
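
Explicit groups can then be selected with the same EvalGroups filter. A sketch, reusing the safety group from the definition above:

// Run only evals tagged with the custom "safety" group
results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:   "./app.pack.json",
    Messages:   messages,
    EvalGroups: []string{"safety"},
})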

Record eval results as Prometheus metrics by passing a MetricsCollector — the SDK calls Bind() internally, matching the WithMetrics() pattern from the conversation API:

import (
    "github.com/prometheus/client_golang/prometheus"

    "github.com/AltairaLabs/PromptKit/runtime/metrics"
)

reg := prometheus.NewRegistry()
collector := metrics.NewEvalOnlyCollector(metrics.CollectorOpts{
    Registerer:     reg,
    Namespace:      "myapp",
    InstanceLabels: []string{"tenant"},
})

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:              "./app.pack.json",
    Messages:              messages,
    MetricsCollector:      collector,
    MetricsInstanceLabels: map[string]string{"tenant": "acme"},
})
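
The registry above is a plain Prometheus registry, so the recorded eval metrics can be exposed for scraping with the standard promhttp handler. A minimal sketch (the address and path are illustrative):

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Serve the metrics registered in reg on /metrics.
http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
go func() {
    log.Fatal(http.ListenAndServe(":9090", nil))
}()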

NewEvalOnlyCollector skips pipeline metric registration — use it for standalone eval workers that don’t run a live pipeline. For consumers that also need pipeline metrics, use metrics.NewCollector() instead.

You can also pass a raw MetricRecorder for custom implementations, but MetricsCollector is preferred for new code.

Evals must have a metric definition in the pack to be recorded. See Metrics & Prometheus for metric types and label configuration, and Monitor Events for the full metrics reference.

Use ValidateEvalTypes() as a preflight check to ensure all eval types have registered handlers:

missing, err := sdk.ValidateEvalTypes(sdk.ValidateEvalTypesOpts{
    PackPath:          "./app.pack.json",
    RuntimeConfigPath: "./runtime-config.yaml", // registers exec handlers
})
if err != nil {
    log.Fatal(err)
}
if len(missing) > 0 {
    for _, def := range missing {
        log.Printf("missing handler for eval %q (type: %s)", def.ID, def.Type)
    }
}

This catches configuration errors (typos, missing RuntimeConfig bindings) at startup or in CI before evals are actually executed.

Pass an OpenTelemetry TracerProvider to emit tracing spans for eval execution:

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:       "./app.pack.json",
    Messages:       messages,
    TracerProvider: tp, // trace.TracerProvider
})

Each eval result emits a span named promptkit.eval.{evalID}.
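
Constructing tp is standard OpenTelemetry SDK setup; a minimal sketch using a stdout span exporter (any exporter works):

import (
    "log"

    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

exporter, err := stdouttrace.New()
if err != nil {
    log.Fatal(err)
}
tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
defer tp.Shutdown(ctx) // flush remaining spans before exit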

To observe eval lifecycle events, subscribe on an event bus and pass it in:

bus := events.NewEventBus()
defer bus.Close()

bus.Subscribe(events.EventEvalCompleted, func(e *events.Event) {
    log.Printf("Eval completed: %s", e.Data)
})

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath: "./app.pack.json",
    Messages: messages,
    EventBus: bus,
})

All EvaluateOpts fields:

| Field                 | Type                    | Description                                                               |
|-----------------------|-------------------------|---------------------------------------------------------------------------|
| PackPath              | string                  | Load pack from filesystem                                                 |
| PackData              | []byte                  | Parse pack from JSON bytes                                                |
| EvalDefs              | []evals.EvalDef         | Pre-resolved eval definitions                                             |
| PromptName            | string                  | Select prompt-level evals to merge                                        |
| Messages              | []types.Message         | Conversation history to evaluate                                          |
| SessionID             | string                  | Session ID for sampling determinism                                       |
| TurnIndex             | int                     | Current turn index (0-based)                                              |
| EvalGroups            | []string                | Filter evals by group (default: all)                                      |
| Trigger               | evals.EvalTrigger       | Trigger filter (default: every_turn)                                      |
| JudgeProvider         | any                     | Pre-built LLM judge provider                                              |
| JudgeTargets          | map[string]any          | Provider specs for LLM judge evals                                        |
| TracerProvider        | trace.TracerProvider    | OpenTelemetry tracing                                                     |
| EventBus              | events.Bus              | Event emission                                                            |
| Logger                | *slog.Logger            | Structured logging                                                        |
| RuntimeConfigPath     | string                  | Load exec eval handlers from RuntimeConfig YAML                           |
| MetricsCollector      | *metrics.Collector      | Unified Prometheus collector (preferred; the SDK calls Bind() internally) |
| MetricsInstanceLabels | map[string]string       | Per-invocation label values for MetricsCollector                          |
| MetricRecorder        | evals.MetricRecorder    | Custom metric recorder (use MetricsCollector for new code)                |
| Registry              | *evals.EvalTypeRegistry | Custom handler registry                                                   |
| Timeout               | time.Duration           | Per-eval timeout (default: 30s)                                           |
| SkipSchemaValidation  | bool                    | Skip JSON schema validation                                               |
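
A sketch combining a few of the optional knobs (values are illustrative):

import (
    "log/slog"
    "time"
)

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath: "./app.pack.json",
    Messages: messages,
    Logger:   slog.Default(),   // structured logging
    Timeout:  10 * time.Second, // per-eval timeout; default is 30s
})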

Each result contains:

type EvalResult struct {
    EvalID      string          // Eval identifier
    Type        string          // Handler type
    Score       *float64        // Score (0.0-1.0)
    Explanation string          // Human-readable explanation
    DurationMs  int64           // Execution time in milliseconds
    Error       string          // Error message if the eval errored
    Violations  []EvalViolation // Detailed violations
    Skipped     bool            // Was the eval skipped?
    SkipReason  string          // Why it was skipped
    Passed      bool            // Deprecated: set only by assertion/guardrail wrappers
}

Eval handlers produce scores only. Use result.IsPassed() to derive pass/fail from the score (true when score is nil or ≥ 1.0). The Passed field is deprecated for standalone evals — it is only set explicitly by AssertionEvalHandler and GuardrailEvalHandler wrappers.
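
For example, a minimal pass/fail loop over the results using IsPassed():

for _, r := range results {
    if r.Skipped {
        log.Printf("skipped %s: %s", r.EvalID, r.SkipReason)
        continue
    }
    if !r.IsPassed() {
        log.Printf("eval %s did not pass (score=%v): %s", r.EvalID, r.Score, r.Explanation)
    }
}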