
# Metrics Reference

Complete reference for all Prometheus metrics emitted by PromptKit’s unified metrics.Collector.

PromptKit emits two categories of metrics through a single metrics.Collector:

  • Pipeline metrics — operational metrics recorded automatically from EventBus events (provider calls, tool calls, pipeline duration, validation checks)
  • Eval metrics — quality metrics recorded from pack-defined EvalDef.Metric definitions

All metrics share a common label structure:

```
{namespace}_{metric_name}{const_labels, instance_labels, event_labels}
```

Eval metrics use a separate sub-namespace to distinguish them from pipeline metrics:

```
{namespace}_eval_{metric_name}{const_labels, instance_labels, pack_labels}
```

Where:

  • Namespace — configurable prefix (default: promptkit)
  • Const labels — process-level labels baked into the metric descriptor (env, region)
  • Instance labels — per-conversation labels bound via Bind() (tenant, prompt_name)
  • Event labels — per-observation labels specific to each metric (listed in the tables below)
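Putting these pieces together, a scrape line for the provider request counter might look like the following. The label values (`prod`, `acme`, `openai`, `gpt-4o`) are illustrative, not taken from PromptKit itself:

```
promptkit_provider_requests_total{env="prod",tenant="acme",provider="openai",model="gpt-4o",status="success"} 42
```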

Pipeline metrics are registered at Collector creation time and recorded automatically when `MetricContext.OnEvent` is wired to the EventBus. Disable them with `DisablePipelineMetrics: true` or use `NewEvalOnlyCollector()`.

| Metric | Type | Event Labels | Description |
|---|---|---|---|
| `{ns}_pipeline_duration_seconds` | Histogram | `status` | Total pipeline execution duration |

status values: success, error

Buckets: 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120 seconds

Source events: pipeline.completed, pipeline.failed

| Metric | Type | Event Labels | Description |
|---|---|---|---|
| `{ns}_provider_request_duration_seconds` | Histogram | `provider`, `model` | LLM API call duration |
| `{ns}_provider_requests_total` | Counter | `provider`, `model`, `status` | Total provider API calls |
| `{ns}_provider_input_tokens_total` | Counter | `provider`, `model` | Input tokens sent to provider |
| `{ns}_provider_output_tokens_total` | Counter | `provider`, `model` | Output tokens received from provider |
| `{ns}_provider_cached_tokens_total` | Counter | `provider`, `model` | Cached tokens in provider calls |
| `{ns}_provider_cost_total` | Counter | `provider`, `model` | Total cost in USD |

status values: success, error

Duration buckets: 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60 seconds

Source events: provider.call.completed, provider.call.failed

Notes:

  • Token metrics are only incremented when the count is > 0
  • Cost is only incremented when > 0
  • Duration is recorded on both success and failure

| Metric | Type | Event Labels | Description |
|---|---|---|---|
| `{ns}_tool_call_duration_seconds` | Histogram | `tool` | Tool call execution duration |
| `{ns}_tool_calls_total` | Counter | `tool`, `status` | Total tool call count |

status values: success, error

Buckets: 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds

Source events: tool.call.completed, tool.call.failed

| Metric | Type | Event Labels | Description |
|---|---|---|---|
| `{ns}_validation_duration_seconds` | Histogram | `validator`, `validator_type` | Validation check duration |
| `{ns}_validations_total` | Counter | `validator`, `validator_type`, `status` | Validation results |

status values: passed, failed

Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1 seconds

Source events: validation.passed, validation.failed

Eval metrics are registered dynamically on first observation (with double-checked locking for thread safety). Disable with DisableEvalMetrics: true.
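The double-checked locking mentioned above can be sketched as follows. This is a generic illustration of the pattern, not PromptKit's actual internals: a read lock covers the fast path, and the existence check is repeated under the write lock so that concurrent first observations register a metric exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// metric stands in for a registered Prometheus collector (illustrative).
type metric struct{ name string }

// registry sketches dynamic, thread-safe metric registration.
type registry struct {
	mu      sync.RWMutex
	metrics map[string]*metric
}

func (r *registry) getOrRegister(name string) *metric {
	// Fast path: metric already registered, only a read lock needed.
	r.mu.RLock()
	m, ok := r.metrics[name]
	r.mu.RUnlock()
	if ok {
		return m
	}

	// Slow path: take the write lock and re-check, since another
	// goroutine may have registered the metric in the meantime.
	r.mu.Lock()
	defer r.mu.Unlock()
	if m, ok := r.metrics[name]; ok {
		return m
	}
	m = &metric{name: name}
	r.metrics[name] = m
	return m
}

func main() {
	r := &registry{metrics: map[string]*metric{}}
	a := r.getOrRegister("response_quality")
	b := r.getOrRegister("response_quality")
	fmt.Println(a == b) // true: both calls return the same registration
}
```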

Every eval that runs produces a Prometheus metric. If the eval definition includes an explicit `metric` field, that definition is used. If no `metric` field is present, a default gauge metric is auto-generated using the eval ID as the metric name (e.g., eval `response_quality` becomes `{ns}_eval_response_quality`). This ensures pack authors don't need to opt in to metrics — every eval result is observable by default.

```json
{
  "id": "response_quality",
  "type": "llm_judge",
  "trigger": "every_turn",
  "metric": {
    "name": "response_quality_score",
    "type": "gauge",
    "labels": {
      "eval_type": "llm_judge",
      "category": "quality"
    }
  }
}
```
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Prometheus metric name (auto-prefixed with `{namespace}_eval_` if not already) |
| `type` | Yes | One of `gauge`, `counter`, `histogram`, `boolean` |
| `range` | No | Value range hint (min, max) — used for documentation, not enforced |
| `labels` | No | Static labels added to this metric (pack-author defined) |
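The auto-prefixing rule for `name` might be implemented roughly like this. The helper name `qualifyEvalMetricName` is ours for illustration, not an actual PromptKit function:

```go
package main

import (
	"fmt"
	"strings"
)

// qualifyEvalMetricName prepends "{namespace}_eval_" to a metric name
// unless the name already carries that prefix (illustrative sketch).
func qualifyEvalMetricName(namespace, name string) string {
	prefix := namespace + "_eval_"
	if strings.HasPrefix(name, prefix) {
		return name
	}
	return prefix + name
}

func main() {
	fmt.Println(qualifyEvalMetricName("myapp", "response_quality_score"))
	// myapp_eval_response_quality_score
	fmt.Println(qualifyEvalMetricName("myapp", "myapp_eval_response_quality_score"))
	// myapp_eval_response_quality_score (unchanged: already prefixed)
}
```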

| Type | Prometheus Type | Behavior | Typical Use |
|---|---|---|---|
| `gauge` | Gauge | Set to the eval's score value | Relevance scores, quality ratings |
| `counter` | Counter | Increment by 1 on each eval execution | Execution counts |
| `histogram` | Histogram | Observe the score value | Score distributions |
| `boolean` | Gauge | Set to 1.0 (pass) or 0.0 (fail) | Binary checks (JSON valid, contains keyword) |

Histogram buckets: Prometheus default buckets (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)

The full label set for an eval metric is: instance labels (sorted) + pack-author labels (sorted by key).

For example, with InstanceLabels: ["tenant"] and metric.labels: {"category": "quality", "eval_type": "llm_judge"}:

```
myapp_eval_response_quality_score{tenant="acme",category="quality",eval_type="llm_judge"} 0.85
```

The score value recorded for gauge and histogram types is extracted by ExtractValue as follows:

  1. If EvalResult.MetricValue is non-nil, use *MetricValue
  2. Otherwise, if EvalResult.Score is non-nil, use *Score
  3. Otherwise, default to 0.0

For boolean metrics, the value is 1.0 if Score >= 1.0, otherwise 0.0.
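The value-extraction precedence above can be sketched in a few lines. The `EvalResult` struct here is a simplified stand-in carrying only the two fields the rules reference; the real PromptKit struct may have more:

```go
package main

import "fmt"

// EvalResult is a simplified stand-in for PromptKit's eval result,
// reduced to the two fields the extraction rules use.
type EvalResult struct {
	Score       *float64
	MetricValue *float64
}

// extractValue applies the precedence: MetricValue wins over Score,
// and both default to 0.0 when absent.
func extractValue(r EvalResult) float64 {
	if r.MetricValue != nil {
		return *r.MetricValue
	}
	if r.Score != nil {
		return *r.Score
	}
	return 0.0
}

// boolValue applies the boolean-metric rule: 1.0 when Score >= 1.0,
// otherwise 0.0. (Treating a nil Score as a fail is our assumption.)
func boolValue(r EvalResult) float64 {
	if r.Score != nil && *r.Score >= 1.0 {
		return 1.0
	}
	return 0.0
}

func main() {
	score, mv := 0.85, 0.9
	fmt.Println(extractValue(EvalResult{Score: &score}))                   // 0.85
	fmt.Println(extractValue(EvalResult{Score: &score, MetricValue: &mv})) // 0.9
	fmt.Println(extractValue(EvalResult{}))                                // 0
	fmt.Println(boolValue(EvalResult{Score: &score}))                      // 0
}
```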

Labels come from three sources, applied in order:

| Level | Set At | Scope | Examples |
|---|---|---|---|
| Const labels | `CollectorOpts.ConstLabels` | Process-wide, baked into descriptor | `env`, `region`, `service_name` |
| Instance labels | `Bind(map)` / `MetricsInstanceLabels` | Per-conversation or per-invocation | `tenant`, `prompt_name` |
| Event labels | Per-observation (automatic) | Per-metric-observation | `provider`, `model`, `status`, `tool` |

InstanceLabels are sorted alphabetically when the Collector is created. When calling Bind(), the map key order doesn’t matter — values are looked up by key name, not position.

```go
// These produce identical results:
collector.Bind(map[string]string{"z_tenant": "acme", "a_prompt": "chat"})
collector.Bind(map[string]string{"a_prompt": "chat", "z_tenant": "acme"})
```

```go
reg := prometheus.NewRegistry()
collector := metrics.NewCollector(metrics.CollectorOpts{
    Registerer:     reg,
    Namespace:      "myapp",
    ConstLabels:    prometheus.Labels{"env": "prod"},
    InstanceLabels: []string{"tenant"},
})

conv, _ := sdk.Open("./app.pack.json", "chat",
    sdk.WithMetrics(collector, map[string]string{"tenant": "acme"}),
)
```

This collector records both pipeline and eval metrics automatically.

```go
collector := metrics.NewEvalOnlyCollector(metrics.CollectorOpts{
    Registerer:     reg,
    Namespace:      "myapp",
    InstanceLabels: []string{"tenant"},
})

results, _ := sdk.Evaluate(ctx, sdk.EvaluateOpts{
    PackPath:              "./app.pack.json",
    Messages:              messages,
    MetricsCollector:      collector,
    MetricsInstanceLabels: map[string]string{"tenant": "acme"},
})
```

| Field | Type | Default | Description |
|---|---|---|---|
| `Registerer` | `prometheus.Registerer` | `DefaultRegisterer` | Registry to register into |
| `Namespace` | `string` | `"promptkit"` | Metric name prefix |
| `ConstLabels` | `prometheus.Labels` | `nil` | Process-level constant labels |
| `InstanceLabels` | `[]string` | `nil` | Label names that vary per conversation (sorted internally) |
| `DisablePipelineMetrics` | `bool` | `false` | Skip pipeline metric registration |
| `DisableEvalMetrics` | `bool` | `false` | Skip eval metric recording |

| Function | Description |
|---|---|
| `NewCollector(opts)` | Full collector — pipeline + eval metrics |
| `NewEvalOnlyCollector(opts)` | Eval metrics only (`DisablePipelineMetrics: true`) |

For quick reference, here is every metric name emitted with the default promptkit namespace:

| Metric Name | Type | Category |
|---|---|---|
| `promptkit_pipeline_duration_seconds` | Histogram | Pipeline |
| `promptkit_provider_request_duration_seconds` | Histogram | Provider |
| `promptkit_provider_requests_total` | Counter | Provider |
| `promptkit_provider_input_tokens_total` | Counter | Provider |
| `promptkit_provider_output_tokens_total` | Counter | Provider |
| `promptkit_provider_cached_tokens_total` | Counter | Provider |
| `promptkit_provider_cost_total` | Counter | Provider |
| `promptkit_tool_call_duration_seconds` | Histogram | Tool |
| `promptkit_tool_calls_total` | Counter | Tool |
| `promptkit_validation_duration_seconds` | Histogram | Validation |
| `promptkit_validations_total` | Counter | Validation |
| `{ns}_eval_{metric_name}` | Varies | Eval (explicit pack-defined metric) |
| `{ns}_eval_{eval_id}` | Gauge | Eval (auto-generated when no metric defined) |

PromptKit metrics and traces are correlated through the session ID. The session ID (a UUID) appears as:

  • Metrics: instance label (e.g., session_id="4e597ba3-92bf-47cf-84f3-29d3ece24456")
  • Traces: gen_ai.conversation.id span attribute
  • Events: Event.SessionID field

The OTel trace ID equals the session ID with dashes removed (e.g., session 4e597ba3-92bf-47cf-84f3-29d3ece24456 → trace ID 4e597ba392bf47cf84f329d3ece24456), so a single session ID query correlates logs, metrics, and traces.
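That derivation is a single transformation, sketched here as a standalone helper (the function name is ours):

```go
package main

import (
	"fmt"
	"strings"
)

// traceIDFromSession derives the OTel trace ID from a PromptKit
// session ID by stripping the UUID dashes.
func traceIDFromSession(sessionID string) string {
	return strings.ReplaceAll(sessionID, "-", "")
}

func main() {
	fmt.Println(traceIDFromSession("4e597ba3-92bf-47cf-84f3-29d3ece24456"))
	// 4e597ba392bf47cf84f329d3ece24456
}
```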

PromptKit’s built-in Collector does not attach Prometheus exemplars to observations. This is intentional — exemplar configuration (trace ID format, label keys, sampling) is an operator concern.

Operators who want exemplar support (e.g., clicking from a Grafana metric panel to a specific trace in Tempo) can subscribe their own listener to the EventBus and record metrics with exemplars:

```go
bus.SubscribeAll(func(event *events.Event) {
    if event.Type != events.EventPipelineCompleted {
        return
    }
    data := event.Data.(*events.PipelineCompletedData)

    // Derive trace ID from session ID (remove dashes).
    traceID := strings.ReplaceAll(event.SessionID, "-", "")

    // Record with exemplar for Grafana → Tempo linking.
    // pipelineDuration is an operator-owned prometheus.HistogramVec,
    // registered separately from PromptKit's Collector.
    hist, _ := pipelineDuration.GetMetricWithLabelValues("success")
    hist.(prometheus.ExemplarObserver).ObserveWithExemplar(
        data.Duration.Seconds(),
        prometheus.Labels{"trace_id": traceID},
    )
})
```

This approach gives operators full control over which metrics carry exemplars and how trace IDs are derived.