Eval Configuration Test Example
This example demonstrates the new Eval configuration type for evaluating saved conversations.
Overview
The Eval config type allows you to:
- Load saved conversations from recording files
- Specify judge targets for LLM-based assertions
- Define assertions to validate the conversation
- Categorize evaluations with tags
The example contains these files:
- config.arena.yaml: Main arena configuration with eval reference
- evals/basic-eval.eval.yaml: Eval configuration for testing
- providers/replay.provider.yaml: Replay provider for deterministic playback
To validate the example, run:

    # From repo root, validate the eval config
    PROMPTKIT_SCHEMA_SOURCE=local ./bin/promptarena validate examples/eval-test/config.arena.yaml

    # Validate the eval file directly
    PROMPTKIT_SCHEMA_SOURCE=local ./bin/promptarena validate --type eval examples/eval-test/evals/basic-eval.eval.yaml

Note: Use PROMPTKIT_SCHEMA_SOURCE=local during development until schemas are published to the hosted location.
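To check every eval file in a directory rather than one at a time, the direct validation command above could be wrapped in a small shell function. This is only a sketch and not part of PromptKit: the function name `validate_evals` and the `PROMPTARENA_BIN` override are hypothetical, and it assumes eval files follow the `*.eval.yaml` naming used in this example.

```shell
# Hypothetical helper: validate every *.eval.yaml file in a directory.
# PROMPTARENA_BIN is an assumed override for the binary path,
# not a setting that PromptKit itself reads.
validate_evals() {
  dir=$1
  bin=${PROMPTARENA_BIN:-./bin/promptarena}
  rc=0
  for f in "$dir"/*.eval.yaml; do
    # Skip the unexpanded pattern when the directory has no eval files.
    [ -e "$f" ] || continue
    echo "Validating $f"
    # Record a failure but keep going so every file gets checked.
    PROMPTKIT_SCHEMA_SOURCE=local "$bin" validate --type eval "$f" || rc=1
  done
  return "$rc"
}
```

From the repo root it would be called as `validate_evals examples/eval-test/evals`. The function returns non-zero if any file fails validation, so it can gate a CI step.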
Eval Configuration Format
An eval configuration file has this structure:

    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Eval
    metadata:
      name: basic-eval
    spec:
      id: basic-eval-test
      description: Test evaluation of a saved conversation
      recording:
        path: path/to/recording.json
        type: session  # session, arena_output, transcript, or generic
      judge_targets:
        default:
          type: openai
          model: gpt-4o
          id: gpt-4o-judge
      assertions:
        - type: llm_judge
          params:
            judge: default
            criteria: "Your evaluation criteria here"
            expected: pass
      tags:
        - test
        - category
      mode: instant  # instant, realtime, or accelerated

Recording Types
- session: Session recording JSON (.recording.json)
- arena_output: Arena output JSON from previous runs
- transcript: Transcript YAML (.transcript.yaml)
- generic: Generic chat export JSON
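When filling in `recording.type`, the file suffix gives a hint for two of the four types. The sketch below is a hypothetical helper, not a PromptKit command: it maps the two suffixes listed above to their type and reports `unknown` for everything else, because arena_output and generic recordings are both plain JSON and cannot be told apart by name.

```shell
# Hypothetical helper: guess the recording type from a file name.
# Only the two suffixes documented above are recognized.
guess_recording_type() {
  case "$1" in
    *.recording.json)  echo "session" ;;
    *.transcript.yaml) echo "transcript" ;;
    *)                 echo "unknown" ;;
  esac
}
```

For example, `guess_recording_type runs/demo.recording.json` prints `session`. An `unknown` result means the type still has to be chosen by hand.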
Integration with Issue #215
This implementation provides the foundation for:
- Issue #215: Eval config type support ✅
- Issue #216: Recording adapter system (future)
- Issue #217: Replay provider enhancements (future)
Next Steps
Future enhancements will add:
- Recording adapter registry for multiple formats
- Metadata propagation from recordings to judges
- Multimodal content pass-through in replay