Session Recording

Learn how to capture detailed session recordings for debugging, replay, and analysis of Arena test runs.

Session recordings provide:

  • Deep debugging: See exact event sequences and timing
  • Audio reconstruction: Export voice conversations as WAV files
  • Replay capability: Recreate test runs with deterministic providers
  • Performance analysis: Analyze per-event timing and latency
  • Annotation support: Add labels, scores, and comments to recordings

Add the recording configuration to your arena config:

# arena.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: my-arena
spec:
  defaults:
    output:
      dir: out
      formats: [json, html]
      recording:
        enabled: true    # Enable recording
        dir: recordings  # Subdirectory for recordings

Run your tests normally:

promptarena run --scenario my-test

Recordings are saved to out/recordings/ as JSONL files.

Each recording is a JSONL file containing:

  1. Metadata line: Session info, timing, provider details
  2. Event lines: Individual events with timestamps and data

For example:
{"type":"metadata","session_id":"run-123","start_time":"2024-01-15T10:30:00Z","duration":"5.2s",...}
{"type":"event","timestamp":"2024-01-15T10:30:00.100Z","event_type":"conversation.started",...}
{"type":"event","timestamp":"2024-01-15T10:30:00.500Z","event_type":"message.created",...}
{"type":"event","timestamp":"2024-01-15T10:30:01.200Z","event_type":"audio.input",...}

Session recordings capture a comprehensive event stream:

Event Type                 Description
conversation.started       Session initialization with system prompt
message.created            User and assistant messages
audio.input                User audio chunks (voice conversations)
audio.output               Assistant audio chunks (voice responses)
provider.call.started      LLM API call initiation
provider.call.completed    LLM API call completion with tokens/cost
tool.call.started          Tool/function call initiation
tool.call.completed        Tool result with timing
validation.*               Validator execution and results

Recordings can be loaded and inspected programmatically with the recording package:

package main

import (
    "fmt"

    "github.com/AltairaLabs/PromptKit/runtime/recording"
)

func main() {
    // Load recording
    rec, err := recording.Load("out/recordings/run-123.jsonl")
    if err != nil {
        panic(err)
    }

    // Print metadata
    fmt.Printf("Session: %s\n", rec.Metadata.SessionID)
    fmt.Printf("Duration: %v\n", rec.Metadata.Duration)
    fmt.Printf("Events: %d\n", rec.Metadata.EventCount)
    fmt.Printf("Provider: %s\n", rec.Metadata.ProviderName)
    fmt.Printf("Model: %s\n", rec.Metadata.Model)

    // Iterate events
    for _, event := range rec.Events {
        fmt.Printf("[%v] %s\n", event.Offset, event.Type)
    }
}
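
Per-event timing analysis falls out of the same event list. A minimal sketch (continuing from the program above, with the "time" import added) pairs provider.call.started and provider.call.completed events to report per-call latency; it assumes event.Offset is the event's offset from the start of the session, as printed above, and that event types match the table earlier on this page:

// Pair provider call start/completion events in order of appearance
// and report the latency of each LLM call.
var pendingStarts []time.Duration
for _, event := range rec.Events {
    switch event.Type {
    case "provider.call.started":
        pendingStarts = append(pendingStarts, event.Offset)
    case "provider.call.completed":
        if len(pendingStarts) == 0 {
            continue // completion without a recorded start; skip it
        }
        start := pendingStarts[0]
        pendingStarts = pendingStarts[1:]
        fmt.Printf("provider call latency: %v\n", event.Offset-start)
    }
}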

For voice conversations, extract audio tracks as WAV files:

// Create replay player
player, err := recording.NewReplayPlayer(rec)
if err != nil {
    panic(err)
}
timeline := player.Timeline()

// Export user audio
if timeline.HasTrack(events.TrackAudioInput) {
    err := timeline.ExportAudioToWAV(events.TrackAudioInput, "user_audio.wav")
    if err != nil {
        fmt.Printf("Export failed: %v\n", err)
    }
}

// Export assistant audio
if timeline.HasTrack(events.TrackAudioOutput) {
    err := timeline.ExportAudioToWAV(events.TrackAudioOutput, "assistant_audio.wav")
    if err != nil {
        fmt.Printf("Export failed: %v\n", err)
    }
}

Use the ReplayPlayer for time-synchronized event access:

player, _ := recording.NewReplayPlayer(rec)

// Seek to specific position
player.Seek(2 * time.Second)

// Get state at current position
state := player.GetState()
fmt.Printf("Position: %s\n", player.FormatPosition())
fmt.Printf("Current events: %d\n", len(state.CurrentEvents))
fmt.Printf("Messages so far: %d\n", len(state.Messages))
fmt.Printf("Audio input active: %v\n", state.AudioInputActive)
fmt.Printf("Audio output active: %v\n", state.AudioOutputActive)

// Advance through the recording in 100ms steps, processing the events
// surfaced by each step before checking whether the end has been reached.
for {
    stepEvents := player.Advance(100 * time.Millisecond)
    for _, e := range stepEvents {
        fmt.Printf("[%v] %s\n", e.Offset, e.Type)
    }
    if player.Position() >= player.Duration() {
        break
    }
}

Attach annotations to recordings for review and analysis:

import "github.com/AltairaLabs/PromptKit/runtime/annotations"
// Create annotations
anns := []*annotations.Annotation{
{
ID: "quality-1",
Type: annotations.TypeScore,
SessionID: rec.Metadata.SessionID,
Target: annotations.ForSession(),
Key: "overall_quality",
Value: annotations.NewScoreValue(0.92),
},
{
ID: "highlight-1",
Type: annotations.TypeComment,
SessionID: rec.Metadata.SessionID,
Target: annotations.InTimeRange(startTime, endTime),
Key: "observation",
Value: annotations.NewCommentValue("Good response latency"),
},
}
// Attach to player
player.SetAnnotations(anns)
// Query active annotations at any position
state := player.GetStateAt(1500 * time.Millisecond)
for _, ann := range state.ActiveAnnotations {
fmt.Printf("Annotation: %s = %v\n", ann.Key, ann.Value)
}

Use recordings for deterministic test replay:

import "github.com/AltairaLabs/PromptKit/runtime/providers/replay"
// Create replay provider from recording
provider, err := replay.NewProviderFromRecording(rec)
if err != nil {
panic(err)
}
// Use like any other provider - returns recorded responses
response, err := provider.Complete(ctx, messages, opts)

This enables:

  • Regression testing without API calls
  • Reproducing exact conversation flows
  • Testing against known-good responses
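
For example, a regression test can load a checked-in recording, build a replay provider, and exercise the pipeline without any network calls. A minimal sketch, assuming the recording and replay packages imported above plus the standard testing package; the testdata path and the invariant checked here are illustrative:

func TestReplayKnownGoodSession(t *testing.T) {
    // Load a recording checked into the repo as a golden fixture.
    rec, err := recording.Load("testdata/run-123.jsonl")
    if err != nil {
        t.Fatalf("load recording: %v", err)
    }

    // Build a provider that serves recorded responses instead of calling a live API.
    provider, err := replay.NewProviderFromRecording(rec)
    if err != nil {
        t.Fatalf("create replay provider: %v", err)
    }

    // Sanity-check the session shape: every provider call that started
    // should also have completed in the golden recording.
    started, completed := 0, 0
    for _, e := range rec.Events {
        switch e.Type {
        case "provider.call.started":
            started++
        case "provider.call.completed":
            completed++
        }
    }
    if started != completed {
        t.Errorf("unbalanced provider calls: %d started, %d completed", started, completed)
    }

    // From here, wire provider into your pipeline exactly as you would a
    // live provider (see the Complete call above) and assert on the output.
    _ = provider
}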

Use the Eval configuration type to validate and test saved conversations with assertions and LLM judges:

# evals/validate-recording.eval.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Eval
metadata:
  name: validate-support-session
spec:
  id: support-validation
  description: Validate customer support conversation quality
  recording:
    path: recordings/session-abc123.recording.json
    type: session
  judge_targets:
    default:
      type: openai
      model: gpt-4o
      id: quality-judge
  assertions:
    - type: llm_judge
      params:
        judge: default
        criteria: "Did the agent provide helpful and accurate information?"
      expected: pass
    - type: llm_judge
      params:
        judge: default
        criteria: "Was the conversation tone professional and empathetic?"
      expected: pass

Reference the eval in your arena configuration:

# arena.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: evaluation-suite
spec:
  providers:
    - file: providers/openai-gpt4o.provider.yaml
  evals:
    - file: evals/validate-recording.eval.yaml

Run evaluations:

promptarena run --config arena.yaml

This workflow enables:

  • Quality assurance: Validate conversation quality with LLM judges
  • Regression testing: Ensure consistency across model updates
  • Batch evaluation: Test multiple recordings with the same criteria
  • CI/CD integration: Automated conversation quality checks

See the Eval Configuration Reference for complete documentation.

See examples/session-replay/ for a full working example:

cd examples/session-replay
go run demo/replay_example.go
# Or with your own recording:
go run demo/replay_example.go path/to/recording.jsonl

The example demonstrates:

  • Loading recordings
  • Creating a replay player
  • Simulating playback with event correlation
  • Exporting audio tracks
  • Working with annotations

Recordings are written to the configured output directory, one JSONL file per test run:

out/
└── recordings/
    ├── run-abc123.jsonl    # One file per test run
    ├── run-def456.jsonl
    └── run-ghi789.jsonl

Keep in mind:

  • Size: Recordings with audio can be large (audio data is base64-encoded)
  • Retention: Consider cleanup policies for old recordings (see the sketch below)
  • Compression: JSONL files compress well with gzip
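
As a starting point for a retention policy, here is a minimal cleanup sketch; the out/recordings path and the 30-day window are assumptions to adjust for your setup:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "time"
)

func main() {
    const retention = 30 * 24 * time.Hour
    cutoff := time.Now().Add(-retention)

    entries, err := os.ReadDir("out/recordings")
    if err != nil {
        panic(err)
    }
    for _, entry := range entries {
        // Only consider recording files, not subdirectories or other output.
        if entry.IsDir() || filepath.Ext(entry.Name()) != ".jsonl" {
            continue
        }
        info, err := entry.Info()
        if err != nil {
            panic(err)
        }
        if info.ModTime().Before(cutoff) {
            path := filepath.Join("out/recordings", entry.Name())
            fmt.Println("removing", path)
            if err := os.Remove(path); err != nil {
                panic(err)
            }
        }
    }
}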

Save recordings as artifacts for debugging failed tests:

# .github/workflows/test.yml
- name: Run Arena Tests
  run: promptarena run --ci
- name: Upload Recordings
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: session-recordings
    path: out/recordings/
    retention-days: 7

Each recording captures:

  • Complete event stream with precise timestamps
  • Audio chunks for voice conversations
  • Message content and metadata
  • Tool calls with arguments and results
  • Provider call timing and token usage
  • Validation results

Not included in the recording file:

  • Post-run human feedback (UserFeedback)
  • Session tags added after completion
  • External annotations (stored in separate files)