
Session Recording

Learn how to capture detailed session recordings for debugging, replay, and analysis of Arena test runs.

  • Deep debugging: See exact event sequences and timing
  • Audio reconstruction: Export voice conversations as WAV files
  • Replay capability: Recreate test runs with deterministic providers
  • Performance analysis: Analyze per-event timing and latency
  • Annotation support: Add labels, scores, and comments to recordings

Add the recording configuration to your arena config:

# arena.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: my-arena
spec:
  defaults:
    output:
      dir: out
      formats: [json, html]
      recording:
        enabled: true    # Enable recording
        dir: recordings  # Subdirectory for recordings

Run your tests normally:

promptarena run --scenario my-test

Recordings are saved to out/recordings/ as JSONL files.

Each recording is a JSONL file containing:

  1. Metadata line: Session info, timing, provider details
  2. Event lines: Individual events with timestamps and data
{"type":"metadata","session_id":"run-123","start_time":"2024-01-15T10:30:00Z","duration":"5.2s",...}
{"type":"event","timestamp":"2024-01-15T10:30:00.100Z","event_type":"conversation.started",...}
{"type":"event","timestamp":"2024-01-15T10:30:00.500Z","event_type":"message.created",...}
{"type":"event","timestamp":"2024-01-15T10:30:01.200Z","event_type":"audio.input",...}

Session recordings capture a comprehensive event stream covering all runtime activity:

Category     | Event Type                    | Description
-------------|-------------------------------|------------------------------------------------------------
Conversation | conversation.started          | Session initialization with system prompt
Messages     | message.created               | User and assistant messages with multimodal content parts
             | message.updated               | Metadata updates (latency, token counts, cost)
Pipeline     | pipeline.started              | Pipeline execution start with middleware count
             | pipeline.completed            | Pipeline completion with total cost/tokens
             | pipeline.failed               | Pipeline failure with error
Stages       | stage.started                 | Streaming stage start with type info
             | stage.completed               | Stage completion with duration
             | stage.failed                  | Stage failure with error and duration
Middleware   | middleware.started            | Middleware execution start
             | middleware.completed          | Middleware completion with duration
             | middleware.failed             | Middleware failure with error
Provider     | provider.call.started         | LLM API call initiation (provider, model, message count)
             | provider.call.completed       | LLM API call completion (tokens, cost, cached tokens)
             | provider.call.failed          | LLM API call failure with error
Tools        | tool.call.started             | Tool execution start with arguments
             | tool.call.completed           | Tool completion with duration and status
             | tool.call.failed              | Tool failure with error and duration
Client Tools | tool.client.request           | Client-side tool request with consent message and categories
Validation   | validation.started            | Validator execution start
             | validation.passed             | Validation success with duration
             | validation.failed             | Validation failure with violations
Context      | context.built                 | Context window assembly (token count, budget, truncation)
             | context.token_budget_exceeded | Token budget overflow (required vs. budget)
State        | state.loaded                  | Conversation state loaded from store
             | state.saved                   | Conversation state persisted
Streaming    | stream.interrupted            | Stream interruption with reason
Workflow     | workflow.transitioned         | Workflow state transition (from/to states, event, prompt task)
             | workflow.completed            | Workflow reached terminal state (final state, transition count)
Audio        | audio.input                   | User/environment audio chunks
             | audio.output                  | Assistant audio chunks
             | audio.transcription           | Speech-to-text transcription result
Video        | video.frame                   | Video frame capture
Images       | image.input                   | Image input from user/environment
             | image.output                  | Image output from agent
             | screenshot                    | Screenshot capture

The message.created event includes a Parts field ([]ContentPart) for multimodal messages. Each ContentPart has a Type field indicating the content kind:

  • text — Text content with a Text field
  • image — Image content with a Media field containing MIME type, URL, or inline data
  • audio — Audio content with a Media field
  • video — Video content with a Media field

Audio, video, and image events use BinaryPayload for content storage:

  • InlineData — Raw bytes for small payloads
  • StorageReference — Backend-specific storage reference for externalized content
  • MIMEType — Content type (e.g., audio/pcm, image/png)

Audio events include AudioMetadata (sample rate, channels, encoding, duration) and video events include VideoMetadata (width, height, frame rate) for media reconstruction.
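To make the part/payload model concrete, here is a minimal sketch with illustrative struct shapes. These types are invented to mirror the description above, not copied from the PromptKit source, so field names and layout are assumptions:

```go
package main

import "fmt"

// Media is an illustrative stand-in for the Media field described above:
// a MIME type plus either a URL reference or inline data.
type Media struct {
	MIMEType string // e.g. "image/png", "audio/pcm"
	URL      string // remote reference, if any
	Data     []byte // inline data, if any
}

// ContentPart is an illustrative stand-in for the real ContentPart type.
type ContentPart struct {
	Type  string // "text", "image", "audio", or "video"
	Text  string // set when Type == "text"
	Media *Media // set for image/audio/video parts
}

// describe dispatches on the part type, as a consumer of
// message.created events would.
func describe(p ContentPart) string {
	switch p.Type {
	case "text":
		return "text: " + p.Text
	case "image", "audio", "video":
		return p.Type + " (" + p.Media.MIMEType + ")"
	default:
		return "unknown part"
	}
}

func main() {
	parts := []ContentPart{
		{Type: "text", Text: "What is in this picture?"},
		{Type: "image", Media: &Media{MIMEType: "image/png", URL: "https://example.com/cat.png"}},
	}
	for _, p := range parts {
		fmt.Println(describe(p))
	}
}
```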

package main

import (
	"fmt"

	"github.com/AltairaLabs/PromptKit/runtime/recording"
)

func main() {
	// Load recording
	rec, err := recording.Load("out/recordings/run-123.jsonl")
	if err != nil {
		panic(err)
	}

	// Print metadata
	fmt.Printf("Session: %s\n", rec.Metadata.SessionID)
	fmt.Printf("Duration: %v\n", rec.Metadata.Duration)
	fmt.Printf("Events: %d\n", rec.Metadata.EventCount)
	fmt.Printf("Provider: %s\n", rec.Metadata.ProviderName)
	fmt.Printf("Model: %s\n", rec.Metadata.Model)

	// Iterate events
	for _, event := range rec.Events {
		fmt.Printf("[%v] %s\n", event.Offset, event.Type)
	}
}

For voice conversations, extract audio tracks as WAV files:

// Create replay player
player, err := recording.NewReplayPlayer(rec)
if err != nil {
	panic(err)
}
timeline := player.Timeline()

// Export user audio
if timeline.HasTrack(events.TrackAudioInput) {
	if err := timeline.ExportAudioToWAV(events.TrackAudioInput, "user_audio.wav"); err != nil {
		fmt.Printf("Export failed: %v\n", err)
	}
}

// Export assistant audio
if timeline.HasTrack(events.TrackAudioOutput) {
	if err := timeline.ExportAudioToWAV(events.TrackAudioOutput, "assistant_audio.wav"); err != nil {
		fmt.Printf("Export failed: %v\n", err)
	}
}

Use the ReplayPlayer for time-synchronized event access:

player, _ := recording.NewReplayPlayer(rec)

// Seek to specific position
player.Seek(2 * time.Second)

// Get state at current position
state := player.GetState()
fmt.Printf("Position: %s\n", player.FormatPosition())
fmt.Printf("Current events: %d\n", len(state.CurrentEvents))
fmt.Printf("Messages so far: %d\n", len(state.Messages))
fmt.Printf("Audio input active: %v\n", state.AudioInputActive)
fmt.Printf("Audio output active: %v\n", state.AudioOutputActive)

// Advance through the recording in 100ms steps, printing each
// batch of events before checking for the end so the final
// batch is not dropped.
for {
	events := player.Advance(100 * time.Millisecond)
	for _, e := range events {
		fmt.Printf("[%v] %s\n", e.Offset, e.Type)
	}
	if player.Position() >= player.Duration() {
		break
	}
}

Attach annotations to recordings for review and analysis:

import "github.com/AltairaLabs/PromptKit/runtime/annotations"

// Create annotations (startTime and endTime are time.Time values
// you choose for the range of interest)
anns := []*annotations.Annotation{
	{
		ID:        "quality-1",
		Type:      annotations.TypeScore,
		SessionID: rec.Metadata.SessionID,
		Target:    annotations.ForSession(),
		Key:       "overall_quality",
		Value:     annotations.NewScoreValue(0.92),
	},
	{
		ID:        "highlight-1",
		Type:      annotations.TypeComment,
		SessionID: rec.Metadata.SessionID,
		Target:    annotations.InTimeRange(startTime, endTime),
		Key:       "observation",
		Value:     annotations.NewCommentValue("Good response latency"),
	},
}

// Attach to player
player.SetAnnotations(anns)

// Query active annotations at any position
state := player.GetStateAt(1500 * time.Millisecond)
for _, ann := range state.ActiveAnnotations {
	fmt.Printf("Annotation: %s = %v\n", ann.Key, ann.Value)
}

Use recordings for deterministic test replay:

import "github.com/AltairaLabs/PromptKit/runtime/providers/replay"

// Create replay provider from recording
provider, err := replay.NewProviderFromRecording(rec)
if err != nil {
	panic(err)
}

// Use like any other provider - returns recorded responses
response, err := provider.Complete(ctx, messages, opts)

This enables:

  • Regression testing without API calls
  • Reproducing exact conversation flows
  • Testing against known-good responses
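The core idea can be sketched with a toy provider that hands back pre-recorded responses in capture order. The interface and types below are invented for illustration only; in practice you would use the replay package shown above:

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a toy stand-in for an LLM provider interface,
// not the real PromptKit API.
type Provider interface {
	Complete(prompt string) (string, error)
}

// replayProvider returns recorded responses in the order they were
// captured, so a test run is fully deterministic and needs no API calls.
type replayProvider struct {
	responses []string
	next      int
}

func (p *replayProvider) Complete(prompt string) (string, error) {
	if p.next >= len(p.responses) {
		return "", errors.New("replay exhausted: more calls than recorded responses")
	}
	resp := p.responses[p.next]
	p.next++
	return resp, nil
}

func main() {
	var provider Provider = &replayProvider{
		responses: []string{"Hello! How can I help?", "Our refund window is 30 days."},
	}
	for _, prompt := range []string{"hi", "what's your refund policy?"} {
		resp, err := provider.Complete(prompt)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", prompt, resp)
	}
}
```

Erroring out when the recording is exhausted (rather than looping) surfaces drift between the test and the recorded conversation immediately.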

Use the Eval configuration type to validate and test saved conversations with assertions and LLM judges:

# evals/validate-recording.eval.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Eval
metadata:
  name: validate-support-session
spec:
  id: support-validation
  description: Validate customer support conversation quality
  recording:
    path: recordings/session-abc123.recording.json
    type: session
  judge_targets:
    default:
      type: openai
      model: gpt-4o
      id: quality-judge
  assertions:
    - type: llm_judge
      params:
        judge: default
        criteria: "Did the agent provide helpful and accurate information?"
      expected: pass
    - type: llm_judge
      params:
        judge: default
        criteria: "Was the conversation tone professional and empathetic?"
      expected: pass

Reference the eval in your arena configuration:

# arena.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: evaluation-suite
spec:
  providers:
    - file: providers/openai-gpt4o.provider.yaml
  evals:
    - file: evals/validate-recording.eval.yaml

Run evaluations:

promptarena run --config arena.yaml

This workflow enables:

  • Quality assurance: Validate conversation quality with LLM judges
  • Regression testing: Ensure consistency across model updates
  • Batch evaluation: Test multiple recordings with the same criteria
  • CI/CD integration: Automated conversation quality checks

See the Eval Configuration Reference for complete documentation.

See examples/session-replay/ for a full working example:

cd examples/session-replay
go run demo/replay_example.go
# Or with your own recording:
go run demo/replay_example.go path/to/recording.jsonl

The example demonstrates:

  • Loading recordings
  • Creating a replay player
  • Simulating playback with event correlation
  • Exporting audio tracks
  • Working with annotations

On disk, recordings are organized one file per test run:

out/
└── recordings/
    ├── run-abc123.jsonl
    ├── run-def456.jsonl
    └── run-ghi789.jsonl

Storage considerations:

  • Size: Recordings with audio can be large (audio data is base64-encoded)
  • Retention: Consider cleanup policies for old recordings
  • Compression: JSONL files compress well with gzip

Save recordings as artifacts for debugging failed tests:

# .github/workflows/test.yml
- name: Run Arena Tests
run: promptarena run --ci
- name: Upload Recordings
if: failure()
uses: actions/upload-artifact@v4
with:
name: session-recordings
path: out/recordings/
retention-days: 7

Each recording captures:

  • Complete event stream with precise timestamps
  • Audio chunks for voice conversations
  • Message content and metadata, including multimodal content parts
  • Tool calls with arguments, results, and timing
  • Client tool requests with consent information
  • Provider call timing and token usage
  • Stage and middleware execution lifecycle
  • Workflow state transitions and completion
  • Validation results
  • Post-run human feedback (UserFeedback)
  • Session tags added after completion
  • External annotations (stored in separate files)