Output Formats
PromptArena supports multiple output formats for test results, each optimized for different use cases. You can generate multiple formats simultaneously from a single test run.
Supported Formats
| Format | Use Case | File Extension | CI/CD Integration |
|---|---|---|---|
| JSON | Programmatic access, APIs | .json | ✅ Excellent |
| HTML | Human review, reports | .html | ⚠️ Manual review |
| Markdown | Documentation, GitHub | .md | ✅ Good |
| JUnit XML | CI/CD systems | .xml | ✅ Excellent |
Output Directory Structure
After running tests, Arena creates the following structure:
out/
├── results.json              # JSON results
├── report.html               # HTML report
├── report.md                 # Markdown report
├── junit.xml                 # JUnit XML
└── media/                    # Media storage (images, audio, video)
    └── run-20241124-123456/
        └── session-xyz/
            └── conv-abc/
                ├── image1.png
                └── image1.png.meta
Media Directory:
Arena automatically creates a media/ directory to store large media content (images, audio, video) generated or processed during tests. This prevents memory issues and makes test artifacts easy to access.
- Organization: By-run (each test run isolated)
- Deduplication: Enabled (shared media stored once)
- Metadata: Each media file has a .meta sidecar with context
- Location: {output_dir}/media/run-{timestamp}/
See: Media Storage Documentation for details.
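Because each media file sits next to a .meta sidecar, test artifacts are easy to script over. A minimal sketch, assuming the sidecar content is JSON (see the Media Storage documentation for the authoritative sidecar format):

# Illustrative only: walk the media directory and print each file with its sidecar.
# Assumes the .meta sidecar is JSON; falls back to raw text if it is not.
import json
from pathlib import Path

media_root = Path("out/media")
for meta_path in media_root.rglob("*.meta"):
    media_path = meta_path.with_suffix("")  # image1.png.meta -> image1.png
    try:
        meta = json.loads(meta_path.read_text())
    except json.JSONDecodeError:
        meta = meta_path.read_text()
    print(media_path, meta)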
Configuration
Configure output in arena.yaml:
defaults:
  output:
    dir: out                                        # Output directory
    formats: ["json", "html", "markdown", "junit"]

    # Format-specific options
    json:
      file: results.json
      pretty: true
      include_raw: false
    html:
      file: report.html
      include_metadata: true
      theme: light
    markdown:
      file: report.md
      include_details: true
    junit:
      file: junit.xml
      include_system_out: true
JSON Format
Machine-readable format for programmatic access and integrations.
Structure
{
  "arena_config": {
    "name": "customer-support-arena",
    "timestamp": "2024-01-15T10:30:00Z",
    "version": "v1.0.0"
  },
  "summary": {
    "total_tests": 15,
    "passed": 12,
    "failed": 3,
    "errors": 0,
    "skipped": 0,
    "total_duration": "45.2s",
    "total_cost": 0.0234,
    "total_tokens": 4521
  },
  "results": [
    {
      "scenario": "basic-qa",
      "provider": "openai-gpt4o-mini",
      "status": "passed",
      "duration": "3.2s",
      "turns": [
        {
          "turn_number": 1,
          "role": "user",
          "content": "What is the capital of France?",
          "response": {
            "role": "assistant",
            "content": "The capital of France is Paris.",
            "cost_info": {
              "input_tokens": 25,
              "output_tokens": 12,
              "cost": 0.00001
            }
          },
          "assertions": [
            {
              "type": "content_includes",
              "passed": true,
              "message": "Should mention Paris",
              "details": null
            }
          ]
        }
      ],
      "cost_info": {
        "total_input_tokens": 75,
        "total_output_tokens": 45,
        "total_cost": 0.00003
      }
    }
  ]
}
Configuration Options
json:
  file: results.json     # Output filename
  pretty: true           # Pretty-print JSON
  include_raw: false     # Include raw API responses
Use Cases
1. API Integration
# Parse results in script
jq '.summary.passed' out/results.json
2. Custom Reporting
import json

with open('out/results.json') as f:
    results = json.load(f)

passed = results['summary']['passed']
total = results['summary']['total_tests']
print(f"Pass rate: {passed/total*100:.1f}%")
3. Data Analysis
# Analyze costs per provider
for result in results['results']:
    provider = result['provider']
    cost = result['cost_info']['total_cost']
    print(f"{provider}: ${cost:.4f}")
Schema
Complete schema definition:
interface TestResults {
  arena_config: {
    name: string
    timestamp: string
    version: string
  }
  summary: {
    total_tests: number
    passed: number
    failed: number
    errors: number
    skipped: number
    total_duration: string
    total_cost: number
    total_tokens: number
    average_cost: number
  }
  results: TestResult[]
}

interface TestResult {
  scenario: string
  provider: string
  status: "passed" | "failed" | "error" | "skipped"
  duration: string
  error?: string
  turns: Turn[]
  cost_info: CostInfo
  metadata?: Record<string, any>
}

interface Turn {
  turn_number: number
  role: "user" | "assistant"
  content: string
  response?: {
    role: string
    content: string
    tool_calls?: ToolCall[]
    cost_info: CostInfo
  }
  assertions?: Assertion[]
}

interface Assertion {
  type: string
  passed: boolean
  message: string
  details?: any
}

interface CostInfo {
  input_tokens: number
  output_tokens: number
  cached_tokens?: number
  cost: number
}
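Downstream tooling can mirror whichever parts of this schema it needs. A minimal Python sketch; the dataclass below is illustrative only (not part of PromptArena) and mirrors just the summary fields listed above:

# Load the JSON results into a typed object for the summary section.
import json
from dataclasses import dataclass

@dataclass
class Summary:
    total_tests: int
    passed: int
    failed: int
    errors: int
    skipped: int
    total_duration: str
    total_cost: float
    total_tokens: int

with open("out/results.json") as f:
    raw = json.load(f)

# Ignore fields not declared above (e.g. average_cost) to stay forward-compatible.
summary = Summary(**{k: v for k, v in raw["summary"].items() if k in Summary.__annotations__})
print(summary)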
HTML Format
Interactive HTML report for human review.
Features
- Summary Dashboard: Overview with metrics
- Provider Comparison: Side-by-side results
- Conversation View: Full conversation transcripts
- Assertion Details: Pass/fail status with messages
- Cost Breakdown: Token usage and costs
- Filtering: Filter by status, provider, scenario
- Theming: Light and dark modes
Example Report
<!DOCTYPE html>
<html>
  <head>
    <title>PromptArena Test Report</title>
    <style>/* Embedded CSS */</style>
  </head>
  <body>
    <div class="summary-card">
      <h2>Test Summary</h2>
      <div class="metrics">
        <div class="metric">
          <span class="label">Total</span>
          <span class="value">15</span>
        </div>
        <div class="metric success">
          <span class="label">Passed</span>
          <span class="value">12</span>
        </div>
        <div class="metric failure">
          <span class="label">Failed</span>
          <span class="value">3</span>
        </div>
      </div>
    </div>

    <!-- Detailed results -->
    <div class="results">
      <!-- ... -->
    </div>
  </body>
</html>
Configuration Options
html:
  file: report.html        # Output filename
  include_metadata: true   # Include test metadata
  theme: light             # Theme: light | dark
Viewing
# Open in browser
open out/report.html
# Or serve via HTTP
python -m http.server 8000
# Navigate to http://localhost:8000/out/report.html
Sections
1. Summary Dashboard
┌───────────────────────────────────────────┐
│ PromptArena Test Report                   │
│                                           │
│ Total: 15    Passed: 12    Failed: 3      │
│ Duration: 45.2s    Cost: $0.0234          │
│ Tokens: 4521 (input: 3200, output: 1321)  │
└───────────────────────────────────────────┘
2. Provider Comparison
┌────────────────┬─────────┬─────────┬─────────┐
│ Provider       │ Tests   │ Pass %  │ Cost    │
├────────────────┼─────────┼─────────┼─────────┤
│ GPT-4o-mini    │ 5       │ 100%    │ $0.008  │
│ Claude Sonnet  │ 5       │ 80%     │ $0.015  │
│ Gemini Flash   │ 5       │ 80%     │ $0.001  │
└────────────────┴─────────┴─────────┴─────────┘
3. Detailed Results
Each test shows:
- Scenario name and description
- Provider and model
- Pass/fail status
- Full conversation transcript
- Assertion results
- Token usage and cost
- Execution time
Customization
The HTML report uses embedded CSS. To customize:
- Generate the report
- Save the HTML file
- Edit the <style> section
- Reload in browser
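For repeatable tweaks, the edit step can also be scripted. A hedged sketch that appends a CSS override just before the closing </style> tag; the .summary-card selector is taken from the example above and may differ in your generated report:

# Append a CSS override to the embedded stylesheet of the generated report.
report = "out/report.html"
override = ".summary-card { border: 2px solid #555; }\n"

with open(report, encoding="utf-8") as f:
    html = f.read()

with open(report, "w", encoding="utf-8") as f:
    f.write(html.replace("</style>", override + "</style>", 1))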
Markdown Format
GitHub-friendly markdown format for documentation.
Structure
# PromptArena Test Report
**Generated**: 2024-01-15 10:30:00
**Arena**: customer-support-arena
## Summary
- **Total Tests**: 15
- **Passed**: 12 ✅
- **Failed**: 3 ❌
- **Duration**: 45.2s
- **Cost**: $0.0234
- **Tokens**: 4,521
## Results by Provider
### OpenAI GPT-4o-mini
#### Scenario: basic-qa
**Status**: ✅ PASSED
**Duration**: 3.2s
**Cost**: $0.00003
##### Turn 1
**User**: What is the capital of France?
**Assistant**: The capital of France is Paris.
**Assertions**:
- ✅ content_includes: Should mention Paris
**Tokens**: 37 (input: 25, output: 12)
**Cost**: $0.00001
---
### Claude 3.5 Sonnet
...
Configuration Options
markdown:
  file: report.md          # Output filename
  include_details: true    # Include full conversation details
Use Cases
1. GitHub Actions Summary
- name: Generate Report
  run: promptarena run arena.yaml

- name: Comment on PR
  uses: actions/github-script@v6
  with:
    script: |
      const fs = require('fs');
      const report = fs.readFileSync('out/report.md', 'utf8');
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: report
      });
2. Documentation
Include test results in docs:
# API Testing Results
[include file="test-results/report.md"]
3. Slack/Teams Notifications
Send markdown to collaboration tools:
# Convert to Slack format
cat out/report.md | slack-markdown-converter | \
slack-cli chat-post-message --channel #testing
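Alternatively, the report can be posted through a standard Slack incoming webhook without extra CLI tooling. A minimal sketch, assuming a SLACK_WEBHOOK_URL environment variable and the requests package (Slack's mrkdwn dialect differs slightly from GitHub-flavored Markdown, so some formatting may not render exactly):

# Post the markdown report to a Slack incoming webhook.
# SLACK_WEBHOOK_URL is an assumption for this example; configure your own webhook.
import os
import requests

with open("out/report.md", encoding="utf-8") as f:
    report = f.read()

resp = requests.post(
    os.environ["SLACK_WEBHOOK_URL"],
    json={"text": report[:3500]},  # truncate defensively; very large payloads may be rejected
)
resp.raise_for_status()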
JUnit XML Format
Standard format for CI/CD systems (Jenkins, GitLab CI, GitHub Actions, etc.).
Structure
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="PromptArena Tests" tests="15" failures="3" errors="0" time="45.2">
  <testsuite name="basic-qa" tests="3" failures="1" errors="0" time="9.6">
    <testcase name="basic-qa.openai-gpt4o-mini" classname="basic-qa" time="3.2">
      <system-out>
        Turn 1:
        User: What is the capital of France?
        Assistant: The capital of France is Paris.
        Assertions: ✅ content_includes
      </system-out>
    </testcase>
    <testcase name="basic-qa.claude-sonnet" classname="basic-qa" time="3.1">
      <failure message="Assertion failed: content_includes" type="AssertionFailure">
        Expected: Paris
        Actual: The capital of France is a large European city.
        Assertion: Should mention Paris
      </failure>
    </testcase>
  </testsuite>
</testsuites>
Configuration Options
junit:
  file: junit.xml             # Output filename
  include_system_out: true    # Include conversation in <system-out>
CI/CD Integration
GitHub Actions
- name: Run Tests
  run: promptarena run arena.yaml

- name: Publish Test Results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: out/junit.xml
GitLab CI
test:
  script:
    - promptarena run arena.yaml
  artifacts:
    reports:
      junit: out/junit.xml
Jenkins
pipeline {
    agent any
    stages {
        stage('Test') {
            steps {
                sh 'promptarena run arena.yaml'
            }
            post {
                always {
                    junit 'out/junit.xml'
                }
            }
        }
    }
}
CircleCI
- run:
    name: Run Tests
    command: promptarena run arena.yaml
- store_test_results:
    path: out/junit.xml
Multiple Formats
Generate all formats in one run:
defaults:
  output:
    formats: ["json", "html", "markdown", "junit"]
Output structure:
out/
├── results.json
├── report.html
├── report.md
└── junit.xml
Custom Output Directory
defaults:
  output:
    dir: test-results-2024-01-15
# Or override via CLI
promptarena run arena.yaml --output custom-dir
Programmatic Access
Python
import json

# Load JSON results
with open('out/results.json') as f:
    results = json.load(f)

# Calculate pass rate
summary = results['summary']
pass_rate = summary['passed'] / summary['total_tests'] * 100

# Find expensive scenarios
for result in results['results']:
    if result['cost_info']['total_cost'] > 0.01:
        print(f"Expensive: {result['scenario']} - ${result['cost_info']['total_cost']:.4f}")

# Find failing assertions
for result in results['results']:
    if result['status'] == 'failed':
        for turn in result['turns']:
            for assertion in turn.get('assertions', []):
                if not assertion['passed']:
                    print(f"Failed: {result['scenario']} - {assertion['message']}")
Node.js
const fs = require('fs');

// Load results
const results = JSON.parse(fs.readFileSync('out/results.json', 'utf8'));

// Generate custom report
const report = results.results.map(r => ({
  scenario: r.scenario,
  provider: r.provider,
  passed: r.status === 'passed',
  cost: r.cost_info.total_cost
}));

console.table(report);
Go
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

type Results struct {
    Summary struct {
        TotalTests int     `json:"total_tests"`
        Passed     int     `json:"passed"`
        TotalCost  float64 `json:"total_cost"`
    } `json:"summary"`
}

func main() {
    data, _ := os.ReadFile("out/results.json")
    var results Results
    json.Unmarshal(data, &results)

    passRate := float64(results.Summary.Passed) / float64(results.Summary.TotalTests) * 100
    fmt.Printf("Pass Rate: %.1f%%\n", passRate)
    fmt.Printf("Total Cost: $%.4f\n", results.Summary.TotalCost)
}
Performance Considerations
File Sizes
Typical sizes for 100 tests:
| Format | Approx. Size | Notes |
|---|---|---|
| JSON | 500 KB | Can be large with raw responses |
| HTML | 800 KB | Embedded CSS/JS |
| Markdown | 300 KB | Most compact |
| JUnit XML | 200 KB | Minimal data |
Optimization
Reduce JSON size:
json:
  include_raw: false    # Omit raw API responses
  pretty: false         # No formatting
Faster HTML generation:
html:
  include_metadata: false   # Skip detailed metadata
Best Practices
1. Use Right Format for Context
# Development
formats: ["html"] # Quick visual review
# CI/CD
formats: ["junit", "json"] # Integration + data
# Documentation
formats: ["markdown"] # Human-readable
# Production
formats: ["json", "junit"] # Programmatic + CI
2. Version Control
# .gitignore
out/
test-results/
*.html
Commit configuration, not results.
3. Archive Historical Results
# Archive with timestamp
DATE=$(date +%Y%m%d-%H%M%S)
mv out test-results-$DATE
tar -czf test-results-$DATE.tar.gz test-results-$DATE
4. Parse for Metrics
# Extract pass rate
jq '.summary | {total: .total_tests, passed: .passed, rate: ((.passed / .total_tests) * 100)}' out/results.json
# Extract cost by provider
jq '.results | group_by(.provider) | map({provider: .[0].provider, cost: map(.cost_info.total_cost) | add})' out/results.json
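The same metrics can gate a pipeline. A minimal sketch that fails the build when the pass rate drops below a chosen threshold (the 95% value is only an example):

# Exit non-zero if the pass rate falls below the threshold.
import json
import sys

THRESHOLD = 95.0  # example threshold, tune for your project

with open("out/results.json") as f:
    summary = json.load(f)["summary"]

pass_rate = summary["passed"] / summary["total_tests"] * 100
print(f"Pass rate: {pass_rate:.1f}%")
sys.exit(0 if pass_rate >= THRESHOLD else 1)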
Next Steps
- CI/CD Integration - Running in pipelines
- Configuration Reference - Output configuration
- Best Practices - Production tips
Examples: See examples/ for output configuration patterns.