# Output Formats

PromptArena supports multiple output formats for test results, each optimized for different use cases. You can generate multiple formats simultaneously from a single test run.
## Supported Formats

| Format | Use Case | File Extension | CI/CD Integration |
|---|---|---|---|
| JSON | Programmatic access, APIs | .json | ✅ Excellent |
| HTML | Human review, reports | .html | ⚠️ Manual review |
| Markdown | Documentation, GitHub | .md | ✅ Good |
| JUnit XML | CI/CD systems | .xml | ✅ Excellent |
## Output Directory Structure

After running tests, Arena creates the following structure:

```
out/
  results.json          # JSON results
  report.html           # HTML report
  report.md             # Markdown report
  junit.xml             # JUnit XML
  media/                # Media storage (images, audio, video)
    run-20241124-123456/
      session-xyz/
        conv-abc/
          image1.png
          image1.png.meta
```

**Media Directory:**
Arena automatically creates a media/ directory to store large media content (images, audio, video) generated or processed during tests. This prevents memory issues and makes test artifacts easy to access.
- Organization: By-run (each test run isolated)
- Deduplication: Enabled (shared media stored once)
- Metadata: Each media file has a `.meta` sidecar with context
- Location: `{output_dir}/media/run-{timestamp}/`

See: Media Storage Documentation for details.
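
For example, here is a minimal Python sketch that walks this layout and lists each run's media files along with their `.meta` sidecars. It assumes the default `out` output directory and treats the sidecar contents as opaque, since their format is covered in the media storage docs.

```python
from pathlib import Path

# Walk out/media/ and list each run's media files and their .meta sidecars.
# Assumes the default output directory "out"; sidecar contents are not parsed here.
media_root = Path("out/media")

for run_dir in sorted(media_root.glob("run-*")):
    print(f"Run: {run_dir.name}")
    for media_file in sorted(run_dir.rglob("*")):
        if media_file.is_file() and media_file.suffix != ".meta":
            sidecar = media_file.with_name(media_file.name + ".meta")
            marker = "with .meta" if sidecar.exists() else "no .meta"
            print(f"  {media_file.relative_to(run_dir)} ({marker})")
```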
## Configuration

Configure output in `arena.yaml`:

```yaml
defaults:
  output:
    dir: out                                         # Output directory
    formats: ["json", "html", "markdown", "junit"]

    # Format-specific options
    json:
      file: results.json
      pretty: true
      include_raw: false

    html:
      file: report.html
      include_metadata: true
      theme: light

    markdown:
      file: report.md
      include_details: true

    junit:
      file: junit.xml
      include_system_out: true
```
## JSON Format

Machine-readable format for programmatic access and integrations.

### Structure

```json
{
  "arena_config": {
    "name": "customer-support-arena",
    "timestamp": "2024-01-15T10:30:00Z",
    "version": "v1.0.0"
  },
  "summary": {
    "total_tests": 15,
    "passed": 12,
    "failed": 3,
    "errors": 0,
    "skipped": 0,
    "total_duration": "45.2s",
    "total_cost": 0.0234,
    "total_tokens": 4521
  },
  "results": [
    {
      "scenario": "basic-qa",
      "provider": "openai-gpt4o-mini",
      "status": "passed",
      "duration": "3.2s",
      "turns": [
        {
          "turn_number": 1,
          "role": "user",
          "content": "What is the capital of France?",
          "response": {
            "role": "assistant",
            "content": "The capital of France is Paris.",
            "cost_info": {
              "input_tokens": 25,
              "output_tokens": 12,
              "cost": 0.00001
            }
          },
          "assertions": [
            {
              "type": "content_includes",
              "passed": true,
              "message": "Should mention Paris",
              "details": null
            }
          ]
        }
      ],
      "cost_info": {
        "total_input_tokens": 75,
        "total_output_tokens": 45,
        "total_cost": 0.00003
      }
    }
  ]
}
```
### Configuration Options

```yaml
json:
  file: results.json    # Output filename
  pretty: true          # Pretty-print JSON
  include_raw: false    # Include raw API responses
```
### Use Cases

**1. API Integration**

```bash
# Parse results in script
jq '.summary.passed' out/results.json
```

**2. Custom Reporting**
```python
import json

with open('out/results.json') as f:
    results = json.load(f)

passed = results['summary']['passed']
total = results['summary']['total_tests']
print(f"Pass rate: {passed/total*100:.1f}%")
```

**3. Data Analysis**

```python
# Analyze costs per provider
for result in results['results']:
    provider = result['provider']
    cost = result['cost_info']['total_cost']
    print(f"{provider}: ${cost:.4f}")
```
### Schema

Complete schema definition:

```typescript
interface TestResults {
  arena_config: {
    name: string
    timestamp: string
    version: string
  }
  summary: {
    total_tests: number
    passed: number
    failed: number
    errors: number
    skipped: number
    total_duration: string
    total_cost: number
    total_tokens: number
    average_cost: number
  }
  results: TestResult[]
}

interface TestResult {
  scenario: string
  provider: string
  status: "passed" | "failed" | "error" | "skipped"
  duration: string
  error?: string
  turns: Turn[]
  cost_info: CostInfo
  metadata?: Record<string, any>
}

interface Turn {
  turn_number: number
  role: "user" | "assistant"
  content: string
  response?: {
    role: string
    content: string
    tool_calls?: ToolCall[]
    cost_info: CostInfo
  }
  assertions?: Assertion[]
}

interface Assertion {
  type: string
  passed: boolean
  message: string
  details?: any
}

interface CostInfo {
  input_tokens: number
  output_tokens: number
  cached_tokens?: number
  cost: number
}
```
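
As a rough illustration of working against this schema, the minimal Python sketch below (assuming the default `out/results.json` path) flattens each result into one row of scenario, provider, status, duration, and cost:

```python
import json

# Flatten results.json into one row per scenario/provider pair,
# using only fields defined in the schema above.
with open("out/results.json") as f:
    results = json.load(f)

rows = [
    {
        "scenario": r["scenario"],
        "provider": r["provider"],
        "status": r["status"],
        "duration": r["duration"],
        "cost": r["cost_info"]["total_cost"],
    }
    for r in results["results"]
]

for row in rows:
    print(f"{row['scenario']:<20} {row['provider']:<22} {row['status']:<8} {row['duration']:<8} ${row['cost']:.5f}")
```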
## HTML Format

Interactive HTML report for human review.
### Features

- Summary Dashboard: Overview with metrics
- Provider Comparison: Side-by-side results
- Conversation View: Full conversation transcripts
- Assertion Details: Pass/fail status with messages
- Cost Breakdown: Token usage and costs
- Filtering: Filter by status, provider, scenario
- Theming: Light and dark modes
### Example Report

```html
<!DOCTYPE html>
<html>
<head>
  <title>PromptArena Test Report</title>
  <style>/* Embedded CSS */</style>
</head>
<body>
  <div class="summary-card">
    <h2>Test Summary</h2>
    <div class="metrics">
      <div class="metric">
        <span class="label">Total</span>
        <span class="value">15</span>
      </div>
      <div class="metric success">
        <span class="label">Passed</span>
        <span class="value">12</span>
      </div>
      <div class="metric failure">
        <span class="label">Failed</span>
        <span class="value">3</span>
      </div>
    </div>
  </div>

  <!-- Detailed results -->
  <div class="results">
    <!-- ... -->
  </div>
</body>
</html>
```
### Configuration Options

```yaml
html:
  file: report.html       # Output filename
  include_metadata: true  # Include test metadata
  theme: light            # Theme: light | dark
```
### Viewing

```bash
# Open in browser
open out/report.html

# Or serve via HTTP
python -m http.server 8000
# Navigate to http://localhost:8000/out/report.html
```
### Sections

#### 1. Summary Dashboard

```text
┌─────────────────────────────────────────┐
│ PromptArena Test Report                 │
│                                         │
│ Total: 15    Passed: 12    Failed: 3    │
│ Duration: 45.2s    Cost: $0.0234        │
│ Tokens: 4521 (input: 3200, output: 1321)│
└─────────────────────────────────────────┘
```
#### 2. Provider Comparison

```text
┌────────────────┬──────────┬─────────┬────────┐
│ Provider       │ Tests    │ Pass %  │ Cost   │
├────────────────┼──────────┼─────────┼────────┤
│ GPT-4o-mini    │ 5        │ 100%    │ $0.008 │
│ Claude Sonnet  │ 5        │ 80%     │ $0.015 │
│ Gemini Flash   │ 5        │ 80%     │ $0.001 │
└────────────────┴──────────┴─────────┴────────┘
```

#### 3. Detailed Results
Each test shows:
- Scenario name and description
- Provider and model
- Pass/fail status
- Full conversation transcript
- Assertion results
- Token usage and cost
- Execution time
### Customization

The HTML report uses embedded CSS. To customize:

- Generate the report
- Save the HTML file
- Edit the `<style>` section
- Reload in browser
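
If you would rather script the override than edit the file by hand, here is a minimal sketch. It assumes the report keeps its CSS in a `<style>` block inside `<head>`, as in the example above; the selector and color are illustrative only.

```python
from pathlib import Path

# Append a custom style override just before </head> so it takes precedence
# over the embedded CSS. The .metric.failure selector mirrors the class names
# in the example report; adjust as needed.
report = Path("out/report.html")
custom_css = "<style>.metric.failure .value { color: #c0392b; }</style>"

html = report.read_text()
report.write_text(html.replace("</head>", f"{custom_css}\n</head>", 1))
```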
## Markdown Format

GitHub-friendly markdown format for documentation.
### Structure

```markdown
# PromptArena Test Report

**Generated**: 2024-01-15 10:30:00
**Arena**: customer-support-arena

## Summary

- **Total Tests**: 15
- **Passed**: 12 ✅
- **Failed**: 3 ❌
- **Duration**: 45.2s
- **Cost**: $0.0234
- **Tokens**: 4,521

## Results by Provider

### OpenAI GPT-4o-mini

#### Scenario: basic-qa

**Status**: ✅ PASSED
**Duration**: 3.2s
**Cost**: $0.00003

##### Turn 1

**User**: What is the capital of France?

**Assistant**: The capital of France is Paris.

**Assertions**:
- ✅ content_includes: Should mention Paris

**Tokens**: 37 (input: 25, output: 12)
**Cost**: $0.00001

---

### Claude 3.5 Sonnet

...
```
### Configuration Options

```yaml
markdown:
  file: report.md         # Output filename
  include_details: true   # Include full conversation details
```
### Use Cases

**1. GitHub Actions Summary**

```yaml
- name: Generate Report
  run: promptarena run arena.yaml

- name: Comment on PR
  uses: actions/github-script@v6
  with:
    script: |
      const fs = require('fs');
      const report = fs.readFileSync('out/report.md', 'utf8');
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: report
      });
```

**2. Documentation**
Include test results in docs:
```markdown
# API Testing Results

[include file="test-results/report.md"]
```

**3. Slack/Teams Notifications**
Send markdown to collaboration tools:
```bash
# Convert to Slack format
cat out/report.md | slack-markdown-converter | \
  slack-cli chat-post-message --channel #testing
```
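
Alternatively, here is a minimal Python sketch that posts the report to a Slack incoming webhook. The webhook URL is a placeholder, and Slack renders its own `mrkdwn` flavor, so heavy markdown may need converting first.

```python
import json
import urllib.request

# Post the markdown report as a message to a Slack incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

with open("out/report.md") as f:
    report = f.read()

payload = json.dumps({"text": report}).encode("utf-8")
request = urllib.request.Request(
    SLACK_WEBHOOK_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(request)
```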
## JUnit XML Format

Standard format for CI/CD systems (Jenkins, GitLab CI, GitHub Actions, etc.).

### Structure

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="PromptArena Tests" tests="15" failures="3" errors="0" time="45.2">
  <testsuite name="basic-qa" tests="3" failures="1" errors="0" time="9.6">
    <testcase name="basic-qa.openai-gpt4o-mini" classname="basic-qa" time="3.2">
      <system-out>
        Turn 1:
        User: What is the capital of France?
        Assistant: The capital of France is Paris.
        Assertions: ✅ content_includes
      </system-out>
    </testcase>

    <testcase name="basic-qa.claude-sonnet" classname="basic-qa" time="3.1">
      <failure message="Assertion failed: content_includes" type="AssertionFailure">
        Expected: Paris
        Actual: The capital city of France is Paris.
        Assertion: Should mention Paris
      </failure>
    </testcase>
  </testsuite>
</testsuites>
```
### Configuration Options

```yaml
junit:
  file: junit.xml            # Output filename
  include_system_out: true   # Include conversation in <system-out>
```
## CI/CD Integration

### GitHub Actions

```yaml
- name: Run Tests
  run: promptarena run arena.yaml

- name: Publish Test Results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: out/junit.xml
```
### GitLab CI

```yaml
test:
  script:
    - promptarena run arena.yaml
  artifacts:
    reports:
      junit: out/junit.xml
```
### Jenkins

```groovy
pipeline {
  agent any
  stages {
    stage('Test') {
      steps {
        sh 'promptarena run arena.yaml'
      }
      post {
        always {
          junit 'out/junit.xml'
        }
      }
    }
  }
}
```
### CircleCI

```yaml
- run:
    name: Run Tests
    command: promptarena run arena.yaml
- store_test_results:
    path: out/junit.xml
```
## Multiple Formats

Generate all formats in one run:

```yaml
defaults:
  output:
    formats: ["json", "html", "markdown", "junit"]
```

Output structure:

```
out/
├── results.json
├── report.html
├── report.md
└── junit.xml
```
## Custom Output Directory

```yaml
defaults:
  output:
    dir: test-results-2024-01-15
```

```bash
# Or override via CLI
promptarena run arena.yaml --output custom-dir
```
## Programmatic Access

### Python

```python
import json

# Load JSON results
with open('out/results.json') as f:
    results = json.load(f)

# Calculate pass rate
summary = results['summary']
pass_rate = summary['passed'] / summary['total_tests'] * 100

# Find expensive scenarios
for result in results['results']:
    if result['cost_info']['total_cost'] > 0.01:
        print(f"Expensive: {result['scenario']} - ${result['cost_info']['total_cost']:.4f}")

# Find failing assertions
for result in results['results']:
    if result['status'] == 'failed':
        for turn in result['turns']:
            for assertion in turn.get('assertions', []):
                if not assertion['passed']:
                    print(f"Failed: {result['scenario']} - {assertion['message']}")
```

### Node.js
```javascript
const fs = require('fs');

// Load results
const results = JSON.parse(fs.readFileSync('out/results.json'));

// Generate custom report
const report = results.results.map(r => ({
  scenario: r.scenario,
  provider: r.provider,
  passed: r.status === 'passed',
  cost: r.cost_info.total_cost
}));

console.table(report);
```

### Go

```go
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

type Results struct {
    Summary struct {
        TotalTests int     `json:"total_tests"`
        Passed     int     `json:"passed"`
        TotalCost  float64 `json:"total_cost"`
    } `json:"summary"`
}

func main() {
    data, _ := os.ReadFile("out/results.json")
    var results Results
    json.Unmarshal(data, &results)

    passRate := float64(results.Summary.Passed) / float64(results.Summary.TotalTests) * 100
    fmt.Printf("Pass Rate: %.1f%%\n", passRate)
    fmt.Printf("Total Cost: $%.4f\n", results.Summary.TotalCost)
}
```
## Performance Considerations

### File Sizes

Typical sizes for 100 tests:

| Format | Approx. Size | Notes |
|---|---|---|
| JSON | 500 KB | Can be large with raw responses |
| HTML | 800 KB | Embedded CSS/JS |
| Markdown | 300 KB | Compact, human-readable |
| JUnit XML | 200 KB | Minimal data |
### Optimization

Reduce JSON size:

```yaml
json:
  include_raw: false   # Omit raw API responses
  pretty: false        # No formatting
```

Faster HTML generation:

```yaml
html:
  include_metadata: false   # Skip detailed metadata
```
## Best Practices

### 1. Use Right Format for Context

```yaml
# Development
formats: ["html"]            # Quick visual review

# CI/CD
formats: ["junit", "json"]   # Integration + data

# Documentation
formats: ["markdown"]        # Human-readable

# Production
formats: ["json", "junit"]   # Programmatic + CI
```
### 2. Version Control

```
# .gitignore
out/
test-results/
*.html
```

Commit configuration, not results.
### 3. Archive Historical Results

```bash
# Archive with timestamp
DATE=$(date +%Y%m%d-%H%M%S)
mv out test-results-$DATE
tar -czf test-results-$DATE.tar.gz test-results-$DATE
```

### 4. Parse for Metrics
```bash
# Extract pass rate
jq '.summary | {total: .total_tests, passed: .passed, rate: ((.passed / .total_tests) * 100)}' out/results.json

# Extract cost by provider
jq '.results | group_by(.provider) | map({provider: .[0].provider, cost: map(.cost_info.total_cost) | add})' out/results.json
```
## Next Steps

- CI/CD Integration - Running in pipelines
- Configuration Reference - Output configuration
- Best Practices - Production tips

Examples: See `examples/` for output configuration patterns.