CLI Commands
Complete command-line interface reference for PromptArena, the LLM testing framework.
Overview
PromptArena (promptarena) is a CLI tool for running multi-turn conversation simulations across multiple LLM providers, validating conversation flows, and generating comprehensive test reports.
```sh
promptarena [command] [flags]
```
Commands
| Command | Description |
|---|---|
| init | Initialize a new Arena test project from template (built-in or remote) |
| run | Run conversation simulations (main command) |
| mocks | Generate mock provider responses from Arena JSON results |
| config-inspect | Inspect and validate configuration |
| debug | Debug configuration and prompt loading |
| prompt-debug | Debug and test prompt generation |
| render | Generate HTML report from existing results |
| completion | Generate shell autocompletion script |
| help | Help about any command |
Global Flags
```sh
-h, --help   help for promptarena
```
promptarena init
Initialize a new PromptArena test project from a built-in or remote template.
```sh
promptarena init [directory] [flags]
```
| Flag | Type | Default | Description |
|---|---|---|---|
| --quick | bool | false | Skip interactive prompts, use defaults |
| --provider | string | - | Provider to configure (mock, openai, claude, gemini) |
| --template | string | quick-start | Template to use for initialization |
| --list-templates | bool | false | List all available built-in templates |
| --var | []string | - | Set template variables (key=value) |
| --template-index | string | community | Template repo name or index URL/path for remote templates |
| --repo-config | string | user config | Template repo config file |
| --template-cache | string | temp dir | Cache directory for remote templates |
Built-In Templates
PromptArena includes 6 built-in templates:
| Template | Files Generated | Description |
|---|---|---|
| basic-chatbot | 6 files | Simple conversational testing setup |
| customer-support | 10 files | Support agent with KB search and order status tools |
| code-assistant | 9 files | Code generation and review with separate prompts |
| content-generation | 9 files | Creative content for blogs, products, social media |
| multimodal | 7 files | Image analysis and vision testing |
| mcp-integration | 7 files | MCP filesystem server integration |
Examples
List Available Templates
```sh
# See all built-in templates
promptarena init --list-templates

# List remote templates (from the default community repo)
promptarena templates list

# List remote templates from a named repo
promptarena templates repo add --name internal --url https://example.com/index.yaml
promptarena templates list --index internal

# List using repo/template shorthand
promptarena templates list --index community
```
Quick Start
```sh
# Create project with defaults (basic-chatbot template)
promptarena init my-test --quick

# With specific provider
promptarena init my-test --quick --provider openai

# With specific template
promptarena init my-test --quick --template customer-support --provider openai

# Render a remote template explicitly
promptarena templates fetch --template community/basic-chatbot --version 1.0.0
promptarena templates render --template community/basic-chatbot --version 1.0.0 --out ./out
```
Interactive Mode
```sh
# Interactive prompts guide you through setup
promptarena init my-project
```
Template Variables
```sh
# Override template variables
promptarena init my-test --quick --provider openai \
  --var project_name="My Custom Project" \
  --var description="Custom description" \
  --var temperature=0.8
```
What Gets Created
Depending on the template, init creates:
- arena.yaml - Main Arena configuration
- prompts/ - Prompt configurations
- providers/ - Provider configurations
- scenarios/ - Test scenarios
- tools/ - Tool definitions (customer-support template)
- .env - Environment variables with API key placeholders
- .gitignore - Ignores .env and output files
- README.md - Project documentation and usage instructions
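For example, a quick-start init produces a layout along these lines (illustrative; the exact file set varies by template, and tools/ appears only for customer-support):

```
my-test/
├── arena.yaml
├── .env
├── .gitignore
├── README.md
├── prompts/
├── providers/
├── scenarios/
└── tools/
```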
Template Comparison
basic-chatbot (6 files):
- Best for: Beginners, simple testing
- Includes: 1 prompt, 1 provider, 1 basic scenario
customer-support (10 files):
- Best for: Support agent testing, tool calling
- Includes: 1 prompt, 3 scenarios, 2 tools (KB search, order status)
code-assistant (9 files):
- Best for: Code generation workflows
- Includes: 2 prompts (generator, reviewer), 3 scenarios
- Temperature: 0.3 (deterministic)
content-generation (9 files):
- Best for: Marketing, creative writing
- Includes: 2 prompts (blog, marketing), 3 scenarios
- Temperature: 0.8 (creative)
multimodal (7 files):
- Best for: Vision AI, image analysis
- Includes: 1 vision prompt, 2 scenarios with sample images
mcp-integration (7 files):
- Best for: MCP server testing, tool integration
- Includes: 1 prompt, 2 scenarios, MCP filesystem server config
After Initialization
```sh
# Navigate to project
cd my-test

# Add your API key to .env
echo "OPENAI_API_KEY=sk-..." >> .env

# Run tests
promptarena run

# View results
open out/report.html
```
promptarena mocks generate
Generate mock provider YAML from recorded Arena JSON results so you can replay conversations without calling real LLMs.
```sh
promptarena mocks generate [flags]
```
| Flag | Type | Default | Description |
|---|---|---|---|
| --input, -i | string | out | Arena JSON result file or directory containing *.json runs |
| --output, -o | string | providers/mock-generated.yaml | Output file path or directory (when --per-scenario is set) |
| --per-scenario | bool | false | Write one YAML file per scenario (in --output directory) |
| --merge | bool | false | Merge with existing mock file(s) instead of overwriting |
| --scenario | []string | - | Only include specified scenario IDs |
| --provider | []string | - | Only include specified provider IDs |
| --dry-run | bool | false | Print generated YAML instead of writing files |
| --default-response | string | - | Set defaultResponse when not present |
Examples
Generate a consolidated mock file from the latest runs:
```sh
promptarena mocks generate \
  --input out \
  --scenario hardware-faults \
  --provider openai-gpt4o \
  --output providers/mock-generated.yaml \
  --merge
```
Write one file per scenario:
```sh
promptarena mocks generate \
  --input out \
  --per-scenario \
  --output providers/responses \
  --merge
```
Preview without writing:
```sh
promptarena mocks generate --input out --dry-run
```
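If some conversations have no recorded match, you can also set a fallback while generating, using the --default-response flag from the table above (the response text here is illustrative):

```sh
# Add a fallback response for turns without a recorded match
promptarena mocks generate \
  --input out \
  --default-response "This is a mock fallback response." \
  --output providers/mock-generated.yaml
```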
promptarena run
Run multi-turn conversation simulations across multiple LLM providers.
```sh
promptarena run [flags]
```
Configuration
| Flag | Type | Default | Description |
|---|---|---|---|
| -c, --config | string | arena.yaml | Configuration file path |
Execution Control
| Flag | Type | Default | Description |
|---|---|---|---|
| -j, --concurrency | int | 6 | Number of concurrent workers |
| -s, --seed | int | 42 | Random seed for reproducibility |
| --ci | bool | false | CI mode (headless, minimal output) |
Filtering
| Flag | Type | Default | Description |
|---|---|---|---|
| --provider | []string | all | Providers to use (comma-separated) |
| --scenario | []string | all | Scenarios to run (comma-separated) |
| --region | []string | all | Regions to run (comma-separated) |
| --roles | []string | all | Self-play role configurations to use |
Parameter Overrides
| Flag | Type | Default | Description |
|---|---|---|---|
| --temperature | float32 | 0.6 | Override temperature for all scenarios |
| --max-tokens | int | - | Override max tokens for all scenarios |
Self-Play Mode
| Flag | Type | Default | Description |
|---|---|---|---|
| --selfplay | bool | false | Enable self-play mode |
Mock Testing
| Flag | Type | Default | Description |
|---|---|---|---|
| --mock-provider | bool | false | Replace all providers with MockProvider |
| --mock-config | string | - | Path to mock provider configuration (YAML) |
Output Configuration
| Flag | Type | Default | Description |
|---|---|---|---|
| -o, --out | string | out | Output directory |
| --format | []string | from config | Output formats: json, junit, html, markdown |
| --formats | []string | from config | Alias for --format |
Legacy Output Flags (Deprecated)
| Flag | Type | Default | Description |
|---|---|---|---|
| --html | bool | false | Generate HTML report (use --format html instead) |
| --html-file | string | out/report-[timestamp].html | HTML report output file |
| --junit-file | string | out/junit.xml | JUnit XML output file |
| --markdown-file | string | out/results.md | Markdown report output file |
Debugging
| Flag | Type | Default | Description |
|---|---|---|---|
| -v, --verbose | bool | false | Enable verbose debug logging for API calls |
Examples
Basic Run
```sh
# Run all tests with default configuration
promptarena run

# Specify configuration file
promptarena run --config my-arena.yaml
```
Filter Execution
```sh
# Run specific providers only
promptarena run --provider openai,claude

# Run specific scenarios
promptarena run --scenario basic-qa,edge-cases

# Combine filters
promptarena run --provider openai --scenario customer-support
```
Control Parallelism
```sh
# Run with 3 concurrent workers
promptarena run --concurrency 3

# Sequential execution (no parallelism)
promptarena run --concurrency 1
```
Override Parameters
```sh
# Override temperature for all tests
promptarena run --temperature 0.8

# Override max tokens
promptarena run --max-tokens 500

# Combined overrides
promptarena run --temperature 0.9 --max-tokens 1000
```
Output Formats
```sh
# Generate JSON and HTML reports
promptarena run --format json,html

# Generate all available formats
promptarena run --format json,junit,html,markdown

# Custom output directory
promptarena run --out test-results-2024-01-15

# Specify custom HTML filename (legacy)
promptarena run --html --html-file custom-report.html
```
Mock Testing
```sh
# Use mock provider instead of real APIs (fast, no cost)
promptarena run --mock-provider

# Use custom mock configuration
promptarena run --mock-config mock-responses.yaml
```
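The mock configuration schema is not documented on this page, so treat the sketch below as illustrative only: defaultResponse is the key referenced by promptarena mocks generate above, while the other keys are hypothetical placeholders. The reliable way to see the real structure is promptarena mocks generate --dry-run.

```yaml
# Illustrative sketch - only defaultResponse is confirmed by this page;
# the responses/match/reply keys are hypothetical
defaultResponse: "I'm a mock response."
responses:
  - match: "order status"
    reply: "Your order has shipped."
```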
Self-Play Mode
```sh
# Enable self-play testing
promptarena run --selfplay

# Self-play with specific roles
promptarena run --selfplay --roles frustrated-customer,tech-support
```
CI/CD Mode
```sh
# Headless mode for CI pipelines
promptarena run --ci --format junit,json

# With specific quality gates
promptarena run --ci --concurrency 3 --format junit
```
Debugging
```sh
# Verbose output for troubleshooting
promptarena run --verbose

# Verbose with specific scenario
promptarena run --verbose --scenario failing-test
```
Reproducible Tests
```sh
# Use specific seed for reproducibility
promptarena run --seed 12345

# Same seed across runs produces same results
promptarena run --seed 12345 --provider openai
```
promptarena config-inspect
Inspect and validate arena configuration, showing all loaded resources and validating cross-references. This command provides a rich, styled display of your configuration with validation results.
```sh
promptarena config-inspect [flags]
```
| Flag | Type | Default | Description |
|---|---|---|---|
| -c, --config | string | arena.yaml | Configuration file path |
| --format | string | text | Output format: text, json |
| -s, --short | bool | false | Show only validation results (shortcut for --section validation) |
| --section | string | - | Focus on specific section: prompts, providers, scenarios, tools, selfplay, judges, defaults, validation |
| --verbose | bool | false | Show detailed information including file contents |
| --stats | bool | false | Show cache statistics |
Examples
```sh
# Inspect default configuration
promptarena config-inspect

# Inspect specific config file
promptarena config-inspect --config staging-arena.yaml

# Verbose output with full details
promptarena config-inspect --verbose

# Quick validation check only
promptarena config-inspect --short
# or
promptarena config-inspect -s

# Focus on specific section
promptarena config-inspect --section providers
promptarena config-inspect --section selfplay
promptarena config-inspect --section validation

# JSON output for programmatic use
promptarena config-inspect --format json

# Show cache statistics
promptarena config-inspect --stats
```
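The JSON output is convenient to post-process with jq. The field names are not documented here, so discover them first rather than assuming the shape below:

```sh
# Discover the actual top-level structure
promptarena config-inspect --format json | jq 'keys'

# Then extract the part you need (.validation is a guessed field name)
promptarena config-inspect --format json | jq '.validation'
```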
Sections
The --section flag allows focusing on specific parts of the configuration:
| Section | Description |
|---|---|
| prompts | Prompt configurations with task types, variables, validators |
| providers | Provider details organized by group (default, judge, selfplay) |
| scenarios | Scenario details with turn counts and assertion summaries |
| tools | Tool definitions with modes, parameters, timeouts |
| selfplay | Self-play configuration including personas and roles |
| judges | Judge configurations for LLM-as-judge validators |
| defaults | Default settings (temperature, max tokens, concurrency) |
| validation | Validation results and connectivity checks |
Output
The command displays styled boxes with:
- Loaded prompt configurations with task types, variables, and validators
- Configured providers organized by group (default, judge, selfplay)
- Available scenarios with turn counts and assertion summaries
- Tool definitions with modes and parameters
- Self-play roles with persona associations
- Judge configurations
- Default settings
- Cross-reference validation results with connectivity checks
Example Output:
```
✨ PromptArena Configuration Inspector ✨

╭──────────────────────────────────────────────────────────────────────────────╮
│ Configuration: arena.yaml                                                     │
╰──────────────────────────────────────────────────────────────────────────────╯

📋 Prompt Configs (2)

╭──────────────────────────────────────────────────────────────────────────────╮
│ troubleshooter-v2                                                             │
│   Task Type: troubleshooting                                                  │
│   File: prompts/troubleshooter-v2.prompt.yaml                                 │
╰──────────────────────────────────────────────────────────────────────────────╯

🔌 Providers (3)

╭──────────────────────────────────────────────────────────────────────────────╮
│ [default]                                                                     │
│   openai-gpt4o: gpt-4o (temp: 0.70, max: 1000)                                │
│                                                                               │
│ [judge]                                                                       │
│   judge-provider: gpt-4o-mini (temp: 0.00, max: 500)                          │
│                                                                               │
│ [selfplay]                                                                    │
│   mock-selfplay: mock-model (temp: 0.80, max: 1000)                           │
╰──────────────────────────────────────────────────────────────────────────────╯

🎭 Self-Play (2 personas, 2 roles)

Personas:
╭──────────────────────────────────────────────────────────────────────────────╮
│ red-team-attacker                                                             │
│ plant-operator                                                                │
╰──────────────────────────────────────────────────────────────────────────────╯
Roles:
╭──────────────────────────────────────────────────────────────────────────────╮
│ attacker (red-team-attacker) → openai-gpt4o                                   │
│ operator (plant-operator) → openai-gpt4o                                      │
╰──────────────────────────────────────────────────────────────────────────────╯

✅ Validation

╭──────────────────────────────────────────────────────────────────────────────╮
│ ✓ Configuration is valid                                                      │
│                                                                               │
│ Connectivity Checks:                                                          │
│   ☑ Tools are used by prompts                                                 │
│   ☑ Unique task types per prompt                                              │
│   ☑ Scenario task types exist                                                 │
│   ☑ Allowed tools are defined                                                 │
│   ☑ Self-play roles have valid providers                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
```
promptarena debug
Show the loaded configuration, prompt packs, scenarios, and providers to help troubleshoot configuration issues.
```sh
promptarena debug [flags]
```
| Flag | Type | Default | Description |
|---|---|---|---|
| -c, --config | string | arena.yaml | Configuration file path |
Examples
```sh
# Debug default configuration
promptarena debug

# Debug specific config
promptarena debug --config test-arena.yaml
```
Use Cases
- Troubleshoot configuration loading issues
- Verify all files are found and parsed correctly
- Check prompt pack assembly
- Validate provider initialization
promptarena prompt-debug
Test prompt generation with specific regions, task types, and contexts. Useful for validating prompt assembly before running full tests.
```sh
promptarena prompt-debug [flags]
```
| Flag | Type | Default | Description |
|---|---|---|---|
| -c, --config | string | arena.yaml | Configuration file path |
| -t, --task-type | string | - | Task type for prompt generation |
| -r, --region | string | - | Region for prompt generation |
| --persona | string | - | Persona ID to test |
| --scenario | string | - | Scenario file path to load task_type and context |
| --context | string | - | Context slot content |
| --user | string | - | User context (e.g., "iOS developer") |
| --domain | string | - | Domain hint (e.g., "mobile development") |
| -l, --list | bool | false | List available regions and task types |
| -j, --json | bool | false | Output as JSON |
| -p, --show-prompt | bool | true | Show the full assembled prompt |
| -m, --show-meta | bool | true | Show metadata and configuration info |
| -s, --show-stats | bool | true | Show statistics (length, tokens, etc.) |
| -v, --verbose | bool | false | Verbose output with debug info |
Examples
```sh
# List available configurations
promptarena prompt-debug --list

# Test prompt generation for task type
promptarena prompt-debug --task-type support

# Test with region
promptarena prompt-debug --task-type support --region us

# Test with persona
promptarena prompt-debug --persona us-hustler-v1

# Test with scenario file
promptarena prompt-debug --scenario scenarios/customer-support.yaml

# Test with custom context
promptarena prompt-debug --task-type support --context "urgent billing issue"

# JSON output for parsing
promptarena prompt-debug --task-type support --json

# Minimal output (just the prompt)
promptarena prompt-debug --task-type support --show-meta=false --show-stats=false
```
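To snapshot assembled prompts across several task types, a small shell loop over the flags above works; the task-type names here are illustrative:

```sh
# Write one prompt-only snapshot per task type
for t in support code vision; do
  promptarena prompt-debug --task-type "$t" \
    --show-meta=false --show-stats=false > "prompt-$t.txt"
done
```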
Output
The command shows:
- Assembled system prompt
- Metadata (task type, region, persona)
- Statistics (character count, estimated tokens)
- Configuration used
Example Output:
```
=== Prompt Debug ===

Task Type: support
Region: us
Persona: default

--- System Prompt ---
You are a helpful customer support agent for TechCo.

Your role:
- Answer product questions
- Help track orders
- Process returns and refunds
...

--- Statistics ---
Characters: 1,234
Estimated Tokens: 308
Lines: 42

--- Metadata ---
Prompt Config: support
Version: v1.0.0
Validators: 3
```
promptarena render
Generate an HTML report from existing test results.
```sh
promptarena render [index.json path] [flags]
```
| Flag | Type | Default | Description |
|---|---|---|---|
| -o, --output | string | report-[timestamp].html | Output HTML file path |
Examples
```sh
# Render from default location
promptarena render out/index.json

# Custom output path
promptarena render out/index.json --output custom-report.html

# Render from archived results
promptarena render archive/2024-01-15/index.json --output reports/jan-15-report.html
```
Use Cases
- Regenerate reports after test runs
- Create reports with different formatting
- Archive and view historical results
- Share results without re-running tests
promptarena completion
Generate shell autocompletion script for bash, zsh, fish, or PowerShell.
```sh
promptarena completion [bash|zsh|fish|powershell]
```
Examples
```sh
# Bash
promptarena completion bash > /etc/bash_completion.d/promptarena

# Zsh
promptarena completion zsh > "${fpath[1]}/_promptarena"

# Fish
promptarena completion fish > ~/.config/fish/completions/promptarena.fish

# PowerShell
promptarena completion powershell > promptarena.ps1
```
Environment Variables
PromptArena respects the following environment variables:
| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API authentication |
| ANTHROPIC_API_KEY | Anthropic API authentication |
| GOOGLE_API_KEY | Google AI API authentication |
| PROMPTARENA_CONFIG | Default configuration file (overrides arena.yaml) |
| PROMPTARENA_OUTPUT | Default output directory (overrides out) |
Example
```sh
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export PROMPTARENA_CONFIG="staging-arena.yaml"
export PROMPTARENA_OUTPUT="test-results"

promptarena run
```
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success - all tests passed |
| 1 | Failure - one or more tests failed or an error occurred |
Check exit code in scripts:
```sh
if promptarena run --ci; then
  echo "✅ Tests passed"
else
  echo "❌ Tests failed"
  exit 1
fi
```
Common Workflows
Local Development
```sh
# Quick test with mock providers
promptarena run --mock-provider

# Test specific feature
promptarena run --scenario new-feature --verbose

# Inspect configuration
promptarena config-inspect --verbose
```
CI/CD Pipeline
```sh
# Run in headless CI mode
promptarena run --ci --format junit,json

# Check specific providers
promptarena run --ci --provider openai,claude --format junit
```
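As a rough sketch of wiring this into a pipeline (see the CI/CD Integration guide for the supported setup), a GitHub Actions step might look like the following; the install step and secret name are assumptions:

```yaml
# Hypothetical workflow step - assumes promptarena is already on PATH
- name: Run PromptArena tests
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: promptarena run --ci --format junit,json
```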
Debugging
```sh
# Validate configuration
promptarena config-inspect

# Debug prompt assembly
promptarena prompt-debug --task-type support --verbose

# Run with verbose logging
promptarena run --verbose --scenario failing-test

# Check configuration loading
promptarena debug
```
Report Generation
```sh
# Run tests
promptarena run --format json

# Later, generate HTML from results
promptarena render out/index.json --output reports/latest.html
```
Multi-Provider Comparison
```sh
# Test all providers
promptarena run --format html,json

# Test specific providers
promptarena run --provider openai,claude,gemini --format html
```
Configuration File
PromptArena uses a YAML configuration file (default: arena.yaml). See the Configuration Reference for complete documentation.
Basic Structure
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: my-arena
spec:
  prompt_configs:
    - id: assistant
      file: prompts/assistant.yaml

  providers:
    - file: providers/openai.yaml

  scenarios:
    - file: scenarios/test.yaml

  defaults:
    output:
      dir: out
      formats: ["json", "html"]
```
Multimodal Content & Media Rendering
PromptArena supports multimodal content (images, audio, video) in test scenarios with comprehensive media rendering in all output formats.
Media Content in Scenarios
Test scenarios can include multimodal content using the parts array:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: image-analysis
spec:
  turns:
    - role: user
      parts:
        - type: text
          patterns: ["What's in this image?"]
        - type: image
          media:
            file_path: test-data/sample.jpg
            detail: high
```
Supported Media Types
Section titled “Supported Media Types”- Images: JPEG, PNG, GIF, WebP
- Audio: MP3, WAV, OGG, M4A
- Video: MP4, WebM, MOV
Media Sources
Media can be loaded from three sources:
1. Local Files
```yaml
- type: image
  media:
    file_path: images/diagram.png
    detail: high
```
2. URLs (fetched during test execution)
```yaml
- type: image
  media:
    url: https://example.com/photo.jpg
    detail: auto
```
3. Inline Base64 Data
```yaml
- type: image
  media:
    data: "iVBORw0KGgoAAAANSUhEUgAAAAUA..."
    mime_type: image/png
    detail: low
```
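To produce the inline payload, you can base64-encode the file yourself; these are standard system tools, not PromptArena commands:

```sh
# GNU coreutils (Linux): -w0 disables line wrapping so the output fits the data: field
base64 -w0 images/diagram.png

# macOS: reads the named input file; output is unwrapped by default
base64 -i images/diagram.png
```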
Media Rendering in Reports
All output formats include media statistics and rendering:
HTML Reports
HTML reports include:
Media Summary Dashboard
- Visual statistics cards showing:
  - Total images, audio, and video files
  - Successfully loaded vs. failed media
  - Total media size in human-readable format
- Media type icons (🖼️ 🎵 🎬)
Media Badges
🖼️ x3  🎵 x2  ✅ 5  ❌ 0  💾 1.2 MB

Media Items Display
- Individual media items with:
  - Type icon and format badge
  - Source (file path, URL, or "inline")
  - MIME type
  - File size
  - Load status (✅ loaded / ❌ error)
Example HTML Output:
```html
<div class="media-summary">
  <div class="stat-card">
    <div class="stat-value">5</div>
    <div class="stat-label">🖼️ Images</div>
  </div>
  <div class="stat-card">
    <div class="stat-value">3</div>
    <div class="stat-label">🎵 Audio</div>
  </div>
  <!-- ... -->
</div>
```
JUnit XML Reports
JUnit XML includes media metadata as test suite properties:
```xml
<testsuite name="image-analysis" tests="1">
  <properties>
    <property name="media.images.total" value="5"/>
    <property name="media.audio.total" value="3"/>
    <property name="media.video.total" value="0"/>
    <property name="media.loaded.success" value="8"/>
    <property name="media.loaded.errors" value="0"/>
    <property name="media.size.total_bytes" value="1245678"/>
  </properties>
  <testcase name="test-001" classname="image-analysis" time="2.34"/>
</testsuite>
```
Property Naming Convention:
- media.{type}.total - Count by media type (images, audio, video)
- media.loaded.success - Successfully loaded media items
- media.loaded.errors - Failed media loads
- media.size.total_bytes - Total size in bytes
These properties are useful for:
- CI/CD metrics and tracking
- Test result analysis
- Media resource monitoring
Markdown Reports
Markdown reports include a media statistics table in the overview section:
```markdown
## 📊 Overview

| Metric | Value |
|--------|-------|
| Tests Run | 6 |
| Passed | 5 ✅ |
| Failed | 1 ❌ |
| Success Rate | 83.3% |
| Total Cost | $0.0245 |
| Total Duration | 12.5s |

### 🎨 Media Content

| Type | Count |
|------|-------|
| 🖼️ Images | 5 |
| 🎵 Audio Files | 3 |
| 🎬 Videos | 0 |
| ✅ Loaded | 8 |
| ❌ Errors | 0 |
| 💾 Total Size | 1.2 MB |
```
Media Loading Options
Control how media is loaded and processed:
HTTP Media Loader
For URL-based media, configure the HTTP loader:
```yaml
spec:
  defaults:
    media:
      http:
        timeout: 30s
        max_file_size: 50MB
```
Local File Paths
Relative paths are resolved from the configuration file directory:
```yaml
# If arena.yaml is in /project/tests/,
# this resolves to /project/tests/images/sample.jpg
- type: image
  media:
    file_path: images/sample.jpg
```
Media Validation
PromptArena validates media content:
Path Security
- Prevents path traversal attacks (.. sequences)
- Validates file paths are within allowed directories
- Checks symlink targets
File Validation
- Verifies MIME types match content types
- Checks file existence
- Validates file sizes against limits
- Ensures files are regular files (not directories)
Error Handling
- Media load failures are captured in test results
- Errors reported in all output formats
- Tests can continue with partial media failures
Examples
Testing Image Analysis
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: product-image-analysis
spec:
  task_type: vision
  turns:
    - role: user
      parts:
        - type: text
          patterns: ["Analyze this product image for defects"]
        - type: image
          media:
            file_path: test-data/product-123.jpg
            detail: high
      assertions:
        - type: content_includes
          patterns: ["quality", "inspection"]
```
Testing Audio Transcription
Section titled “Testing Audio Transcription”apiVersion: promptkit.altairalabs.ai/v1alpha1kind: Scenariometadata: name: audio-transcriptionspec: task_type: transcription turns: - role: user parts: - type: text patterns: ["Transcribe this audio"] - type: audio media: file_path: test-data/meeting-recording.mp3 assertions: - type: content_includes patterns: ["meeting", "agenda"]Mixed Multimodal Content
Section titled “Mixed Multimodal Content”apiVersion: promptkit.altairalabs.ai/v1alpha1kind: Scenariometadata: name: multimodal-analysisspec: turns: - role: user parts: - type: text patterns: ["Compare these media files"] - type: image media: file_path: charts/q1-results.png - type: image media: file_path: charts/q2-results.png - type: audio media: file_path: presentations/summary.mp3Generate Media-Rich Reports
```sh
# Run multimodal tests with all formats
promptarena run --format html,junit,markdown

# HTML report includes interactive media dashboard
open out/report.html

# JUnit XML includes media metrics for CI
cat out/junit.xml | grep "media\."

# Markdown shows media statistics
cat out/results.md
```
Media Statistics in CI/CD
Extract media metrics from JUnit XML:
```sh
# Count total images tested
xmllint --xpath "//property[@name='media.images.total']/@value" out/junit.xml

# Check for media load errors
xmllint --xpath "//property[@name='media.loaded.errors']/@value" out/junit.xml
```
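These values can back a simple CI gate; a minimal sketch in plain shell, reusing the property name shown above:

```sh
# Fail the build if any media failed to load
errors=$(xmllint --xpath "string(//property[@name='media.loaded.errors']/@value)" out/junit.xml)
if [ "${errors:-0}" -gt 0 ]; then
  echo "❌ ${errors} media load error(s)"
  exit 1
fi
```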
Best Practices
File Organization
```
project/
├── arena.yaml
├── test-data/
│   ├── images/
│   │   ├── valid/
│   │   └── invalid/
│   ├── audio/
│   └── video/
└── scenarios/
    └── multimodal-tests.yaml
```
Size Limits
- Keep test media files small (<10MB recommended)
- Use compressed formats (WebP for images, MP3 for audio)
- Consider using thumbnails for image tests
URL Loading
- Use reliable, stable URLs for CI/CD
- Consider local copies for critical tests
- Set appropriate timeouts for remote resources
Assertions
- Validate media is processed in responses (see the sketch after this list)
- Check for expected content types
- Verify quality/accuracy of analysis
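A minimal sketch combining these checks (the validators are defined in the next section; note that examples on this page vary on whether patterns nests under params):

```yaml
assertions:
  - type: content_includes
    params:
      patterns: ["defect", "analysis"]   # response text mentions the analysis
  - type: image_format
    params:
      formats: [png, jpeg]               # returned image is an accepted type
```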
Media Assertions (Phase 1)
Arena provides six specialized media validators to test media content in LLM responses. These assertions validate format, dimensions, duration, and resolution of images, audio, and video outputs.
Image Assertions
Section titled “Image Assertions”image_format
Validates that images in assistant responses match allowed formats.
Parameters:
- formats ([]string, required): List of allowed formats (e.g., png, jpeg, jpg, webp, gif)
Example:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Generate a PNG image of a sunset"]
    assertions:
      - type: image_format
        params:
          formats:
            - png
```
Use Cases:
- Validate model outputs correct image format
- Test format conversion capabilities
- Ensure compatibility with downstream systems
image_dimensions
Validates image dimensions (width and height) in assistant responses.
Parameters:
- width (int, optional): Exact required width in pixels
- height (int, optional): Exact required height in pixels
- min_width (int, optional): Minimum width in pixels
- max_width (int, optional): Maximum width in pixels
- min_height (int, optional): Minimum height in pixels
- max_height (int, optional): Maximum height in pixels
Example:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Create a 1920x1080 wallpaper"]
    assertions:
      # Exact dimensions
      - type: image_dimensions
        params:
          width: 1920
          height: 1080
```
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Generate a thumbnail"]
    assertions:
      # Size range
      - type: image_dimensions
        params:
          min_width: 100
          max_width: 400
          min_height: 100
          max_height: 400
```
Use Cases:
- Validate exact resolution requirements
- Test minimum/maximum size constraints
- Verify thumbnail generation
- Ensure HD/4K resolution compliance
Audio Assertions
Section titled “Audio Assertions”audio_format
Validates audio format in assistant responses.
Parameters:
- formats ([]string, required): List of allowed formats (e.g., mp3, wav, ogg, m4a, flac)
Example:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Generate an audio clip"]
    assertions:
      - type: audio_format
        params:
          formats:
            - mp3
            - wav
```
Use Cases:
- Validate audio output format
- Test format compatibility
- Ensure codec requirements
audio_duration
Validates audio duration in assistant responses.
Parameters:
- min_seconds (float, optional): Minimum duration in seconds
- max_seconds (float, optional): Maximum duration in seconds
Example:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Create a 30-second audio clip"]
    assertions:
      - type: audio_duration
        params:
          min_seconds: 29
          max_seconds: 31
```
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Generate a brief notification sound"]
    assertions:
      - type: audio_duration
        params:
          max_seconds: 5
```
Use Cases:
- Validate exact duration requirements
- Test length constraints
- Verify podcast/music length
- Ensure compliance with platform limits
Video Assertions
Section titled “Video Assertions”video_resolution
Validates video resolution in assistant responses.
Parameters:
- presets ([]string, optional): List of resolution presets
- min_width (int, optional): Minimum width in pixels
- max_width (int, optional): Maximum width in pixels
- min_height (int, optional): Minimum height in pixels
- max_height (int, optional): Maximum height in pixels
Supported Presets:
- 480p, sd - Standard Definition (480 height)
- 720p, hd - HD (720 height)
- 1080p, fhd, full_hd - Full HD (1080 height)
- 1440p, 2k, qhd - QHD (1440 height)
- 2160p, 4k, uhd - 4K Ultra HD (2160 height)
- 4320p, 8k - 8K (4320 height)
Example with Presets:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Generate a 1080p video"]
    assertions:
      - type: video_resolution
        params:
          presets:
            - 1080p
            - fhd
```
Example with Dimensions:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Create a high-resolution video"]
    assertions:
      - type: video_resolution
        params:
          min_width: 1920
          min_height: 1080
```
Use Cases:
- Validate exact resolution requirements
- Test HD/4K compliance
- Verify minimum quality standards
- Validate aspect ratios
video_duration
Validates video duration in assistant responses.
Parameters:
- min_seconds (float, optional): Minimum duration in seconds
- max_seconds (float, optional): Maximum duration in seconds
Example:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Create a 1-minute video clip"]
    assertions:
      - type: video_duration
        params:
          min_seconds: 59
          max_seconds: 61
```
Use Cases:
- Validate exact duration requirements
- Test length constraints
- Verify platform compliance (e.g., TikTok 60s limit)
- Ensure streaming segment sizes
Combining Media Assertions
You can combine multiple media assertions on a single turn:
```yaml
turns:
  - role: user
    parts:
      - type: text
        patterns: ["Create a 30-second 4K video in MP4 format"]
    assertions:
      # Validate format (if you add a video_format validator)
      - type: content_includes
        params:
          patterns: ["video"]

      # Validate resolution
      - type: video_resolution
        params:
          presets:
            - 4k
            - uhd

      # Validate duration
      - type: video_duration
        params:
          min_seconds: 29
          max_seconds: 31
```
Complete Example Scenario
Section titled “Complete Example Scenario”apiVersion: promptkit.altairalabs.ai/v1alpha1kind: Scenariometadata: name: media-validation-complete description: Comprehensive media validation testingspec: provider: gpt-4-vision
turns: # Image validation - role: user parts: - type: text patterns: ["Generate a 1920x1080 PNG wallpaper"]
assertions: - type: image_format params: formats: [png]
- type: image_dimensions params: width: 1920 height: 1080
# Audio validation - role: user parts: - type: text patterns: ["Create a 10-second MP3 audio clip"]
assertions: - type: audio_format params: formats: [mp3]
- type: audio_duration params: min_seconds: 9 max_seconds: 11
# Video validation - role: user parts: - type: text patterns: ["Generate a 30-second 4K video"]
assertions: - type: video_resolution params: presets: [4k, uhd]
- type: video_duration params: min_seconds: 29 max_seconds: 31Media Assertion Best Practices
Format Validation
- Always specify multiple acceptable formats when possible
- Use lowercase format names for consistency
- Test format conversion capabilities
Dimension/Resolution Testing
- Use min/max ranges to allow for encoding variations
- Test common aspect ratios (16:9, 4:3, 9:16)
- Validate minimum quality standards
Duration Testing
- Allow small tolerance ranges (±1-2 seconds)
- Test edge cases (very short/long durations)
- Verify platform-specific limits
Performance
- Media assertions execute on assistant responses only
- No API calls are made for validation
- Assertions run in parallel with other validators
Example Test Scenarios
See complete examples in examples/arena-media-test/:
- image-validation.yaml - Image format and dimension testing
- audio-validation.yaml - Audio format and duration testing
- video-validation.yaml - Video resolution and duration testing
Tips & Best Practices
Execution Performance
```sh
# Increase concurrency for faster execution
promptarena run --concurrency 10

# Reduce concurrency for stability
promptarena run --concurrency 1
```
Cost Control
```sh
# Use mock provider during development
promptarena run --mock-provider

# Test with cheaper models first
promptarena run --provider gpt-3.5-turbo
```
Reproducibility
```sh
# Always use same seed for consistent results
promptarena run --seed 42

# Document seed in test reports
promptarena run --seed 42 --format json,html
```
Debugging Tips
```sh
# Always start with config validation
promptarena config-inspect --verbose

# Use verbose mode to see API calls
promptarena run --verbose --scenario problematic-test

# Test prompt generation separately
promptarena prompt-debug --scenario scenarios/test.yaml
```
Next Steps
- PromptArena Getting Started - First project walkthrough
- Configuration Reference - Complete config documentation
- CI/CD Integration - Running in pipelines
Need Help?
```sh
# General help
promptarena --help

# Command-specific help
promptarena run --help
promptarena config-inspect --help
```