PromptArena
Comprehensive testing framework for validating LLM prompts across multiple providers
What is PromptArena?
PromptArena is a powerful testing tool that helps you:
- Test prompts systematically across OpenAI, Anthropic, Google, and more
- Compare provider performance side-by-side with detailed metrics
- Validate conversation flows with multi-turn testing scenarios
- Integrate with CI/CD to catch prompt regressions before production
- Generate comprehensive reports with HTML, JSON, and markdown output
Quick Start
Get up and running in 60 seconds with the interactive project generator:
```sh
# Install PromptKit (includes PromptArena)
brew install promptkit

# Or with Go
go install github.com/AltairaLabs/PromptKit/tools/arena@latest

# Create a new test project instantly
promptarena init my-test --quick

# Choose your provider when prompted:
# • mock - No API calls, instant testing
# • openai - OpenAI GPT models
# • anthropic - Claude models
# • google - Gemini models

# Or use a built-in template for common use cases:
# • basic-chatbot - Simple conversational testing
# • customer-support - Support agent with tools
# • code-assistant - Code generation & review
# • content-generation - Creative content testing
# • multimodal - Image/audio/video AI
# • mcp-integration - MCP server testing

# Run your first test
cd my-test
promptarena run
```

That’s it! The init command creates:
- ✅ Complete Arena configuration
- ✅ Provider setup (ready to use)
- ✅ Sample test scenario
- ✅ Working prompt configuration
- ✅ README with next steps
Need More Control?
Use interactive mode for custom configuration:

```sh
promptarena init my-project
# Answer prompts to customize:
# - Project name and description
# - Provider selection
# - System prompt customization
# - Test scenario setup
```

Or skip the wizard and create files manually (see below).
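For manual setup, a minimal top-level config might look like the sketch below. This is illustrative only, assembled from the Arena resource format shown later on this page under Key Features; the file names and the provider path are assumptions, not requirements:

```yaml
# arena.yaml - top-level test configuration (illustrative sketch)
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: my-project
spec:
  providers:
    - path: ./providers/openai.yaml   # provider config you supply
  scenarios:
    - path: ./scenarios/hello.yaml    # a scenario file in the Scenario format
```

With the files in place, running `promptarena run` from the project directory executes the suite, as in the quick start above.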
Next: Your First Arena Test Tutorial
Documentation by Type
📚 Tutorials (Learn by Doing)
Step-by-step guides that teach you Arena through hands-on exercises:
- Your First Test - Get started in 5 minutes
- Multi-Provider Testing - Compare providers
- Multi-Turn Conversations - Test conversation flows
- MCP Tool Integration - Test with tool calling
- CI/CD Integration - Automate testing
🔧 How-To Guides (Accomplish Specific Tasks)
Focused guides for specific Arena tasks:
- Installation - Get Arena running
- Write Test Scenarios - Effective scenario design
- Configure Providers - Provider setup
- Use Mock Providers - Test without API calls
- Validate Outputs - Assertion strategies
- Customize Reports - Report formatting
- Integrate CI/CD - GitHub Actions, GitLab CI
- Session Recording - Capture and replay sessions
💡 Explanation (Understand the Concepts)
Deep dives into Arena’s design and philosophy:
- Testing Philosophy - Why test prompts?
- Scenario Design - Effective test patterns
- Provider Comparison - Evaluate providers
- Validation Strategies - Assertion best practices
- Session Recording - Recording architecture and replay
📖 Reference (Look Up Details)
Complete technical specifications:
- CLI Commands - All Arena commands
- Configuration Schema - Config file format
- Scenario Format - Test scenario structure
- Assertions - All assertion types
- Validators - Built-in validators
- Output Formats - Report formats
Key Features
Multi-Provider Testing
Test the same prompt across different LLM providers simultaneously:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: cross-provider-test
spec:
  providers:
    - path: ./providers/openai.yaml
    - path: ./providers/claude.yaml
    - path: ./providers/gemini.yaml
  scenarios:
    - path: ./scenarios/quantum-test.yaml
      providers: [openai-gpt4, claude-sonnet, gemini-pro]
```

Rich Assertions
Validate outputs with powerful assertions:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: quantum-test
spec:
  turns:
    - role: user
      content: "Explain quantum computing"
      assertions:
        - type: content_includes
          params:
            patterns: ["quantum"]
          message: "Should mention quantum"
        - type: content_matches
          params:
            pattern: "(qubit|superposition|entanglement)"
          message: "Should mention key quantum concepts"
```

Performance Metrics
Automatically track:
- Response time (latency)
- Token usage (input/output)
- Cost estimation
- Success/failure rates
CI/CD Integration
Run tests in your pipeline:
```yaml
# .github/workflows/test-prompts.yml
- name: Test Prompts
  run: promptarena run --ci --fail-on-error
```

Use Cases
For Prompt Engineers
- Develop and refine prompts with confidence
- A/B test different prompt variations
- Ensure consistency across providers
- Track performance over time
For QA Teams
- Validate prompt quality before deployment
- Catch regressions in prompt behavior
- Test edge cases and failure modes
- Generate test reports for stakeholders
For ML Ops
- Integrate prompt testing into CI/CD
- Monitor prompt performance
- Compare provider costs and quality
- Automate regression testing
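The CI/CD bullets above can be sketched as a complete GitHub Actions workflow. Only the `promptarena run --ci --fail-on-error` invocation and the Go install path come from this page; the trigger, job structure, and action versions are ordinary Actions boilerplate and should be adapted to your repository:

```yaml
# .github/workflows/test-prompts.yml (illustrative sketch)
name: Test Prompts
on:
  pull_request:        # catch prompt regressions before merge
jobs:
  arena:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      # Install the Arena CLI (install path from the quick start above)
      - run: go install github.com/AltairaLabs/PromptKit/tools/arena@latest
      # Fail the job if any scenario assertion fails
      - run: promptarena run --ci --fail-on-error
```

Running on `pull_request` keeps failing prompts out of the main branch; add a `push` trigger if you also want post-merge monitoring.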
Examples
Real-world Arena testing scenarios:
- Customer Support Testing - Multi-turn support conversations
- MCP Chatbot Testing - Tool calling validation
- Guardrails Testing - Safety and compliance checks
- Multi-Provider Comparison - Provider evaluation
Common Workflows
Development Workflow
1. Write prompt → 2. Create test → 3. Run Arena → 4. Refine → 5. Repeat
CI/CD Workflow
1. Push changes → 2. Arena runs automatically → 3. Tests must pass → 4. Deploy
Provider Evaluation
1. Define test suite → 2. Run across providers → 3. Compare results → 4. Choose best
Getting Help
- Quick Start: Getting Started Guide
- Questions: GitHub Discussions
- Issues: Report a Bug
- Examples: Arena Examples
Related Tools
- PackC: Compile tested prompts for production
- SDK: Use tested prompts in applications
- Complete Workflow: See all tools together