Tutorial 2: Multi-Provider Testing
Learn how to test the same scenario across multiple LLM providers and compare their responses.
What You’ll Learn
- Configure multiple LLM providers (OpenAI, Claude, Gemini)
- Run the same test across all providers
- Compare provider responses
- Understand provider-specific behaviors
Prerequisites
- Completed Tutorial 1: Your First Test
- API keys for providers you want to test (at least 2)
Why Multi-Provider Testing?
Different LLM providers have unique strengths:
- Response style: Formal vs. conversational
- Accuracy: Factual correctness varies
- Speed: Response time differences
- Cost: Pricing varies significantly
- Capabilities: Tool calling, vision, etc.
Testing across providers helps you:
- Choose the best model for your use case
- Validate consistency across providers
- Build fallback strategies
- Optimize cost vs. quality
Step 1: Get API Keys
You’ll need API keys for the providers you want to test:
OpenAI
Visit platform.openai.com
Anthropic (Claude)
Visit console.anthropic.com
Google (Gemini)
Visit aistudio.google.com
Step 2: Set Up Environment
```bash
# Add all API keys to your environment
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."

# Or add to ~/.zshrc for persistence
cat >> ~/.zshrc << 'EOF'
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
EOF

source ~/.zshrc
```
Step 3: Configure Multiple Providers
Create provider configurations:
OpenAI
providers/openai.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: openai-mini
  labels:
    provider: openai
spec:
  type: openai
  model: gpt-4o-mini
  defaults:
    temperature: 0.7
    max_tokens: 500
```
Anthropic Claude
providers/claude.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: claude-sonnet
  labels:
    provider: anthropic
spec:
  type: anthropic
  model: claude-3-5-sonnet-20241022
  defaults:
    temperature: 0.7
    max_tokens: 500
```
Google Gemini
providers/gemini.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: gemini-flash
  labels:
    provider: google
spec:
  type: gemini
  model: gemini-1.5-flash
  defaults:
    temperature: 0.7
    max_tokens: 500
```
Step 4: Create a Comparison Test
Create scenarios/customer-support.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: customer-support
  labels:
    category: customer-service
    priority: comparison
spec:
  task_type: support
  turns:
    - role: user
      content: "I'm having trouble logging into my account. Can you help?"
      assertions:
        - type: content_includes
          params:
            patterns: ["account"]
            message: "Should acknowledge account issue"
        - type: content_length
          params:
            max: 300
            message: "Keep response concise"
    - role: user
      content: "I've tried resetting my password but didn't receive an email."
      assertions:
        - type: content_includes
          params:
            patterns: ["email"]
            message: "Should address email issue"
```
Step 5: Update Arena Configuration
Edit arena.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: multi-provider-test
spec:
  prompt_configs:
    - id: support
      file: prompts/support.yaml
  providers:
    - file: providers/openai.yaml
    - file: providers/claude.yaml
    - file: providers/gemini.yaml
  scenarios:
    - file: scenarios/customer-support.yaml
```
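Before running anything, it can help to confirm that all three provider files actually load. The `promptarena config-inspect` command shown under Common Issues below lists the loaded configuration:

```bash
# Should list openai-mini, claude-sonnet, and gemini-flash
promptarena config-inspect
```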
Step 6: Run Multi-Provider Tests

```bash
# Run tests across ALL configured providers
promptarena run
```

Output:

```text
🚀 PromptArena Starting...

Loading configuration...
  ✓ Loaded 1 prompt config
  ✓ Loaded 3 providers (openai-mini, claude-sonnet, gemini-flash)
  ✓ Loaded 1 scenario

Running tests (3 providers × 1 scenario × 2 turns = 6 test executions)...
  ✓ Customer Support - Turn 1 [openai-mini] (1.2s)
  ✓ Customer Support - Turn 1 [claude-sonnet] (1.5s)
  ✓ Customer Support - Turn 1 [gemini-flash] (0.8s)
  ✓ Customer Support - Turn 2 [openai-mini] (1.3s)
  ✓ Customer Support - Turn 2 [claude-sonnet] (1.4s)
  ✓ Customer Support - Turn 2 [gemini-flash] (0.9s)

Results by Provider:
  openai-mini:    2/2 passed (100%)
  claude-sonnet:  2/2 passed (100%)
  gemini-flash:   2/2 passed (100%)

Overall: 6/6 passed (100%)
```
Step 7: Generate Comparison Report

```bash
# Generate HTML report with all provider results
promptarena run --format html

# Open the report
open out/report-*.html
```

The HTML report shows side-by-side provider responses for easy comparison.
Step 8: Test Specific Providers
Sometimes you want to test just one or two providers:
```bash
# Test only OpenAI
promptarena run --provider openai-mini

# Test OpenAI and Claude
promptarena run --provider openai-mini,claude-sonnet

# Test everything except Gemini
promptarena run --provider openai-mini,claude-sonnet
```
Analyzing Provider Differences
Response Style Comparison
Create scenarios/style-test.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: style-test
spec:
  task_type: support
  turns:
    - role: user
      content: "Explain how your product works"
      assertions:
        - type: content_includes
          params:
            patterns: ["feature"]
            message: "Should explain features"
        - type: content_length
          params:
            min: 50
            max: 500
            message: "Response should be substantial but not excessive"
```
Run and compare:
```bash
promptarena run --scenario style-test --format json

# View detailed responses
cat out/results.json | jq '.results[] | {provider: .provider, response: .response}'
```
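A quick way to quantify style differences is average response length per provider. This sketch reuses the `provider` and `response` fields from the jq command above; adjust the paths if your results.json is shaped differently:

```bash
# Average response length (characters) per provider
cat out/results.json | jq '
  .results
  | group_by(.provider)
  | map({
      provider: .[0].provider,
      avg_chars: (map(.response | length) | add / length)
    })'
```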
Performance Comparison
Check response times:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: performance-test
spec:
  task_type: support
  turns:
    - role: user
      content: "Quick question: what's your return policy?"
      assertions:
        # hypothetical assertion name; substitute your latency assertion type
        - type: response_time
          params:
            max_seconds: 2
            message: "All providers should respond quickly"
```
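To see individual timings rather than a pass/fail summary, you can list turns sorted slowest-first. This sketch assumes the same `response_time` field used in the jq example under Cost Analysis below:

```bash
# Per-turn timings, slowest first
cat out/results.json | jq '
  [.results[] | {provider, response_time}]
  | sort_by(-.response_time)'
```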
Cost Analysis
Different providers have different pricing:
| Provider | Model | Cost (per 1M tokens) |
|---|---|---|
| OpenAI | gpt-4o-mini | Input: $0.15, Output: $0.60 |
| Anthropic | claude-3-5-sonnet | Input: $3.00, Output: $15.00 |
| Google | gemini-1.5-flash | Input: $0.075, Output: $0.30 |
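For example, one million input tokens plus one million output tokens comes to roughly $0.75 on gpt-4o-mini, $18.00 on claude-3-5-sonnet, and about $0.38 on gemini-1.5-flash, a 24-48x spread for identical traffic.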
Generate a cost report:
```bash
promptarena run --format json

# Summarize results per provider (example with jq)
cat out/results.json | jq '
  .results
  | group_by(.provider)
  | map({
      provider: .[0].provider,
      total_turns: length,
      avg_response_time: (map(.response_time) | add / length)
    })'
```
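To turn that summary into an actual cost estimate you need token counts. If your results.json records usage data (the `usage.input_tokens` and `usage.output_tokens` paths below are assumptions; verify against your output), a sketch:

```bash
# Sum token usage per provider, then multiply by the rates in the table above
# (the .usage.* field names are assumed -- adjust to match your results.json)
cat out/results.json | jq '
  .results
  | group_by(.provider)
  | map({
      provider: .[0].provider,
      input_tokens: (map(.usage.input_tokens // 0) | add),
      output_tokens: (map(.usage.output_tokens // 0) | add)
    })'
```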
Advanced: Provider-Specific Tests
Test provider-specific features:
Testing Structured Output (OpenAI)
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: json-response-test
  labels:
    provider-specific: openai
spec:
  task_type: support
  turns:
    - role: user
      content: "Return user info as JSON with name and email"
      assertions:
        - type: is_valid_json
          params:
            message: "Response should be valid JSON"
```
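Because this scenario targets OpenAI's structured output, combining the --scenario and --provider flags from the earlier steps should let you run it in isolation:

```bash
promptarena run --scenario json-response-test --provider openai-mini
```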
Testing Long Context (Claude)

```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: long-context-test
  labels:
    provider-specific: claude
spec:
  task_type: support
  description: "Claude excels at long context"
  turns:
    - role: user
      content: "Summarize this document"
      assertions:
        - type: content_includes
          params:
            patterns: ["key points"]
            message: "Should identify key points"
```
Best Practices
1. Keep Parameters Consistent
Use the same temperature and max_tokens across providers for fair comparison:
```yaml
# All providers
spec:
  defaults:
    temperature: 0.7
    max_tokens: 500
```
2. Provider-Agnostic Assertions
Write assertions that work across all providers:
```yaml
# ✅ Good - flexible
assertions:
  - type: content_includes
    params:
      patterns: ["help"]
      message: "Should offer help"

# ❌ Avoid - too specific to one provider's style
assertions:
  - type: content_includes
    params:
      patterns: ["I'd be happy to help you with that!"]
```
3. Use Meaningful Names
Give providers descriptive names:
```yaml
# providers/openai-creative.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: openai-creative
spec:
  type: openai
  model: gpt-4o-mini
  defaults:
    temperature: 0.9
```

```yaml
# providers/openai-precise.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: openai-precise
spec:
  type: openai
  model: gpt-4o-mini
  defaults:
    temperature: 0.1
```

Test configuration variants:
```bash
promptarena run --provider openai-creative,openai-precise
```
4. Document Provider Behavior
Add comments to your scenarios:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: customer-support-response
  annotations:
    notes: |
      Claude tends to be more verbose.
      OpenAI is more concise, Gemini is fastest.
spec:
  task_type: support
  turns:
    - role: user
      content: "Help with order tracking"
```
Common Issues
Missing API Key
```bash
# Verify keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY
echo $GOOGLE_API_KEY
```
Provider Not Found
```bash
# Check provider configuration
promptarena config-inspect

# Should list all providers
```
Rate Limiting
```bash
# Reduce concurrency to avoid rate limits
promptarena run --concurrency 1

# Or test one provider at a time
promptarena run --provider openai-mini
```
Next Steps
You now know how to test across multiple providers!
Continue learning:
- Tutorial 3: Multi-Turn Conversations - Build complex dialog flows
- Tutorial 4: MCP Tools - Test tool/function calling
- How-To: Configure Providers - Advanced provider setup
Try this:
- Add more providers (Azure OpenAI, Groq, etc.)
- Create provider-specific test suites
- Build a cost optimization analysis
- Test the same prompt across different model versions
What’s Next?
In Tutorial 3, you’ll learn how to create multi-turn conversation tests that maintain context across multiple exchanges.