LLM Testing Philosophy
Understanding the principles and rationale behind PromptArena’s approach to LLM testing.
Why Test LLMs Differently?
Traditional software testing assumes deterministic behavior: given the same input, you get the same output. LLMs break this assumption.
The LLM Testing Challenge
Traditional Testing:

```
input("2+2") → output("4")   // Always
```

LLM Testing:

```
input("Greet the user") → output("Hello! How can I help?")
                        → output("Hi there! What can I do for you?")
                        → output("Greetings! How may I assist you today?")
```

Each response is valid but different. This requires a fundamentally different testing approach.
Core Testing Principles
1. Behavioral Testing Over Exact Matching
Instead of testing for exact outputs, test for desired behaviors:
```yaml
# ❌ Brittle: Exact match
assertions:
  - type: content_matches
    params:
      pattern: "^Thank you for contacting AcmeCorp support\\.$"
      message: "Exact wording required"
```
```yaml
# ✅ Robust: Behavior validation
assertions:
  - type: content_includes
    params:
      patterns: ["thank", "AcmeCorp", "support"]
      message: "Must acknowledge contact"
  - type: llm_judge
    params:
      criteria: "Response has a professional tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be professional"
  - type: llm_judge
    params:
      criteria: "Response has a positive sentiment"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be positive"
```

Why: LLMs generate varied responses. Testing behavior allows flexibility while ensuring quality.
2. Multi-Dimensional Quality
LLM quality isn’t binary (pass/fail). It’s multi-dimensional:
- Correctness: Factually accurate?
- Relevance: Addresses the query?
- Tone: Appropriate style?
- Safety: No harmful content?
- Consistency: Maintains context?
- Performance: Fast enough?
```yaml
assertions:
  - type: content_includes     # Correctness
    params:
      patterns: ["30-day return"]
      message: "Must mention return policy"

  - type: llm_judge            # Tone
    params:
      criteria: "Response is helpful and supportive"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be helpful"

  - type: content_matches      # Safety (negative lookahead)
    params:
      pattern: "^(?!.*(offensive|inappropriate)).*$"
      message: "Must not contain inappropriate content"

  - type: is_valid_json        # Format
    params:
      message: "Response must be valid JSON"
```

3. Comparative Testing
Since absolute correctness is elusive, compare:
- Across providers: OpenAI vs. Claude vs. Gemini
- Across versions: GPT-4 vs. GPT-4o-mini
- Across time: Regression detection
- Against baselines: Human evaluation benchmarks
```yaml
# Test same scenario across providers
providers: [openai-gpt4, claude-sonnet, gemini-pro]

# Compare results
# Which handles edge cases better?
# Which is faster?
# Which is more cost-effective?
```

4. Contextual Validation
Context matters in LLM testing:
```yaml
# Same question, different contexts
turns:
  - name: "Technical Support Context"
    context:
      user_type: "developer"
      urgency: "high"
    turns:
      - role: user
        content: "How do I fix this error?"
    assertions:
      - type: content_includes
        params:
          patterns: ["code", "debug", "solution"]

  - name: "General Inquiry Context"
    context:
      user_type: "general"
      urgency: "low"
    turns:
      - role: user
        content: "How do I fix this error?"
    assertions:
      - type: content_includes
        params:
          patterns: ["help", "guide", "steps"]
          message: "Must provide helpful guidance"
      - type: llm_judge
        params:
          criteria: "Response is beginner-friendly and easy to understand"
          judge_provider: "openai/gpt-4o-mini"
          message: "Must be beginner-friendly"
```

5. Failure is Data
In LLM testing, failures aren’t just bugs—they’re learning opportunities:
- Pattern detection: What types of queries fail?
- Edge case discovery: Where do models struggle?
- Quality tracking: How does performance change over time?
- Provider insights: Which model handles what best?
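One practical way to make failures analyzable is to label scenarios consistently, so failed runs can be grouped and trended by category. Below is a minimal sketch of that idea: the scenario shape mirrors the human-review example later on this page, while the tag taxonomy itself (edge-case, refund-policy, regression-watch) is purely illustrative.

```yaml
# Illustrative only: a hypothetical tagging convention so failed runs can be
# sliced by category (e.g. "which edge-case scenarios regressed this week?").
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: refund-outside-window
spec:
  task_type: test
  description: "Refund request outside the 30-day window"
  tags: [edge-case, refund-policy, regression-watch]   # hypothetical taxonomy
  turns:
    - role: user
      content: "I bought this 45 days ago. Can I still return it?"
```

With a consistent taxonomy, “what types of queries fail?” reduces to reading the pass rate per tag.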
Testing Strategies
Layered Testing Pyramid
```
┌──────────────┐
│ Exploratory  │  Manual testing, edge cases
│ Testing      │
├──────────────┤
│ Integration  │  Multi-turn, complex scenarios
│ Tests        │
├──────────────┤
│ Scenario     │  Single-turn, common patterns
│ Tests        │
├──────────────┤
│ Smoke        │  Basic functionality, mock providers
│ Tests        │
└──────────────┘
```

Implementation:

- Smoke Tests (Fast, Mock)
  - Validate configuration
  - Test scenario structure
  - Verify assertions work
  - Run in < 30 seconds

- Scenario Tests (Common Cases)
  - Core user journeys
  - Expected inputs
  - Standard behaviors
  - Run in < 5 minutes

- Integration Tests (Complex)
  - Multi-turn conversations
  - Tool calling
  - Edge cases
  - Run in < 20 minutes

- Exploratory (Human-in-loop)
  - Adversarial testing
  - Creative edge cases
  - Quality assessment
  - Ongoing
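The lower tiers of this pyramid map naturally onto CI. Here is a hedged sketch of one possible wiring, assuming a GitHub Actions-style workflow: the two promptarena commands are the ones shown under “Why Mock Providers?” below, while everything else (job names, trigger, the omitted install and API-key setup) is illustrative.

```yaml
# Illustrative CI wiring for the smoke and scenario tiers.
# Assumes promptarena is already installed on the runner and that API-key
# secrets are configured (both steps omitted here for brevity).
name: llm-tests
on: [pull_request]

jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Tier 1: fast, free structural validation on every pull request
      - run: promptarena run --mock-provider

  scenarios:
    needs: smoke              # only spend API budget once smoke tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Tier 2: real-provider checks on common cases
      - run: promptarena run --provider openai-gpt4o-mini
```

Given their runtime and cost, the integration and exploratory tiers are usually better run on a schedule or triggered manually rather than on every pull request.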
Progressive Validation
Start simple, add complexity:
```yaml
# Level 1: Structural
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Response must not be empty"
  - type: is_valid_json        # If expecting JSON
    params:
      message: "Must be valid JSON"
```

```yaml
# Level 2: Content
assertions:
  - type: content_includes
    params:
      patterns: ["key information"]
      message: "Must contain key information"
  - type: content_matches
    params:
      pattern: "^.{50,}$"
      message: "Must be at least 50 characters"
```

```yaml
# Level 3: Quality
assertions:
  - type: llm_judge
    params:
      criteria: "Response has appropriate sentiment and tone for the context"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must have appropriate quality"
  - type: llm_judge
    params:
      criteria: "Response maintains a professional tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be professional"
```

```yaml
# Level 4: Custom Business Logic
# Note: Custom validators would need to be implemented as extensions.
# For now, use pattern matching or LLM judge for business rules.
assertions:
  - type: llm_judge
    params:
      criteria: "Response complies with brand guidelines and voice"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must meet brand compliance"
```

Design Decisions
Why PromptPack Format?
PromptArena uses the PromptPack specification for test scenarios. Why?
Portability: Test scenarios work across:
- Different testing tools
- Different providers
- Different environments
Version Control: YAML format means:
- Git-friendly diffs
- Code review workflows
- Change tracking
Human Readable: Non-developers can:
- Write test scenarios
- Review test cases
- Understand failures
Why Provider Abstraction?
PromptArena abstracts provider differences:
```yaml
# Same scenario, different providers
providers:
  - type: openai
    model: gpt-4o
  - type: anthropic
    model: claude-3-5-sonnet
  - type: google
    model: gemini-1.5-pro
```

Benefits:
- Test portability across providers
- Easy provider switching
- Cost optimization
- Vendor independence
Why Declarative Assertions?
Instead of code, use declarations:
```yaml
# Declarative (PromptArena)
assertions:
  - type: content_includes
    params:
      patterns: ["customer service"]
      message: "Must mention customer service"
  - type: llm_judge
    params:
      criteria: "Response has positive sentiment"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be positive"
```

```
# vs. Imperative (traditional)
# assert "customer service" in response
# assert analyze_sentiment(response) == "positive"
```

Advantages:
- Non-programmers can write tests
- Consistent validation across scenarios
- Easier to maintain
- Better reporting
Why Mock Providers?
Mock providers enable:
- Fast Development: Test configuration without API calls
- Cost Control: Iterate without spending
- Deterministic Testing: Predictable responses
- Offline Development: Work without internet
- CI/CD Efficiency: Fast pipeline validation
```bash
# Validate structure (< 10 seconds, $0)
promptarena run --mock-provider

# Validate quality (~ 5 minutes, ~$0.05)
promptarena run --provider openai-gpt4o-mini
```

Anti-Patterns to Avoid
❌ Over-Specification
```yaml
# Too rigid
assertions:
  - type: content_matches
    params:
      pattern: "^Thank you for contacting support\\. Our business hours are 9am-5pm\\.$"
      message: "Exact match required"
```

Problem: Brittle. Any wording change breaks the test.
Better:
```yaml
assertions:
  - type: content_includes
    params:
      patterns: ["thank", "support", "business hours"]
      message: "Must acknowledge support contact"
  - type: llm_judge
    params:
      criteria: "Response has a professional tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be professional"
```

❌ Under-Specification
```yaml
# Too loose
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Must not be empty"
```

Problem: Accepts any garbage output.
Better:
```yaml
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Must not be empty"
  - type: content_includes
    params:
      patterns: ["relevant", "keywords"]
      message: "Must contain relevant content"
  - type: content_matches
    params:
      pattern: "^.{50,}$"
      message: "Must be at least 50 characters"
  - type: llm_judge
    params:
      criteria: "Response is appropriate and helpful for the context"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be appropriate"
```

❌ Flaky Tests
```yaml
# Assumes specific response structure
assertions:
  - type: content_matches
    params:
      pattern: "^Hello.*\\nHow can I help\\?$"
      message: "Exact format required"
```

Problem: LLMs vary formatting.
Better:
```yaml
assertions:
  - type: content_includes
    params:
      patterns: ["hello", "help"]
      message: "Must greet and offer help"
  - type: llm_judge
    params:
      criteria: "Response has a welcoming tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be welcoming"
```

❌ Testing Implementation, Not Behavior
```yaml
# Tests how, not what - too implementation-focused
assertions:
  - type: tools_called
    params:
      tools: ["calculate"]
      message: "Must use calculator tool"

conversation_assertions:
  - type: tool_calls_with_args
    params:
      tool: "calculate"
      expected_args:
        operation: "multiply"
        x: 2
        y: 2
      message: "Must pass exact args"
```

Problem: Couples test to implementation details.
Better:
```yaml
# Tests outcome - focuses on behavior
assertions:
  - type: content_includes
    params:
      patterns: ["4"]
      message: "Must provide correct answer"
  - type: llm_judge
    params:
      criteria: "Response correctly states that 2 times 2 equals 4"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be factually correct"
```

Quality Metrics
What to Measure
Primary Metrics:
- Pass Rate: Percentage of assertions passing
- Response Time: Latency of responses
- Cost: API spending per test run
- Coverage: Scenarios tested vs. total scenarios
Secondary Metrics:
- Failure Patterns: Which types of tests fail most?
- Provider Comparison: Which model performs best?
- Regression Detection: Are we improving or degrading?
- Edge Case Coverage: How many corner cases tested?
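To make these metrics concrete, here is a purely hypothetical run summary. It is not PromptArena’s actual report format, only an illustration of the kind of numbers worth tracking from run to run:

```yaml
# Hypothetical run summary - illustrative only, not a real PromptArena artifact
run_summary:
  pass_rate: 0.93              # 186 of 200 assertions passed
  avg_response_time_s: 1.8
  total_cost_usd: 0.42
  scenarios_run: 48            # coverage: executed vs. defined
  scenarios_defined: 52
  failures_by_tag:             # secondary metric: failure patterns
    edge-case: 9
    refund-policy: 3
  pass_rate_by_provider:       # secondary metric: provider comparison
    openai-gpt4o: 0.95
    claude-sonnet: 0.91
```

Tracked over time, the per-tag and per-provider breakdowns answer the secondary questions directly.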
Setting Thresholds
```yaml
# Quality gates
quality_gates:
  min_pass_rate: 0.95          # 95% of assertions must pass
  max_response_time: 3         # 3 seconds max
  max_cost_per_run: 0.50       # $0.50 per test run
  min_scenarios: 50            # At least 50 scenarios
```

Testing in Production
A/B Testing LLM Changes
```yaml
# Test new prompt vs. old prompt
turns:
  - name: "Baseline Prompt"
    prompt_version: "v1.0"
    baseline: true

  - name: "Candidate Prompt"
    prompt_version: "v2.0"
    compare_to_baseline: true
    improvement_threshold: 0.05   # 5% better
```

Monitoring and Alerting
```yaml
# Continuous testing
schedule: "0 */6 * * *"   # Every 6 hours

alerts:
  - condition: pass_rate < 0.90
    action: notify_team

  - condition: response_time > 5
    action: page_oncall

  - condition: cost > daily_budget
    action: disable_tests
```

The Human Factor
When to Use Human Evaluation
LLMs require human judgment for:
- Subjective quality: Is this response “good”?
- Creative content: Is this engaging/interesting?
- Nuanced errors: Technically correct but contextually wrong
- Benchmark creation: Ground truth for automated tests
Hybrid Approach:
```
Human Eval → Ground Truth → Automated Tests → Continuous Validation
```

Human-in-the-Loop Testing
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: requires-human-review
spec:
  task_type: test
  description: "Requires Human Review"
  tags: [human-review]
  turns:
    - role: user
      content: "Complex ethical question"
      human_evaluation:
        required: true
        criteria:
          - appropriateness
          - thoughtfulness
          - ethical_handling
```

Conclusion
LLM testing is fundamentally different from traditional testing:
- Embrace non-determinism: Test behaviors, not exact outputs
- Think multi-dimensionally: Quality has many facets
- Compare relatively: Benchmark against alternatives
- Iterate continuously: Quality improves over time
- Balance automation and human judgment: Both are essential
PromptArena embodies these principles, providing a framework for robust, maintainable LLM testing that scales from development to production.
Further Reading
- Scenario Design Principles - How to structure effective test scenarios
- Provider Comparison Guide - Understanding provider differences
- Validation Strategies - Choosing the right assertions