
LLM Testing Philosophy

Understanding the principles and rationale behind PromptArena’s approach to LLM testing.

Traditional software testing assumes deterministic behavior: given the same input, you get the same output. LLMs break this assumption.

Traditional Testing:

input("2+2") → output("4") // Always

LLM Testing:

input("Greet the user") → output("Hello! How can I help?")
→ output("Hi there! What can I do for you?")
→ output("Greetings! How may I assist you today?")

Each response is valid but different. This requires a fundamentally different testing approach.

Instead of testing for exact outputs, test for desired behaviors:

# ❌ Brittle: Exact match
assertions:
  - type: content_matches
    params:
      pattern: "^Thank you for contacting AcmeCorp support\\.$"
      message: "Exact wording required"

# ✅ Robust: Behavior validation
assertions:
  - type: content_includes
    params:
      patterns: ["thank", "AcmeCorp", "support"]
      message: "Must acknowledge contact"
  - type: llm_judge
    params:
      criteria: "Response has a professional tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be professional"
  - type: llm_judge
    params:
      criteria: "Response has a positive sentiment"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be positive"

Why: LLMs generate varied responses. Testing behavior allows flexibility while ensuring quality.

LLM quality isn’t binary (pass/fail). It’s multi-dimensional:

  • Correctness: Factually accurate?
  • Relevance: Addresses the query?
  • Tone: Appropriate style?
  • Safety: No harmful content?
  • Consistency: Maintains context?
  • Performance: Fast enough?

assertions:
  - type: content_includes # Correctness
    params:
      patterns: ["30-day return"]
      message: "Must mention return policy"
  - type: llm_judge # Tone
    params:
      criteria: "Response is helpful and supportive"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be helpful"
  - type: content_matches # Safety (negative lookahead)
    params:
      pattern: "^(?!.*(offensive|inappropriate)).*$"
      message: "Must not contain inappropriate content"
  - type: is_valid_json # Format
    params:
      message: "Response must be valid JSON"

Since absolute correctness is elusive, compare:

  • Across providers: OpenAI vs. Claude vs. Gemini
  • Across versions: GPT-4 vs. GPT-4o-mini
  • Across time: Regression detection
  • Against baselines: Human evaluation benchmarks

# Test same scenario across providers
providers: [openai-gpt4, claude-sonnet, gemini-pro]

# Compare results:
# - Which handles edge cases better?
# - Which is faster?
# - Which is more cost-effective?

Context matters in LLM testing:

# Same question, different contexts
turns:
  - name: "Technical Support Context"
    context:
      user_type: "developer"
      urgency: "high"
    turns:
      - role: user
        content: "How do I fix this error?"
    assertions:
      - type: content_includes
        params:
          patterns: ["code", "debug", "solution"]
  - name: "General Inquiry Context"
    context:
      user_type: "general"
      urgency: "low"
    turns:
      - role: user
        content: "How do I fix this error?"
    assertions:
      - type: content_includes
        params:
          patterns: ["help", "guide", "steps"]
          message: "Must provide helpful guidance"
      - type: llm_judge
        params:
          criteria: "Response is beginner-friendly and easy to understand"
          judge_provider: "openai/gpt-4o-mini"
          message: "Must be beginner-friendly"

In LLM testing, failures aren't just bugs; they're learning opportunities (see the tagging sketch after this list):

  • Pattern detection: What types of queries fail?
  • Edge case discovery: Where do models struggle?
  • Quality tracking: How does performance change over time?
  • Provider insights: Which model handles what best?
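
One low-effort way to support this kind of analysis is to tag scenarios by category so that pass/fail results can be grouped by tag. The sketch below is illustrative rather than a documented workflow: the tags field mirrors the Scenario example later on this page, while the scenario name and content are hypothetical.

# Hypothetical: tag scenarios so failure patterns can be grouped by category
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: refund-outside-window
spec:
  task_type: test
  description: "Refund request outside the 30-day window"
  tags: [refunds, edge-case]   # group pass/fail rates by these tags when reviewing results
  turns:
    - role: user
      content: "I bought this 45 days ago. Can I still return it?"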

┌─────────────┐
│ Exploratory │  Manual testing, edge cases
│ Testing     │
├─────────────┤
│ Integration │  Multi-turn, complex scenarios
│ Tests       │
├─────────────┤
│ Scenario    │  Single-turn, common patterns
│ Tests       │
├─────────────┤
│ Smoke       │  Basic functionality, mock providers
│ Tests       │
└─────────────┘

Implementation (a layered run sketch follows this list):

  1. Smoke Tests (Fast, Mock)

    • Validate configuration
    • Test scenario structure
    • Verify assertions work
    • Run in < 30 seconds
  2. Scenario Tests (Common Cases)

    • Core user journeys
    • Expected inputs
    • Standard behaviors
    • Run in < 5 minutes
  3. Integration Tests (Complex)

    • Multi-turn conversations
    • Tool calling
    • Edge cases
    • Run in < 20 minutes
  4. Exploratory (Human-in-loop)

    • Adversarial testing
    • Creative edge cases
    • Quality assessment
    • Ongoing
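
As a sketch of how these layers could be wired together, the run plan below tags each layer and gives it a provider and time budget. The runs and include_tags keys are illustrative only, not a documented PromptArena schema; the provider names reuse ones shown elsewhere on this page.

# Hypothetical layered run plan (keys are illustrative, not a documented schema)
runs:
  - name: smoke                # every commit, < 30 seconds
    provider: mock
    include_tags: [smoke]
  - name: scenario             # every merge, < 5 minutes
    provider: openai-gpt4o-mini
    include_tags: [scenario]
  - name: integration          # nightly, < 20 minutes
    providers: [openai-gpt4, claude-sonnet, gemini-pro]
    include_tags: [integration]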

Start simple, add complexity:

# Level 1: Structural
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Response must not be empty"
  - type: is_valid_json # If expecting JSON
    params:
      message: "Must be valid JSON"

# Level 2: Content
assertions:
  - type: content_includes
    params:
      patterns: ["key information"]
      message: "Must contain key information"
  - type: content_matches
    params:
      pattern: "^.{50,}$"
      message: "Must be at least 50 characters"

# Level 3: Quality
assertions:
  - type: llm_judge
    params:
      criteria: "Response has appropriate sentiment and tone for the context"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must have appropriate quality"
  - type: llm_judge
    params:
      criteria: "Response maintains a professional tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be professional"

# Level 4: Custom Business Logic
# Note: Custom validators would need to be implemented as extensions.
# For now, use pattern matching or an LLM judge for business rules.
assertions:
  - type: llm_judge
    params:
      criteria: "Response complies with brand guidelines and voice"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must meet brand compliance"

PromptArena uses the PromptPack specification for test scenarios. Why?

Portability: Test scenarios work across:

  • Different testing tools
  • Different providers
  • Different environments

Version Control: YAML format means:

  • Git-friendly diffs
  • Code review workflows
  • Change tracking

Human Readable: Non-developers can:

  • Write test scenarios
  • Review test cases
  • Understand failures
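
For example, a non-developer can read (and write) a complete scenario like the sketch below. It is modeled on the human-review example later on this page; the exact placement of assertions within the spec may differ in the real schema, and the scenario content is hypothetical.

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: returns-policy-basics
spec:
  task_type: test
  description: "Agent explains the 30-day return policy"
  tags: [returns]
  turns:
    - role: user
      content: "What is your return policy?"
  assertions:
    - type: content_includes
      params:
        patterns: ["30-day return"]
        message: "Must mention the return policy"
    - type: llm_judge
      params:
        criteria: "Response is accurate and helpful about the return policy"
        judge_provider: "openai/gpt-4o-mini"
        message: "Must be helpful"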

PromptArena abstracts provider differences:

# Same scenario, different providers
providers:
  - type: openai
    model: gpt-4o
  - type: anthropic
    model: claude-3-5-sonnet
  - type: google
    model: gemini-1.5-pro

Benefits:

  • Test portability across providers
  • Easy provider switching
  • Cost optimization
  • Vendor independence

Instead of code, use declarations:

# Declarative (PromptArena)
assertions:
  - type: content_includes
    params:
      patterns: ["customer service"]
      message: "Must mention customer service"
  - type: llm_judge
    params:
      criteria: "Response has positive sentiment"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be positive"

# vs. Imperative (traditional)
# assert "customer service" in response
# assert analyze_sentiment(response) == "positive"

Advantages:

  • Non-programmers can write tests
  • Consistent validation across scenarios
  • Easier to maintain
  • Better reporting

Mock providers enable:

  1. Fast Development: Test configuration without API calls
  2. Cost Control: Iterate without spending
  3. Deterministic Testing: Predictable responses
  4. Offline Development: Work without internet
  5. CI/CD Efficiency: Fast pipeline validation
# Validate structure (< 10 seconds, $0)
promptarena run --mock-provider

# Validate quality (~5 minutes, ~$0.05)
promptarena run --provider openai-gpt4o-mini

Several anti-patterns are worth calling out. The first is exact string matching:

# Too rigid
assertions:
  - type: content_matches
    params:
      pattern: "^Thank you for contacting support\\. Our business hours are 9am-5pm\\.$"
      message: "Exact match required"

Problem: Brittle. Any wording change breaks the test.

Better:

assertions:
  - type: content_includes
    params:
      patterns: ["thank", "support", "business hours"]
      message: "Must acknowledge support contact"
  - type: llm_judge
    params:
      criteria: "Response has a professional tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be professional"

The opposite anti-pattern is accepting anything at all:

# Too loose
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Must not be empty"

Problem: Accepts any garbage output.

Better:

assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Must not be empty"
  - type: content_includes
    params:
      patterns: ["relevant", "keywords"]
      message: "Must contain relevant content"
  - type: content_matches
    params:
      pattern: "^.{50,}$"
      message: "Must be at least 50 characters"
  - type: llm_judge
    params:
      criteria: "Response is appropriate and helpful for the context"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be appropriate"

Another anti-pattern is asserting a rigid response format:

# Assumes specific response structure
assertions:
  - type: content_matches
    params:
      pattern: "^Hello.*\\nHow can I help\\?$"
      message: "Exact format required"

Problem: LLMs vary formatting.

Better:

assertions:
  - type: content_includes
    params:
      patterns: ["hello", "help"]
      message: "Must greet and offer help"
  - type: llm_judge
    params:
      criteria: "Response has a welcoming tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be welcoming"

Finally, avoid testing implementation details instead of outcomes:

# Tests how, not what - too implementation-focused
assertions:
  - type: tools_called
    params:
      tools: ["calculate"]
      message: "Must use calculator tool"
conversation_assertions:
  - type: tool_calls_with_args
    params:
      tool: "calculate"
      expected_args:
        operation: "multiply"
        x: 2
        y: 2
      message: "Must pass exact args"

Problem: Couples test to implementation details.

Better:

# Tests outcome - focuses on behavior
assertions:
  - type: content_includes
    params:
      patterns: ["4"]
      message: "Must provide correct answer"
  - type: llm_judge
    params:
      criteria: "Response correctly states that 2 times 2 equals 4"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be factually correct"

Primary Metrics:

  • Pass Rate: Percentage of assertions passing
  • Response Time: Latency of responses
  • Cost: API spending per test run
  • Coverage: Scenarios tested vs. total scenarios

Secondary Metrics:

  • Failure Patterns: Which types of tests fail most?
  • Provider Comparison: Which model performs best?
  • Regression Detection: Are we improving or degrading?
  • Edge Case Coverage: How many corner cases tested?

# Quality gates
quality_gates:
  min_pass_rate: 0.95       # 95% of assertions must pass
  max_response_time: 3      # 3 seconds max
  max_cost_per_run: 0.50    # $0.50 per test run
  min_scenarios: 50         # At least 50 scenarios

# Test new prompt vs. old prompt
turns:
  - name: "Baseline Prompt"
    prompt_version: "v1.0"
    baseline: true
  - name: "Candidate Prompt"
    prompt_version: "v2.0"
    compare_to_baseline: true
    improvement_threshold: 0.05 # 5% better

# Continuous testing
schedule: "0 */6 * * *"     # Every 6 hours
alerts:
  - condition: pass_rate < 0.90
    action: notify_team
  - condition: response_time > 5
    action: page_oncall
  - condition: cost > daily_budget
    action: disable_tests

LLMs require human judgment for:

  • Subjective quality: Is this response “good”?
  • Creative content: Is this engaging/interesting?
  • Nuanced errors: Technically correct but contextually wrong
  • Benchmark creation: Ground truth for automated tests

Hybrid Approach:

Human Eval → Ground Truth → Automated Tests → Continuous Validation

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: requires-human-review
spec:
  task_type: test
  description: "Requires Human Review"
  tags: [human-review]
  turns:
    - role: user
      content: "Complex ethical question"
  human_evaluation:
    required: true
    criteria:
      - appropriateness
      - thoughtfulness
      - ethical_handling

LLM testing is fundamentally different from traditional testing:

  • Embrace non-determinism: Test behaviors, not exact outputs
  • Think multi-dimensionally: Quality has many facets
  • Compare relatively: Benchmark against alternatives
  • Iterate continuously: Quality improves over time
  • Balance automation and human judgment: Both are essential

PromptArena embodies these principles, providing a framework for robust, maintainable LLM testing that scales from development to production.