Validation Strategies

Comprehensive guide to designing effective validation and assertion strategies for LLM testing.

The Validation Challenge

LLM outputs are non-deterministic and variable. Traditional exact-match testing doesn’t work:

# ❌ This will fail - too rigid
assertions:
  - type: content_matches
    params:
      pattern: "^The capital of France is Paris\\.$"
      message: "Exact match required"

# LLM might say:
# - "Paris is the capital of France."
# - "The capital of France is Paris, France."
# - "France's capital city is Paris."

The core challenge: Validate intent and correctness without demanding exact wording.
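
To make the failure mode concrete, the sketch below runs the rigid pattern from the example above against the three phrasings, alongside a simple keyword check. It uses Python's `re` module as a stand-in; the helper names are hypothetical, and PromptArena's own matching engine may behave differently.

```python
import re

# The three phrasings listed above
RESPONSES = [
    "Paris is the capital of France.",
    "The capital of France is Paris, France.",
    "France's capital city is Paris.",
]

RIGID_PATTERN = r"^The capital of France is Paris\.$"


def exact_match(response: str) -> bool:
    """Rigid check: the whole response must equal one fixed sentence."""
    return re.search(RIGID_PATTERN, response) is not None


def mentions_paris(response: str) -> bool:
    """Flexible check: the response only has to mention the key fact."""
    return "Paris" in response


rigid_results = [exact_match(r) for r in RESPONSES]
flexible_results = [mentions_paris(r) for r in RESPONSES]
print(rigid_results)     # every correct phrasing is rejected
print(flexible_results)  # every correct phrasing is accepted
```

All three responses are correct answers, yet only the keyword check accepts them.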

Validation Principles

1. Test Behavior, Not Words

Focus on what the response achieves, not how it’s phrased:

# ✅ Good: Tests behavior
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]
      message: "Should mention Paris"

# ❌ Bad: Tests exact wording
assertions:
  - type: content_matches
    params:
      pattern: "^The capital is Paris$"
      message: "Exact match"

2. Layer Your Validations

Use multiple validation types from loose to strict:

assertions:
  # Layer 1: Basic content presence
  - type: content_includes
    params:
      patterns: ["key", "terms"]
  
  # Layer 2: Structural validation
  - type: is_valid_json
    params:
      message: "Must be valid JSON"
  
  # Layer 3: Schema validation
  - type: json_schema
    params:
      schema:
        type: object
        required: ["expected_field"]
  
  # Layer 4: Pattern matching
  - type: content_matches
    params:
      pattern: "business_rule_pattern"

3. Tolerate Variation

Build assertions that accept legitimate variation:

# ✅ Flexible
assertions:
  - type: content_matches
    params:
      pattern: "(refund|money back|return funds)"
      message: "Should mention refund option"
  
# ❌ Too rigid
assertions:
  - type: content_includes
    params:
      patterns: ["refund policy"]
      message: "Must say exactly 'refund policy'"

4. Fail Fast, Fail Clear

Design assertions that fail with helpful messages:

assertions:
  - type: content_includes
    params:
      patterns: ["critical_info"]
      message: "Missing required policy information"
  
  - type: content_matches
    params:
      pattern: "^(?!.*(harmful|inappropriate)).*$"
      message: "Response contains inappropriate content"

Validation Types

Content-Based Validation

String Contains

Check for required content:

# Single term
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]

# Multiple terms (all must be present)
assertions:
  - type: content_includes
    params:
      patterns: ["Paris", "France", "capital"]

# Any term (at least one must be present)
assertions:
  - type: content_matches
    params:
      pattern: "(Paris|France's capital|French capital)"
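
The difference between "all terms" and "any term" is easy to get wrong, so here is a minimal Python sketch of the two semantics described above. It assumes plain case-sensitive substring matching for `content_includes`, which the source does not confirm; the function names are hypothetical.

```python
import re
from typing import List


def includes_all(response: str, patterns: List[str]) -> bool:
    """AND semantics: every term must appear (assumed case-sensitive substring match)."""
    return all(term in response for term in patterns)


def matches_any(response: str, pattern: str) -> bool:
    """OR semantics: a regex alternation matches if at least one branch is found."""
    return re.search(pattern, response) is not None


response = "Paris is the French capital."
all_terms = includes_all(response, ["Paris", "France", "capital"])
any_term = matches_any(response, r"(Paris|France's capital|French capital)")
```

Here the "all terms" check fails because the response says "French" rather than "France", while the alternation passes on the first branch.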

Use when:

Limitations:

Regular Expressions

Pattern matching for structured content:

# Phone number format
assertions:
  - type: content_matches
    params:
      pattern: "\\+?1?\\d{9,15}"

# Email address
assertions:
  - type: content_matches
    params:
      pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# Date format (YYYY-MM-DD)
assertions:
  - type: content_matches
    params:
      pattern: "\\d{4}-\\d{2}-\\d{2}"

Use when:

Best practices:

String Length

Validate response length using regex patterns:

# Exact length (100 characters)
assertions:
  - type: content_matches
    params:
      pattern: "^.{100}$"
      message: "Response must be exactly 100 characters"

# Range (50-200 characters)
assertions:
  - type: content_matches
    params:
      pattern: "^.{50,200}$"
      message: "Response must be 50-200 characters"

# Maximum (conciseness test - up to 150 chars)
assertions:
  - type: content_matches
    params:
      pattern: "^.{1,150}$"
      message: "Response must be at most 150 characters"

# Minimum (completeness test - at least 50 chars)
assertions:
  - type: content_matches
    params:
      pattern: "^.{50,}$"
      message: "Response must be at least 50 characters"
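
One caveat with length patterns: in many regex engines, `.` does not match a newline by default, so `^.{50,200}$` can reject a multi-line response that is well within range. The Python sketch below illustrates the effect. Whether PromptArena behaves the same way depends on its regex engine, which the source does not specify, and `in_range` is a hypothetical helper.

```python
import re

LENGTH_PATTERN = r"^.{50,200}$"

single_line = "x" * 120
multi_line = ("x" * 60) + "\n" + ("y" * 60)  # 121 characters in total


def in_range(response: str, dot_matches_newline: bool = False) -> bool:
    """Apply the length pattern, optionally letting '.' match newlines."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.search(LENGTH_PATTERN, response, flags) is not None


print(in_range(single_line))                           # single line, within range
print(in_range(multi_line))                            # rejected: '.' stops at the newline
print(in_range(multi_line, dot_matches_newline=True))  # accepted once '.' spans lines
```

If the engine supports inline flags, prefixing the pattern with `(?s)` is the usual way to get the same effect from a config file.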

Use when:

Semantic Validation

Semantic validation can be implemented using custom validators or by combining multiple content assertions:

turns:
  - role: user
    content: "What's the capital of France?"
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]
          message: "Should mention Paris"
      
      - type: content_matches
        params:
          pattern: "(?i)(capital|city)"
          message: "Should reference capital/city"

Use when:

Sentiment Analysis

Sentiment and tone can be checked using pattern matching:

turns:
  - role: user
    content: "I'm frustrated with this issue"
    assertions:
      - type: content_matches
        params:
          pattern: "(?i)(understand|help|sorry|apologize)"
          message: "Should show empathy"
      
      - type: content_includes
        params:
          patterns: ["assist", "resolve"]
          message: "Should offer assistance"

Use when:

Structural Validation

JSON Validation

Validate JSON structure:

# Valid JSON
assertions:
  - type: is_valid_json
    params:
      message: "Response must be valid JSON"

# JSON with schema
assertions:
  - type: json_schema
    params:
      schema:
        type: object
        properties:
          name:
            type: string
          age:
            type: integer
        required: [name, age]
      message: "Response must match schema"

Use when:

Example:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: extract-user-data

spec:
  task_type: extraction
  description: "Extract User Data"
  
  turns:
    - role: user
      content: "Extract: John Doe, age 30, john@example.com"
      assertions:
        - type: is_valid_json
          params:
            message: "Should return valid JSON"
        - type: json_schema
          params:
            schema:
              type: object
              properties:
                name: {type: string}
                age: {type: integer}
                email: {type: string, format: email}
              required: [name, age, email]
            message: "Should match user schema"
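
For readers who want to see what the two assertions check, here is a rough Python equivalent using only the standard library. It is a sketch under stated assumptions: it covers the `required` and `type` rules from the schema above but not `format: email`, and `check_user_json` is a hypothetical helper, not part of PromptArena.

```python
import json
from typing import Any, Dict, Tuple

# Expected field name -> Python type, mirroring the schema in the scenario above
REQUIRED_FIELDS: Dict[str, type] = {"name": str, "age": int, "email": str}


def check_user_json(response: str) -> Tuple[bool, str]:
    """Return (passed, message) for a response expected to hold a user record."""
    try:
        data: Any = json.loads(response)
    except json.JSONDecodeError:
        return False, "Should return valid JSON"

    if not isinstance(data, dict):
        return False, "Should match user schema: top level must be an object"

    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return False, f"Should match user schema: missing '{field}'"
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False, f"Should match user schema: '{field}' has the wrong type"

    return True, "ok"


print(check_user_json('{"name": "John Doe", "age": 30, "email": "john@example.com"}'))
```

Separating the parse step from the schema step gives a clearer failure message: a response that is not JSON at all is reported differently from one that is JSON with a missing field.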

List/Array Validation

Validate lists in responses:

turns:
  - role: user
    content: "List the top items"
    assertions:
      # Check for multiple items with pattern
      - type: content_matches
        params:
          pattern: "item1.*item2.*item3"
          message: "Should contain all items"
      
      # Check for any option
      - type: content_matches
        params:
          pattern: "(option1|option2)"
          message: "Should contain at least one option"

Use when:

Format Compliance

Validate specific formats using pattern matching:

assertions:
  # Markdown (check for markdown syntax)
  - type: content_matches
    params:
      pattern: "(^#{1,6} |\\*\\*|\\*|`|\\[.*\\]\\(.*\\))"
      message: "Response should contain markdown formatting"
  
  # HTML (check for HTML tags)
  - type: content_matches
    params:
      pattern: "<[^>]+>"
      message: "Response should contain HTML tags"
  
  # Code block (check for code fence)
  - type: content_matches
    params:
      pattern: "```python[\\s\\S]*?```"
      message: "Response should contain Python code block"

Negative Validation

Test what should NOT appear using negative lookahead patterns:

assertions:
  # Must not contain specific words (use negative lookahead)
  - type: content_matches
    params:
      pattern: "^(?!.*(inappropriate|offensive|harmful)).*$"
      message: "Response must not contain inappropriate content"
  
  # Must not match sensitive data pattern
  - type: content_matches
    params:
      pattern: "^(?!.*\\b(password|secret|api[_-]?key)\\b).*$"
      message: "Response must not contain sensitive data keywords"
  
  # For conversation-level "not contains" checks
  # Use conversation-level assertion:
  # - type: content_not_includes
  #   params:
  #     patterns: ["inappropriate", "offensive"]
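
The negative lookahead idiom can be tried out locally before it goes into a scenario. The sketch below uses Python's `re` module, which supports lookahead; not every regex engine does, and the source does not say which engine PromptArena uses, so treat this as an illustration of the pattern rather than of the tool. `passes_negative_check` is a hypothetical helper.

```python
import re

FORBIDDEN = r"^(?!.*(inappropriate|offensive|harmful)).*$"


def passes_negative_check(response: str) -> bool:
    """True when none of the forbidden words appears anywhere in the response.

    re.DOTALL lets '.' cross line breaks, so a multi-line response is
    checked as a whole instead of failing at the first newline.
    """
    return re.search(FORBIDDEN, response, re.DOTALL) is not None


print(passes_negative_check("Happy to help.\nAnything else?"))
print(passes_negative_check("Line one\nthis is offensive"))
```

The pattern succeeds only if the lookahead finds no forbidden word from the start of the text, which is why it is anchored with `^` and `$`.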

Use when:

Example:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: no-pii-leakage

spec:
  task_type: security
  description: "No PII Leakage"
  
  turns:
    - role: user
      content: "Summarize the customer record"
      assertions:
        - type: content_matches
          params:
            pattern: "^(?!.*\\d{3}-\\d{2}-\\d{4}).*$"
            message: "Should not contain SSN"
        - type: content_matches
          params:
            pattern: "^(?!.*\\d{16}).*$"
            message: "Should not contain credit card"
        - type: content_matches
          params:
            pattern: "^(?!.*(password|secret)).*$"
            message: "Should not contain sensitive keywords"

Multi-Turn Validation

Validate conversation coherence:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: context-retention

spec:
  task_type: test
  description: "Context Retention"
  
  turns:
    - role: user
      content: "My name is Alice"
      assertions:
        - type: content_includes
          params:
            patterns: ["Alice"]
            message: "Should acknowledge the name"

    - role: user
      content: "What's my name?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Alice"]
            message: "Should remember the name"

Validation types:

assertions:
  # References earlier context
turn_index: 0
  
  # Maintains consistency
  - type: consistent_with_turn
    turn_index: 0
  
  # State progression
  - type: state_changed
    from: "initial"
    to: "confirmed"

Validation Patterns

The Pyramid Pattern

Layer validations from basic to advanced:

assertions:
  # Base: Basic presence
  - type: content_matches
    params:
      pattern: ".+"
      message: "Response must not be empty"
  
  # Level 2: Content presence
  - type: content_includes
    params:
      patterns: ["required", "terms"]
      message: "Must contain required terms"
  
  # Level 3: Structure
  - type: is_valid_json
    params:
      message: "Response must be valid JSON"
  
  # Level 4: Semantics
  - type: llm_judge
    params:
      criteria: "Response is semantically appropriate for the query"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be semantically appropriate"
  
  # Level 5: Business logic
  - type: llm_judge
    params:
      criteria: "Response follows business rules and policies"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must follow business rules"
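
The layering idea can be sketched in Python as an ordered list of checks that reports the lowest failing layer. This covers only the first three levels; the two `llm_judge` levels need a model call and are left out. The names are hypothetical, and the sketch does not claim that PromptArena itself stops at the first failure.

```python
import json
import re
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[str], bool]]


def is_valid_json(response: str) -> bool:
    """True if the response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


# Cheapest checks first; each entry is (failure message, predicate)
PYRAMID: List[Check] = [
    ("Response must not be empty", lambda r: re.search(r".+", r) is not None),
    ("Must contain required terms", lambda r: all(t in r for t in ["required", "terms"])),
    ("Response must be valid JSON", is_valid_json),
]


def first_failure(response: str) -> Optional[str]:
    """Return the message of the lowest failing layer, or None if all pass."""
    for message, predicate in PYRAMID:
        if not predicate(response):
            return message
    return None


print(first_failure("hello"))
```

Reporting the lowest failing layer points at the most basic problem first: there is little value in judging the semantics of a response that is empty or unparseable.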

Benefits:

The Specificity Spectrum

Balance between too loose and too strict:

# Too loose (might pass bad responses)
assertions:
  - type: not_empty

# Too strict (might fail good responses)
assertions:
  - type: content_matches
    params:
      pattern: "^The capital of France is Paris\\.$"

# Just right (validates meaning, allows variation)
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]

Guidelines:

The Safety Net Pattern

Multiple validations to catch different failures:

turns:
  - role: user
    content: "Ask a question"
    assertions:
      # Content safety net
      - type: content_matches
        params:
          pattern: "(answer1|answer2|answer3)"
          message: "Should contain one of the expected answers"
      
      # Format safety net
      - type: is_valid_json
        params:
          message: "Should return valid JSON"
      
      - type: json_path
        params:
          jmespath_expression: "required_field"
          message: "Should have required field"

The Progressive Validation Pattern

Validate incrementally through conversation:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: progressive-validation

spec:
  task_type: test
  description: "Progressive Validation"
  
  turns:
    # Turn 1: Establish baseline
    - role: user
      content: "Start order"
      assertions:
        - type: content_includes
          params:
            patterns: ["order", "started"]
            message: "Should indicate order started"

    # Turn 2: Validate state progression
    - role: user
      content: "Add item"
      assertions:
        - type: content_includes
          params:
            patterns: ["item", "added"]
            message: "Should confirm item added"

    # Turn 3: Validate completion
    - role: user
      content: "Checkout"
      assertions:
        - type: content_includes
          params:
            patterns: ["order", "complete", "total", "confirmation"]
            message: "Should confirm order completion"

Advanced Techniques

Custom Validators

Write custom validation logic:

assertions:
  - type: custom
    validator: check_business_hours
    args:
      timezone: "America/New_York"

Implementation:

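import re
from datetime import datetime


def check_business_hours(response: str, timezone: str) -> bool:
    """Return True if the first HH:MM time in the response is within 9 AM - 5 PM."""
    # Extract the first time-like token from the response
    time_match = re.search(r'\b\d{1,2}:\d{2}\b', response)
    if not time_match:
        return False

    # Parse and validate; out-of-range values such as "25:99" are rejected
    try:
        parsed = datetime.strptime(time_match.group(), '%H:%M')
    except ValueError:
        return False
    # Note: timezone is accepted but not applied; the time is compared as written
    return 9 <= parsed.hour < 17  # 9 AM - 5 PM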

Multiple Assertions

Combine multiple checks (all must pass):

assertions:
  # All of these assertions must pass (implicit AND)
  - type: content_includes
    params:
      patterns: ["key_term"]
      message: "Must contain key term"
  
  - type: content_matches
    params:
      pattern: "^.{50,200}$"
      message: "Must be 50-200 characters"
  
  - type: llm_judge
    params:
      criteria: "Response has a positive tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Response should be positive"

  # For OR logic, use regex alternation:
  - type: content_matches
    params:
      pattern: "(option1|option2)"
      message: "Must contain option1 OR option2"

Context-Aware Validation

Validate based on context using separate scenarios:

# Note: Arena doesn't support conditional assertions.
# Instead, create separate scenarios for different contexts:

# Scenario 1: Premium users
- name: premium_user_support
  context:
    variables:
      user_tier: "premium"
  turns:
    - role: user
      content: "I need help"
    - role: assistant
      assertions:
        - type: content_includes
          params:
            patterns: ["priority support"]
            message: "Premium users should get priority support"

# Scenario 2: Standard users
- name: standard_user_support
  context:
    variables:
      user_tier: "standard"
  turns:
    - role: user
      content: "I need help"
    - role: assistant
      assertions:
        - type: content_includes
          params:
            patterns: ["standard support"]
            message: "Standard users should get standard support"

Statistical Validation

Validate across multiple runs:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: statistical-test

spec:
  task_type: test
  description: "Statistical Test"
  
  runs: 10  # Run 10 times
  # Note: Statistical validation would require running the scenario multiple times
  # and checking aggregate results. Arena doesn't have built-in statistical
  # validation, but you can run scenarios multiple times and analyze results.
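
Since the aggregation has to happen outside Arena, a small script is enough. The following is a minimal sketch that assumes you have already collected the response text from each run; how you export those results is not covered here, and the function names and the 0.9 threshold are illustrative choices.

```python
from typing import Callable, List


def pass_rate(responses: List[str], check: Callable[[str], bool]) -> float:
    """Fraction of responses that satisfy the check (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if check(r)) / len(responses)


def meets_threshold(
    responses: List[str], check: Callable[[str], bool], threshold: float = 0.9
) -> bool:
    """True when the pass rate is at least the required threshold."""
    return pass_rate(responses, check) >= threshold


def mentions_paris(response: str) -> bool:
    return "Paris" in response


# Hypothetical outputs collected from 10 runs of the same scenario
collected = ["Paris is the capital of France."] * 9 + ["I am not sure."]
rate = pass_rate(collected, mentions_paris)
print(rate)
```

A threshold below 100% acknowledges that an occasional miss is expected from a non-deterministic model, while still flagging a prompt that fails often.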

Best Practices

1. Start Simple, Add Complexity

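# Start with basic validation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]

# Add semantic validation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
  - type: llm_judge
    params:
      criteria: "Response is semantically appropriate for the query"
      judge_provider: "openai/gpt-4o-mini"

# Add format validation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
  - type: is_valid_json
    params:
      message: "Response must be valid JSON"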

2. Test Your Validations

Run validations against known good/bad responses:

validation_tests:
  good_responses:
    - "Paris is the capital of France"
    - "France's capital city is Paris"
    - "The capital of France is Paris"
  
  bad_responses:
    - "London is the capital"
    - "France is a country"
    - ""
  
  assertions:
    - type: content_includes
      params:
        patterns: ["Paris"]
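
The same idea can be exercised locally with a few lines of Python before the assertion is committed to a scenario. This is a sketch, not an Arena feature: the check functions approximate the assertion with a plain substring test, and `audit` is a hypothetical helper.

```python
from typing import Callable, Dict, List

GOOD_RESPONSES = [
    "Paris is the capital of France",
    "France's capital city is Paris",
    "The capital of France is Paris",
]
BAD_RESPONSES = ["London is the capital", "France is a country", ""]


def includes_paris(response: str) -> bool:
    """Stand-in for the content_includes assertion above."""
    return "Paris" in response


def not_empty(response: str) -> bool:
    """A deliberately loose check, for comparison."""
    return len(response) > 0


def audit(check: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Collect the responses that the check misclassifies."""
    return {
        "wrongly_rejected": [r for r in GOOD_RESPONSES if not check(r)],
        "wrongly_accepted": [r for r in BAD_RESPONSES if check(r)],
    }


report = audit(includes_paris)
loose_report = audit(not_empty)
print(report)
print(loose_report)
```

An empty report means the assertion separates the known good and bad responses; anything listed under `wrongly_accepted` or `wrongly_rejected` is a concrete case to fix before relying on the assertion.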

3. Use Descriptive Failure Messages

assertions:
  - type: content_includes
    params:
      patterns: ["refund policy"]
      message: "Response must include refund policy details"
  
  - type: content_matches
    params:
      pattern: "^(?!.*(offensive|inappropriate)).*$"
      message: "Response must not contain inappropriate language"

4. Balance Precision and Recall

# High precision (few false positives) - exact pattern
assertions:
  - type: content_matches
    params:
      pattern: "^The specific answer is: [A-Z]$"
      message: "Must match exact format"

# High recall (few false negatives) - matches any option
assertions:
  - type: content_matches
    params:
      pattern: "(answer1|answer2|answer3)"
      message: "Must contain at least one answer"

# Balanced - specific but flexible
assertions:
  - type: content_includes
    params:
      patterns: ["answer", "option"]
      message: "Must mention both answer and option"
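
To put numbers on this trade-off, you can score an assertion against a small set of hand-labeled responses. The sketch below treats "assertion passes" as the positive prediction; the labeled data is invented for illustration and the helper names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# (response, is_actually_good) pairs; hypothetical labeled data
LABELED: List[Tuple[str, bool]] = [
    ("The specific answer is: B", True),
    ("I would go with answer1.", True),
    ("answer2 is the best fit.", True),
    ("Probably answer3.", True),
    ("None of answer1, answer2, or answer3 applies.", False),
]


def precision_recall(check: Callable[[str], bool]) -> Tuple[float, float]:
    """Precision and recall of a check, where passing counts as positive."""
    tp = sum(1 for r, good in LABELED if good and check(r))
    fp = sum(1 for r, good in LABELED if not good and check(r))
    fn = sum(1 for r, good in LABELED if good and not check(r))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def strict(response: str) -> bool:
    return re.search(r"^The specific answer is: [A-Z]$", response) is not None


def loose(response: str) -> bool:
    return re.search(r"(answer1|answer2|answer3)", response) is not None


print(precision_recall(strict))
print(precision_recall(loose))
```

On this data the strict pattern never accepts a bad response but misses most good ones, while the loose pattern accepts more good responses at the cost of letting one bad response through.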

5. Document Validation Intent

assertions:
  # Validate core requirement
  - type: content_includes
    params:
      patterns: ["Paris"]
      message: "Must correctly identify capital"
  
  # Validate safety
  - type: content_matches
    params:
      pattern: "^(?!.*offensive).*$"
      message: "Must maintain appropriate tone"
  
  # Validate format
  - type: is_valid_json
    params:
      message: "Output must be parseable JSON"

Common Pitfalls

Over-Specification

# ❌ Too specific
assertions:
  - type: content_matches
    params:
      pattern: "^The capital of France is Paris\\.$"

# ✅ Appropriately flexible
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]

Under-Specification

# ❌ Too loose
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Must not be empty"

# ✅ Adequately constrained
assertions:
  - type: content_includes
    params:
      patterns: ["Paris", "France"]
      message: "Must mention Paris and France"
  - type: content_matches
    params:
      pattern: "^.{10,}$"
      message: "Must be at least 10 characters"

Brittle Assertions

# ❌ Breaks with minor changes
assertions:
  - type: content_matches
    params:
      pattern: "^The answer is"
      message: "Must start with exact phrase"

# ✅ Robust to variation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
      message: "Must mention answer"

Missing Negative Tests

# ✅ Test both positive and negative
assertions:
  # Must have
  - type: content_includes
    params:
      patterns: ["correct_info"]
      message: "Must contain correct information"
  
  # Must not have (use negative lookahead)
  - type: content_matches
    params:
      pattern: "^(?!.*(incorrect|harmful)).*$"
      message: "Must not contain incorrect or harmful content"

Validation Checklist

Before finalizing assertions, check:

Conclusion

Effective validation:

PromptArena provides powerful validation tools that enable robust testing while accommodating LLM variability.

Further Reading