Validation Strategies
Comprehensive guide to designing effective validation and assertion strategies for LLM testing.
The Validation Challenge
LLM outputs are non-deterministic and variable, so traditional exact-match testing doesn’t work:

    # ❌ This will fail - too rigid
    assertions:
      - type: content_matches
        params:
          pattern: "^The capital of France is Paris\\.$"
          message: "Exact match required"

    # LLM might say:
    # - "Paris is the capital of France."
    # - "The capital of France is Paris, France."
    # - "France's capital city is Paris."

The core challenge: Validate intent and correctness without demanding exact wording.
Validation Principles
1. Test Behavior, Not Words
Focus on what the response achieves, not how it’s phrased:

    # ✅ Good: Tests behavior
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]
          message: "Should mention Paris"

    # ❌ Bad: Tests exact wording
    assertions:
      - type: content_matches
        params:
          pattern: "^The capital is Paris$"
          message: "Exact match"

2. Layer Your Validations
Use multiple validation types from loose to strict:

    assertions:
      # Layer 1: Basic content presence
      - type: content_includes
        params:
          patterns: ["key", "terms"]

      # Layer 2: Structural validation
      - type: is_valid_json
        params:
          message: "Must be valid JSON"

      # Layer 3: Schema validation
      - type: json_schema
        params:
          schema:
            type: object
            required: ["expected_field"]

      # Layer 4: Pattern matching
      - type: content_matches
        params:
          pattern: "business_rule_pattern"

3. Tolerate Variation
Build assertions that accept legitimate variation:

    # ✅ Flexible
    assertions:
      - type: content_matches
        params:
          pattern: "(refund|money back|return funds)"
          message: "Should mention refund option"

    # ❌ Too rigid
    assertions:
      - type: content_includes
        params:
          patterns: ["refund policy"]
          message: "Must say exactly 'refund policy'"

4. Fail Fast, Fail Clear
Design assertions that fail with helpful messages:

    assertions:
      - type: content_includes
        params:
          patterns: ["critical_info"]
          message: "Missing required policy information"

      - type: content_matches
        params:
          pattern: "^(?!.*(harmful|inappropriate)).*$"
          message: "Response contains inappropriate content"

Validation Types
Content-Based Validation
String Contains
Check for required content:

    # Single term
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]

    # Multiple terms (all must be present)
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris", "France", "capital"]

    # Any term (at least one must be present)
    assertions:
      - type: content_matches
        params:
          pattern: "(Paris|France's capital|French capital)"

Use when:
- Testing for required information
- Verifying key terms appear
- Checking compliance with instructions
Limitations:
- Doesn’t validate meaning
- Can’t detect context misuse
- No word order validation
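The difference between the all-terms and any-term checks, and the limitations listed above, can be illustrated with a minimal Python sketch. The helpers `includes_all` and `matches_any` are hypothetical stand-ins that mimic the semantics described here, assuming case-sensitive substring checks and an unanchored regex search; they are not PromptArena's implementation.

```python
import re

def includes_all(text: str, patterns: list[str]) -> bool:
    """Hypothetical stand-in for content_includes: every term must appear."""
    return all(p in text for p in patterns)

def matches_any(text: str, pattern: str) -> bool:
    """Hypothetical stand-in for content_matches: the regex must match somewhere."""
    return re.search(pattern, text) is not None

responses = [
    "Paris is the capital of France.",
    "France's capital city is Paris.",
    "Paris is not the capital of France.",  # wrong, but contains every term
]

for r in responses:
    all_terms = includes_all(r, ["Paris", "France", "capital"])
    any_term = matches_any(r, r"(Paris|French capital)")
    print(all_terms, any_term, r)
```

All three responses pass both checks, including the incorrect third one. Keyword checks confirm that terms are present; they say nothing about whether the statement built from them is true.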
Regular Expressions
Pattern matching for structured content:

    # Phone number format
    assertions:
      - type: regex
        value: "\\+?1?\\d{9,15}"

    # Email address
    assertions:
      - type: regex
        value: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

    # Date format (YYYY-MM-DD)
    assertions:
      - type: regex
        value: "\\d{4}-\\d{2}-\\d{2}"

Use when:
- Validating format compliance
- Extracting structured data
- Checking pattern adherence
Best practices:
- Keep patterns simple
- Use anchors (^, $) carefully
- Test pattern against variations
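One way to follow the last two practices is to run a pattern over a handful of sample responses before putting it in a scenario. The sketch below uses Python's `re` module and the date pattern from this section; the sample strings are made up for illustration, and a different regex engine may behave differently in edge cases.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"

# Sample text mapped to whether an unanchored search is expected to match
samples = {
    "2024-03-15": True,                        # bare date
    "Your order ships on 2024-03-15.": True,   # date inside a sentence
    "March 15, 2024": False,                   # different format
    "2024-13-45": True,                        # right shape, impossible date
}

for text, expected in samples.items():
    unanchored = re.search(DATE, text) is not None
    anchored = re.fullmatch(DATE, text) is not None  # same as wrapping in ^...$
    print(f"{text!r}: unanchored={unanchored} anchored={anchored}")
    assert unanchored == expected
```

Anchoring the pattern rejects the date embedded in a sentence, which is usually how an LLM will phrase it. Neither form rejects `2024-13-45`: the pattern checks the shape of a date, not its validity.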
String Length
Validate response length using regex patterns:

    # Exact length (100 characters)
    assertions:
      - type: content_matches
        params:
          pattern: "^.{100}$"
          message: "Response must be exactly 100 characters"

    # Range (50-200 characters)
    assertions:
      - type: content_matches
        params:
          pattern: "^.{50,200}$"
          message: "Response must be 50-200 characters"

    # Maximum (conciseness test - up to 150 chars)
    assertions:
      - type: content_matches
        params:
          pattern: "^.{1,150}$"
          message: "Response must be at most 150 characters"

    # Minimum (completeness test - at least 50 chars)
    assertions:
      - type: content_matches
        params:
          pattern: "^.{50,}$"
          message: "Response must be at least 50 characters"

Use when:
- Enforcing conciseness
- Ensuring completeness
- Testing summarization
- Validating character limits
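A caveat worth checking before relying on these patterns: in Python's `re` module, and in many other engines, `.` does not match a newline by default, so a length pattern can reject a multi-line response that is well within the limit. Whether the same holds for the regex engine behind `content_matches` is an assumption to verify. The sketch below compares the range pattern with a plain character count; both helper functions are hypothetical.

```python
import re

def within_length_regex(text: str) -> bool:
    # Same pattern as the 50-200 character range check above
    return re.search(r"^.{50,200}$", text) is not None

def within_length_count(text: str) -> bool:
    # Plain character count, newlines included
    return 50 <= len(text) <= 200

single_line = "x" * 80
multi_line = ("x" * 40) + "\n" + ("x" * 40)  # 81 characters over two lines

print(within_length_regex(single_line), within_length_count(single_line))
print(within_length_regex(multi_line), within_length_count(multi_line))
```

The single-line text passes both checks, while the two-line text passes only the character count. If the engine supports it, replacing `.` with `[\s\S]` is one way to make the pattern count newlines as characters.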
Semantic Validation
Semantic validation can be implemented using custom validators or by combining multiple content assertions:

    turns:
      - role: user
        content: "What's the capital of France?"
        assertions:
          - type: content_includes
            params:
              patterns: ["Paris"]
              message: "Should mention Paris"

          - type: content_matches
            params:
              pattern: "(?i)(capital|city)"
              message: "Should reference capital/city"

Use when:
- Testing paraphrased responses
- Validating key information is present
- Checking for contextually relevant terms
Sentiment Analysis
Sentiment and tone can be approximated using pattern matching:

    turns:
      - role: user
        content: "I'm frustrated with this issue"
        assertions:
          - type: content_matches
            params:
              pattern: "(?i)(understand|help|sorry|apologize)"
              message: "Should show empathy"

          - type: content_includes
            params:
              patterns: ["assist", "resolve"]
              message: "Should offer assistance"

Use when:
- Testing customer support tone
- Validating empathy
- Checking brand voice
- Ensuring professional language
Structural Validation
JSON Validation
Validate JSON structure:

    # Valid JSON
    assertions:
      - type: is_valid_json
        params:
          message: "Response must be valid JSON"

    # JSON with schema
    assertions:
      - type: json_schema
        params:
          schema:
            type: object
            properties:
              name:
                type: string
              age:
                type: integer
            required: [name, age]
          message: "Response must match schema"

Use when:
- Testing structured output
- Validating API responses
- Checking data extraction
Example:
    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Scenario
    metadata:
      name: extract-user-data

    spec:
      task_type: extraction
      description: "Extract User Data"

      turns:
        - role: user
          content: "Extract: John Doe, age 30, john@example.com"
          assertions:
            - type: is_valid_json
              params:
                message: "Should return valid JSON"
            - type: json_schema
              params:
                schema:
                  type: object
                  properties:
                    name: {type: string}
                    age: {type: integer}
                    email: {type: string, format: email}
                  required: [name, age, email]
                message: "Should match user schema"

List/Array Validation
Validate lists in responses:

    turns:
      - role: user
        content: "List the top items"
        assertions:
          # Check for multiple items with pattern
          - type: content_matches
            params:
              pattern: "item1.*item2.*item3"
              message: "Should contain all items"

          # Check for any option
          - type: content_matches
            params:
              pattern: "(option1|option2)"
              message: "Should contain at least one option"

Use when:
- Testing enumeration tasks
- Validating option lists
- Checking recommendations
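Note that a pattern such as `item1.*item2.*item3` also encodes an order, and in engines where `.` does not match a newline it expects all items on one line. The Python sketch below contrasts it with an order-independent check; `contains_all_items` is a hypothetical helper, and the behaviour shown is that of Python's `re` module rather than a statement about PromptArena's engine.

```python
import re

ORDERED = r"item1.*item2.*item3"

def contains_all_items(text: str, items: list[str]) -> bool:
    # Order-independent: each item only has to appear somewhere
    return all(item in text for item in items)

one_line = "Top picks: item1, item2, item3"
reordered = "Top picks: item3, item1, item2"
bulleted = "- item1\n- item2\n- item3"

for text in (one_line, reordered, bulleted):
    ordered_ok = re.search(ORDERED, text) is not None
    any_order_ok = contains_all_items(text, ["item1", "item2", "item3"])
    print(ordered_ok, any_order_ok)
```

Only the first response satisfies the ordered pattern, while all three contain every item. When order does not matter, listing the items as separate terms in a `content_includes` assertion expresses the intent more directly.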
Format Compliance
Validate specific formats using pattern matching:

    assertions:
      # Markdown (check for markdown syntax)
      - type: content_matches
        params:
          pattern: "(^#{1,6} |\\*\\*|\\*|`|\\[.*\\]\\(.*\\))"
          message: "Response should contain markdown formatting"

      # HTML (check for HTML tags)
      - type: content_matches
        params:
          pattern: "<[^>]+>"
          message: "Response should contain HTML tags"

      # Code block (check for code fence)
      - type: content_matches
        params:
          pattern: "```python[\\s\\S]*?```"
          message: "Response should contain Python code block"

Negative Validation
Test what should NOT appear using negative lookahead patterns:

    assertions:
      # Must not contain specific words (use negative lookahead)
      - type: content_matches
        params:
          pattern: "^(?!.*(inappropriate|offensive|harmful)).*$"
          message: "Response must not contain inappropriate content"

      # Must not match sensitive data pattern
      - type: content_matches
        params:
          pattern: "^(?!.*\\b(password|secret|api[_-]?key)\\b).*$"
          message: "Response must not contain sensitive data keywords"

      # For conversation-level "not contains" checks
      # Use conversation-level assertion:
      # - type: content_not_includes
      #   params:
      #     patterns: ["inappropriate", "offensive"]

Use when:
- Testing content filtering
- Preventing data leakage
- Validating safety guardrails
- Checking compliance
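Negative-lookahead patterns invert the usual logic: the assertion passes when the pattern matches, which happens only if the forbidden text is absent. The Python sketch below shows this with SSN-style and card-number patterns of the kind used in this section; `passes` is a hypothetical helper and the sample strings are made up. Lookahead is not available in every regex engine (RE2-style engines omit it), so it is worth confirming that your runner accepts these patterns.

```python
import re

NO_SSN = r"^(?!.*\d{3}-\d{2}-\d{4}).*$"
NO_CARD = r"^(?!.*\d{16}).*$"

def passes(pattern: str, text: str) -> bool:
    # The assertion passes only when the negative-lookahead pattern matches
    return re.search(pattern, text) is not None

print(passes(NO_SSN, "Customer since 2019, account in good standing"))
print(passes(NO_SSN, "SSN on file: 123-45-6789"))
print(passes(NO_CARD, "Card 4111111111111111 on file"))
print(passes(NO_CARD, "Card 4111 1111 1111 1111 on file"))  # spaced digits
```

The first and last checks pass; the middle two fail. The last case is a leak that the pattern misses because the digits are separated by spaces, so format-based negative checks are best treated as one safety layer, not a guarantee.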
Example:
    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Scenario
    metadata:
      name: no-pii-leakage

    spec:
      task_type: security
      description: "No PII Leakage"

      turns:
        - role: user
          content: "Summarize the customer record"
          assertions:
            - type: content_matches
              params:
                pattern: "^(?!.*\\d{3}-\\d{2}-\\d{4}).*$"
                message: "Should not contain SSN"
            - type: content_matches
              params:
                pattern: "^(?!.*\\d{16}).*$"
                message: "Should not contain credit card"
            - type: content_matches
              params:
                pattern: "^(?!.*(password|secret)).*$"
                message: "Should not contain sensitive keywords"

Multi-Turn Validation
Validate conversation coherence:

    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Scenario
    metadata:
      name: context-retention

    spec:
      task_type: test
      description: "Context Retention"

      turns:
        - role: user
          content: "My name is Alice"
          assertions:
            - type: content_includes
              params:
                patterns: ["Alice"]
                message: "Should acknowledge the name"

        - role: user
          content: "What's my name?"
          assertions:
            - type: content_includes
              params:
                patterns: ["Alice"]
                message: "Should remember the name"

Validation types:
    assertions:
      # References earlier context
      turn_index: 0

      # Maintains consistency
      - type: consistent_with_turn
        turn_index: 0

      # State progression
      - type: state_changed
        from: "initial"
        to: "confirmed"

Validation Patterns
The Pyramid Pattern
Layer validations from basic to advanced:

    assertions:
      # Base: Basic presence
      - type: content_matches
        params:
          pattern: ".+"
          message: "Response must not be empty"

      # Level 2: Content presence
      - type: content_includes
        params:
          patterns: ["required", "terms"]
          message: "Must contain required terms"

      # Level 3: Structure
      - type: is_valid_json
        params:
          message: "Response must be valid JSON"

      # Level 4: Semantics
      - type: llm_judge
        params:
          criteria: "Response is semantically appropriate for the query"
          judge_provider: "openai/gpt-4o-mini"
          message: "Must be semantically appropriate"

      # Level 5: Business logic
      - type: llm_judge
        params:
          criteria: "Response follows business rules and policies"
          judge_provider: "openai/gpt-4o-mini"
          message: "Must follow business rules"

Benefits:
- Fast failure on basic issues
- Detailed validation only if basics pass
- Clear failure diagnostics
- Efficient test execution
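The ordering idea behind the pyramid can be sketched in Python: run the cheap checks first and report the first layer that fails, so expensive checks (such as an LLM judge) are only reached when the basics pass. The `LAYERS` list and `first_failure` function are hypothetical and cover only the first three layers; this illustrates the pattern and makes no claim about whether PromptArena itself stops at the first failing assertion.

```python
import json
from typing import Callable, Optional

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

# Ordered from cheapest and most basic to most specific
LAYERS: list[tuple[str, Callable[[str], bool]]] = [
    ("Response must not be empty", lambda t: bool(t.strip())),
    ("Must contain required terms", lambda t: "required" in t and "terms" in t),
    ("Response must be valid JSON", is_json),
]

def first_failure(text: str) -> Optional[str]:
    """Return the message of the first failing layer, or None if all pass."""
    for message, check in LAYERS:
        if not check(text):
            return message
    return None

print(first_failure(""))
print(first_failure("required terms, but not JSON"))
print(first_failure('{"required": "terms"}'))
```

Each input stops at a different layer: the empty string at the base, the plain sentence at the structure check, and the JSON object passes everything. The message returned points directly at the layer that needs attention.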
The Specificity Spectrum
Balance between too loose and too strict:

    # Too loose (might pass bad responses)
    assertions:
      - type: not_empty

    # Too strict (might fail good responses)
    assertions:

    # Just right (validates meaning, allows variation)
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]

Guidelines:
- Start specific, loosen as needed
- Add constraints incrementally
- Test with real LLM variations
- Balance precision and recall
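To make the trade-off concrete, a validator can be scored against a small set of known good and known bad responses. The Python sketch below does this for a loose, a strict, and a balanced check, using sample responses of the kind used elsewhere in this guide; the three validator functions are hypothetical simplifications, not PromptArena assertion types.

```python
from typing import Callable

good = [
    "Paris is the capital of France",
    "France's capital city is Paris",
    "The capital of France is Paris",
]
bad = ["London is the capital", "France is a country", ""]

validators: dict[str, Callable[[str], bool]] = {
    "too_loose": lambda t: bool(t.strip()),                      # any non-empty text
    "too_strict": lambda t: t == "The capital of France is Paris",  # exact wording
    "balanced": lambda t: "Paris" in t,                          # key fact only
}

def score(check: Callable[[str], bool]) -> tuple[int, int]:
    """Return (good responses rejected, bad responses accepted)."""
    false_rejects = sum(not check(t) for t in good)
    false_accepts = sum(check(t) for t in bad)
    return false_rejects, false_accepts

for name, check in validators.items():
    print(name, score(check))
```

The loose check accepts two of the bad responses, the strict check rejects two of the good ones, and the balanced check makes neither mistake on this sample. A larger and more varied sample gives a more trustworthy picture.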
The Safety Net Pattern
Multiple validations to catch different failures:

    turns:
      - role: user
        content: "Ask a question"
        assertions:
          # Content safety net
          - type: content_matches
            params:
              pattern: "(answer1|answer2|answer3)"
              message: "Should contain one of the expected answers"

          # Format safety net
          - type: is_valid_json
            params:
              message: "Should return valid JSON"

          - type: json_path
            params:
              jmespath_expression: "required_field"
              message: "Should have required field"

The Progressive Validation Pattern
Validate incrementally through conversation:

    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Scenario
    metadata:
      name: progressive-validation

    spec:
      task_type: test
      description: "Progressive Validation"

      turns:
        # Turn 1: Establish baseline
        - role: user
          content: "Start order"
          assertions:
            - type: content_includes
              params:
                patterns: ["order", "started"]
                message: "Should indicate order started"

        # Turn 2: Validate state progression
        - role: user
          content: "Add item"
          assertions:
            - type: content_includes
              params:
                patterns: ["item", "added"]
                message: "Should confirm item added"

        # Turn 3: Validate completion
        - role: user
          content: "Checkout"
          assertions:
            - type: content_includes
              params:
                patterns: ["order", "complete", "total", "confirmation"]
                message: "Should confirm order completion"

Advanced Techniques
Custom Validators
Write custom validation logic:

    assertions:
      - type: custom
        validator: check_business_hours
        args:
          timezone: "America/New_York"

Implementation:

    import re
    from datetime import datetime

    def check_business_hours(response: str, timezone: str) -> bool:
        # Note: `timezone` is accepted but not applied to the parsed time here
        # Extract the first HH:MM time from the response
        time_match = re.search(r'\d{1,2}:\d{2}', response)
        if not time_match:
            return False

        # Parse and validate; strings such as "25:99" are not valid times
        try:
            parsed = datetime.strptime(time_match.group(), '%H:%M')
        except ValueError:
            return False
        return 9 <= parsed.hour < 17  # 9 AM - 5 PM

Multiple Assertions
Combine multiple checks (all must pass):

    assertions:
      # All of these assertions must pass (implicit AND)
      - type: content_includes
        params:
          patterns: ["key_term"]
          message: "Must contain key term"

      - type: content_matches
        params:
          pattern: "^.{50,200}$"
          message: "Must be 50-200 characters"

      - type: llm_judge
        params:
          criteria: "Response has a positive tone"
          judge_provider: "openai/gpt-4o-mini"
          message: "Response should be positive"

      # For OR logic, use regex alternation:
      - type: content_matches
        params:
          pattern: "(option1|option2)"
          message: "Must contain option1 OR option2"

Context-Aware Validation
Validate based on context using separate scenarios:

    # Note: Arena doesn't support conditional assertions.
    # Instead, create separate scenarios for different contexts:

    # Scenario 1: Premium users
    - name: premium_user_support
      context:
        variables:
          user_tier: "premium"
      turns:
        - role: user
          content: "I need help"
        - role: assistant
          assertions:
            - type: content_includes
              params:
                patterns: ["priority support"]
                message: "Premium users should get priority support"

    # Scenario 2: Standard users
    - name: standard_user_support
      context:
        variables:
          user_tier: "standard"
      turns:
        - role: user
          content: "I need help"
        - role: assistant
          assertions:
            - type: content_includes
              params:
                patterns: ["standard support"]
                message: "Standard users should get standard support"

Statistical Validation
Validate across multiple runs:

    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Scenario
    metadata:
      name: statistical-test

    spec:
      task_type: test
      description: "Statistical Test"

      runs: 10  # Run 10 times
      # Note: Statistical validation would require running the scenario multiple times
      # and checking aggregate results. Arena doesn't have built-in statistical
      # validation, but you can run scenarios multiple times and analyze results.

Best Practices
1. Start Simple, Add Complexity

    # Start with basic validation
    assertions:
      - type: content_includes
        params:
          patterns: ["answer"]

    # Add semantic validation
    assertions:
      - type: content_includes
        params:
          patterns: ["answer"]

    # Add format validation
    assertions:
      - type: content_includes
        params:
          patterns: ["answer"]
      - type: is_valid_json
        value: true

2. Test Your Validations
Run validations against known good and bad responses:

    validation_tests:
      good_responses:
        - "Paris is the capital of France"
        - "France's capital city is Paris"
        - "The capital of France is Paris"

      bad_responses:
        - "London is the capital"
        - "France is a country"
        - ""

      assertions:
        - type: content_includes
          params:
            patterns: ["Paris"]

3. Use Descriptive Failure Messages
    assertions:
      - type: content_includes
        params:
          patterns: ["refund policy"]
          message: "Response must include refund policy details"

      - type: content_matches
        params:
          pattern: "^(?!.*(offensive|inappropriate)).*$"
          message: "Response must not contain inappropriate language"

4. Balance Precision and Recall
    # High precision (few false positives) - exact pattern
    assertions:
      - type: content_matches
        params:
          pattern: "^The specific answer is: [A-Z]$"
          message: "Must match exact format"

    # High recall (few false negatives) - matches any option
    assertions:
      - type: content_matches
        params:
          pattern: "(answer1|answer2|answer3)"
          message: "Must contain at least one answer"

    # Balanced - specific but flexible
    assertions:
      - type: content_includes
        params:
          patterns: ["answer", "option"]
          message: "Must mention both answer and option"

5. Document Validation Intent
    assertions:
      # Validate core requirement
      - type: content_includes
        params:
          patterns: ["Paris"]
          message: "Must correctly identify capital"

      # Validate safety
      - type: content_matches
        params:
          pattern: "^(?!.*offensive).*$"
          message: "Must maintain appropriate tone"

      # Validate format
      - type: is_valid_json
        params:
          message: "Output must be parseable JSON"

Common Pitfalls
Over-Specification

    # ❌ Too specific
    assertions:

    # ✅ Appropriately flexible
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]

Under-Specification

    # ❌ Too loose
    assertions:
      - type: content_matches
        params:
          pattern: ".+"
          message: "Must not be empty"

    # ✅ Adequately constrained
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris", "France"]
          message: "Must mention Paris and France"
      - type: content_matches
        params:
          pattern: "^.{10,}$"
          message: "Must be at least 10 characters"

Brittle Assertions
    # ❌ Breaks with minor changes
    assertions:
      - type: content_matches
        params:
          pattern: "^The answer is"
          message: "Must start with exact phrase"

    # ✅ Robust to variation
    assertions:
      - type: content_includes
        params:
          patterns: ["answer"]
          message: "Must mention answer"

Missing Negative Tests

    # ✅ Test both positive and negative
    assertions:
      # Must have
      - type: content_includes
        params:
          patterns: ["correct_info"]
          message: "Must contain correct information"

      # Must not have (use negative lookahead)
      - type: content_matches
        params:
          pattern: "^(?!.*(incorrect|harmful)).*$"
          message: "Must not contain incorrect or harmful content"

Validation Checklist
Before finalizing assertions, check:
- Tests core requirement (correctness)
- Allows legitimate variation (flexibility)
- Fails on actual errors (precision)
- Provides clear failure messages (debugging)
- Runs efficiently (performance)
- Works across providers (portability)
- Validates safety/compliance (security)
- Tests edge cases (robustness)
Conclusion
Effective validation:
- Tests behavior, not exact words
- Layers multiple validation types
- Balances precision and flexibility
- Fails clearly and helpfully
PromptArena provides powerful validation tools that enable robust testing while accommodating LLM variability.
Further Reading
- Testing Philosophy - Core testing principles
- Scenario Design - Designing effective scenarios
- Provider Comparison - Cross-provider testing
- Reference: Assertions - Complete assertion reference
- Reference: Validators - Validator documentation