Validation Strategies

Comprehensive guide to designing effective validation and assertion strategies for LLM testing.

LLM outputs are non-deterministic, so the same prompt can produce differently worded answers. Traditional exact-match testing doesn’t work:

# ❌ This will fail - too rigid
assertions:
  - type: content_matches
    params:
      pattern: "^The capital of France is Paris\\.$"
      message: "Exact match required"
# LLM might say:
# - "Paris is the capital of France."
# - "The capital of France is Paris, France."
# - "France's capital city is Paris."

The core challenge: Validate intent and correctness without demanding exact wording.
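To see why exact matching is brittle, the following Python sketch runs the three phrasings listed above through both the anchored pattern and a plain substring test. It is an illustration only: `passes_exact` and `passes_includes` are hypothetical stand-ins, and PromptArena's own matching logic may differ.

```python
import re

# The rigid pattern from the example above, and the phrasings an LLM might produce.
EXACT_PATTERN = r"^The capital of France is Paris\.$"
RESPONSES = [
    "Paris is the capital of France.",
    "The capital of France is Paris, France.",
    "France's capital city is Paris.",
]

def passes_exact(response: str) -> bool:
    """Hypothetical stand-in for an anchored exact-match check."""
    return re.search(EXACT_PATTERN, response) is not None

def passes_includes(response: str, term: str = "Paris") -> bool:
    """Hypothetical stand-in for a substring (content inclusion) check."""
    return term in response

exact_results = [passes_exact(r) for r in RESPONSES]
includes_results = [passes_includes(r) for r in RESPONSES]
print(exact_results)     # every correct answer is rejected
print(includes_results)  # every correct answer is accepted
```

All three responses are correct, yet only the substring check accepts them.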

Focus on what the response achieves, not how it’s phrased:

# ✅ Good: Tests behavior
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]
      message: "Should mention Paris"
# ❌ Bad: Tests exact wording
assertions:
  - type: content_matches
    params:
      pattern: "^The capital is Paris$"
      message: "Exact match"

Use multiple validation types from loose to strict:

assertions:
  # Layer 1: Basic content presence
  - type: content_includes
    params:
      patterns: ["key", "terms"]
  # Layer 2: Structural validation
  - type: is_valid_json
    params:
      message: "Must be valid JSON"
  # Layer 3: Schema validation
  - type: json_schema
    params:
      schema:
        type: object
        required: ["expected_field"]
  # Layer 4: Pattern matching
  - type: content_matches
    params:
      pattern: "business_rule_pattern"

Build assertions that accept legitimate variation:

# ✅ Flexible
assertions:
  - type: content_matches
    params:
      pattern: "(refund|money back|return funds)"
      message: "Should mention refund option"
# ❌ Too rigid
assertions:
  - type: content_includes
    params:
      patterns: ["refund policy"]
      message: "Must say exactly 'refund policy'"

Design assertions that fail with helpful messages:

assertions:
  - type: content_includes
    params:
      patterns: ["critical_info"]
      message: "Missing required policy information"
  - type: content_matches
    params:
      pattern: "^(?!.*(harmful|inappropriate)).*$"
      message: "Response contains inappropriate content"

Check for required content:

# Single term
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]
# Multiple terms (all must be present)
assertions:
  - type: content_includes
    params:
      patterns: ["Paris", "France", "capital"]
# Any term (at least one must be present)
assertions:
  - type: content_matches
    params:
      pattern: "(Paris|France's capital|French capital)"

Use when:

  • Testing for required information
  • Verifying key terms appear
  • Checking compliance with instructions

Limitations:

  • Doesn’t validate meaning
  • Can’t detect a term used in the wrong context
  • Doesn’t check word order
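The first limitation is easy to demonstrate. The sketch below assumes all-of semantics for a multi-term inclusion check, as described above; `includes_all` is a hypothetical helper, not PromptArena's implementation.

```python
def includes_all(response: str, patterns: list[str]) -> bool:
    """True only if every term appears somewhere in the response."""
    return all(p in response for p in patterns)

terms = ["Paris", "France", "capital"]
correct = "Paris is the capital of France."
wrong = "Paris is not the capital of France."

# Both responses contain every term, so both pass,
# even though the second one states the opposite.
print(includes_all(correct, terms))
print(includes_all(wrong, terms))
```

A check like this confirms that terms are present, not that the statement built from them is true, which is why it is usually paired with other assertion types.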

Pattern matching for structured content:

# Phone number format
assertions:
  - type: content_matches
    params:
      pattern: "\\+?1?\\d{9,15}"
# Email address
assertions:
  - type: content_matches
    params:
      pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
# Date format (YYYY-MM-DD)
assertions:
  - type: content_matches
    params:
      pattern: "\\d{4}-\\d{2}-\\d{2}"

Use when:

  • Validating format compliance
  • Extracting structured data
  • Checking pattern adherence

Best practices:

  • Keep patterns simple
  • Use anchors (^, $) carefully
  • Test pattern against variations
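One way to follow the last two practices is to run a pattern over a few sample responses before putting it in a scenario. The sketch below uses Python's `re` module and the date pattern from above; the sample strings are invented for illustration, and the regex engine used by PromptArena may behave differently.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"

# Sample response -> whether we expect it to contain a YYYY-MM-DD date.
samples = {
    "2024-03-15": True,                       # bare date
    "Your order ships on 2024-03-15.": True,  # date embedded in a sentence
    "Ships 03/15/2024": False,                # wrong format
}

unanchored = {s: bool(re.search(DATE, s)) for s in samples}
anchored = {s: bool(re.search(f"^{DATE}$", s)) for s in samples}

print(unanchored == samples)  # the unanchored pattern matches expectations
print(anchored)               # anchoring rejects the date inside a sentence
```

Anchors turn a "contains" check into an "is exactly" check, which is rarely what a free-text response needs.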

Validate response length using regex patterns:

# Exact length (100 characters)
assertions:
  - type: content_matches
    params:
      pattern: "^.{100}$"
      message: "Response must be exactly 100 characters"
# Range (50-200 characters)
assertions:
  - type: content_matches
    params:
      pattern: "^.{50,200}$"
      message: "Response must be 50-200 characters"
# Maximum (conciseness test - up to 150 chars)
assertions:
  - type: content_matches
    params:
      pattern: "^.{1,150}$"
      message: "Response must be at most 150 characters"
# Minimum (completeness test - at least 50 chars)
assertions:
  - type: content_matches
    params:
      pattern: "^.{50,}$"
      message: "Response must be at least 50 characters"

Use when:

  • Enforcing conciseness
  • Ensuring completeness
  • Testing summarization
  • Validating character limits
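A caveat worth checking before relying on these patterns: in many regex engines `.` does not match a newline by default, so a multi-line response can fail a length pattern even when its character count is in range. The sketch below shows the behavior in Python's `re` module; whether PromptArena's engine behaves the same way is an assumption to verify.

```python
import re

LENGTH_RANGE = r"^.{50,200}$"

single_line = "x" * 120
two_lines = "x" * 60 + "\n" + "x" * 60  # 121 characters in total

print(bool(re.search(LENGTH_RANGE, single_line)))           # in range, matches
print(bool(re.search(LENGTH_RANGE, two_lines)))             # in range, but no match
print(bool(re.search(LENGTH_RANGE, two_lines, re.DOTALL)))  # matches once "." spans newlines
print(50 <= len(two_lines) <= 200)                          # a direct length check
```

If responses may contain line breaks, a dot-matches-newline flag or a direct length check in a custom validator avoids the false failure.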

Semantic validation can be implemented using custom validators or by combining multiple content assertions:

turns:
  - role: user
    content: "What's the capital of France?"
    assertions:
      - type: content_includes
        params:
          patterns: ["Paris"]
          message: "Should mention Paris"
      - type: content_matches
        params:
          pattern: "(?i)(capital|city)"
          message: "Should reference capital/city"

Use when:

  • Testing paraphrased responses
  • Validating key information is present
  • Checking for contextually relevant terms

Sentiment and tone can be checked using pattern matching:

turns:
  - role: user
    content: "I'm frustrated with this issue"
    assertions:
      - type: content_matches
        params:
          pattern: "(?i)(understand|help|sorry|apologize)"
          message: "Should show empathy"
      - type: content_includes
        params:
          patterns: ["assist", "resolve"]
          message: "Should offer assistance"

Use when:

  • Testing customer support tone
  • Validating empathy
  • Checking brand voice
  • Ensuring professional language

Validate JSON structure:

# Valid JSON
assertions:
  - type: is_valid_json
    params:
      message: "Response must be valid JSON"
# JSON with schema
assertions:
  - type: json_schema
    params:
      schema:
        type: object
        properties:
          name:
            type: string
          age:
            type: integer
        required: [name, age]
      message: "Response must match schema"

Use when:

  • Testing structured output
  • Validating API responses
  • Checking data extraction

Example:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: extract-user-data
spec:
  task_type: extraction
  description: "Extract User Data"
  turns:
    - role: user
      content: "Extract: John Doe, age 30, john@example.com"
      assertions:
        - type: is_valid_json
          params:
            message: "Should return valid JSON"
        - type: json_schema
          params:
            schema:
              type: object
              properties:
                name: {type: string}
                age: {type: integer}
                email: {type: string, format: email}
              required: [name, age, email]
            message: "Should match user schema"
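For readers who want to see what these two assertions amount to, here is a rough standard-library sketch of the same checks: parse the response as JSON, then verify required fields and their types. `check_user_json` and `REQUIRED_FIELDS` are hypothetical names; this is not how PromptArena implements `json_schema`, and it does not cover the `format: email` constraint.

```python
import json

# Field name -> expected Python type, mirroring the schema in the example above.
REQUIRED_FIELDS = {"name": str, "age": int, "email": str}

def check_user_json(response: str) -> list[str]:
    """Return a list of problems; an empty list means the response passed."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif not isinstance(data[field], expected) or isinstance(data[field], bool):
            problems.append(f"wrong type for {field}")
    return problems

print(check_user_json('{"name": "John Doe", "age": 30, "email": "john@example.com"}'))
print(check_user_json('{"name": "John Doe", "age": "30"}'))
```

Returning a list of problems rather than a single boolean keeps the failure message specific, in line with the advice on helpful messages.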

Validate lists in responses:

turns:
  - role: user
    content: "List the top items"
    assertions:
      # Check that all items appear, in this order
      - type: content_matches
        params:
          pattern: "item1.*item2.*item3"
          message: "Should contain all items in order"
      # Check for any option
      - type: content_matches
        params:
          pattern: "(option1|option2)"
          message: "Should contain at least one option"

Use when:

  • Testing enumeration tasks
  • Validating option lists
  • Checking recommendations

Validate specific formats using pattern matching:

assertions:
  # Markdown (check for markdown syntax)
  - type: content_matches
    params:
      pattern: "(^#{1,6} |\\*\\*|\\*|`|\\[.*\\]\\(.*\\))"
      message: "Response should contain markdown formatting"
  # HTML (check for HTML tags)
  - type: content_matches
    params:
      pattern: "<[^>]+>"
      message: "Response should contain HTML tags"
  # Code block (check for code fence)
  - type: content_matches
    params:
      pattern: "```python[\\s\\S]*?```"
      message: "Response should contain Python code block"

Test what should NOT appear using negative lookahead patterns:

assertions:
  # Must not contain specific words (use negative lookahead)
  - type: content_matches
    params:
      pattern: "^(?!.*(inappropriate|offensive|harmful)).*$"
      message: "Response must not contain inappropriate content"
  # Must not match sensitive data pattern
  - type: content_matches
    params:
      pattern: "^(?!.*\\b(password|secret|api[_-]?key)\\b).*$"
      message: "Response must not contain sensitive data keywords"
# For conversation-level "not contains" checks, use a conversation-level assertion:
# - type: content_not_includes
#   params:
#     patterns: ["inappropriate", "offensive"]

Use when:

  • Testing content filtering
  • Preventing data leakage
  • Validating safety guardrails
  • Checking compliance
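Outside of a regex-only assertion, a negative check is often simpler to express as "search for the forbidden pattern and require no match". The sketch below contrasts that with the lookahead form in Python's `re` module; `is_clean` is a hypothetical helper, and PromptArena's regex engine may treat newlines differently.

```python
import re

FORBIDDEN = re.compile(r"\b(password|secret|api[_-]?key)\b", re.IGNORECASE)
LOOKAHEAD = r"^(?!.*\b(password|secret|api[_-]?key)\b).*$"

def is_clean(response: str) -> bool:
    """Pass only when no forbidden keyword appears anywhere in the response."""
    return FORBIDDEN.search(response) is None

clean_multiline = "Here is your summary.\nNo credentials are included."
leaky = "Here is your summary.\nThe password is hunter2."

print(is_clean(clean_multiline))  # passes
print(is_clean(leaky))            # fails, as intended

# In Python, the lookahead form rejects the clean two-line response as well,
# because "." and "$" stop at the first line break without extra flags.
print(re.search(LOOKAHEAD, clean_multiline) is not None)
```

Negative lookahead patterns are therefore worth testing against multi-line responses before they are trusted as a safety check.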

Example:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: no-pii-leakage
spec:
  task_type: security
  description: "No PII Leakage"
  turns:
    - role: user
      content: "Summarize the customer record"
      assertions:
        - type: content_matches
          params:
            pattern: "^(?!.*\\d{3}-\\d{2}-\\d{4}).*$"
            message: "Should not contain SSN"
        - type: content_matches
          params:
            pattern: "^(?!.*\\d{16}).*$"
            message: "Should not contain credit card"
        - type: content_matches
          params:
            pattern: "^(?!.*(password|secret)).*$"
            message: "Should not contain sensitive keywords"

Validate conversation coherence:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: context-retention
spec:
  task_type: test
  description: "Context Retention"
  turns:
    - role: user
      content: "My name is Alice"
      assertions:
        - type: content_includes
          params:
            patterns: ["Alice"]
            message: "Should acknowledge the name"
    - role: user
      content: "What's my name?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Alice"]
            message: "Should remember the name"

Validation types:

assertions:
# References earlier context
turn_index: 0
# Maintains consistency
- type: consistent_with_turn
turn_index: 0
# State progression
- type: state_changed
from: "initial"
to: "confirmed"

Layer validations from basic to advanced:

assertions:
  # Base: Basic presence
  - type: content_matches
    params:
      pattern: ".+"
      message: "Response must not be empty"
  # Level 2: Content presence
  - type: content_includes
    params:
      patterns: ["required", "terms"]
      message: "Must contain required terms"
  # Level 3: Structure
  - type: is_valid_json
    params:
      message: "Response must be valid JSON"
  # Level 4: Semantics
  - type: llm_judge
    params:
      criteria: "Response is semantically appropriate for the query"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must be semantically appropriate"
  # Level 5: Business logic
  - type: llm_judge
    params:
      criteria: "Response follows business rules and policies"
      judge_provider: "openai/gpt-4o-mini"
      message: "Must follow business rules"

Benefits:

  • Fast failure on basic issues
  • Detailed validation only if basics pass
  • Clear failure diagnostics
  • Efficient test execution
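The fail-fast idea behind these benefits can be sketched in a few lines of Python: order checks from cheap to expensive and stop at the first failure. The function names are hypothetical, and whether PromptArena itself short-circuits after a failed assertion is not stated here, so treat this as an illustration of the principle only.

```python
import json
from typing import Callable, Optional

# A check returns an error message, or None when the response passes.
Check = Callable[[str], Optional[str]]

def not_empty(response: str) -> Optional[str]:
    return None if response.strip() else "Response must not be empty"

def has_terms(response: str) -> Optional[str]:
    missing = [t for t in ("required", "terms") if t not in response]
    return f"Missing terms: {missing}" if missing else None

def valid_json(response: str) -> Optional[str]:
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return "Response must be valid JSON"
    return None

# Cheapest checks first; an expensive LLM-judge call would go last.
LAYERS: list[Check] = [not_empty, has_terms, valid_json]

def first_failure(response: str) -> Optional[str]:
    """Run layers in order and stop at the first one that fails."""
    for check in LAYERS:
        error = check(response)
        if error is not None:
            return error
    return None

print(first_failure("hello"))
print(first_failure('{"note": "required terms"}'))
```

Because the loop returns on the first error, later and costlier layers never run for a response that fails a basic one.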

Balance between too loose and too strict:

# Too loose (might pass bad responses)
assertions:
  - type: not_empty
# Too strict (might fail good responses)
assertions:
# Just right (validates meaning, allows variation)
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]

Guidelines:

  • Start specific, loosen as needed
  • Add constraints incrementally
  • Test with real LLM variations
  • Balance precision and recall

Multiple validations to catch different failures:

turns:
  - role: user
    content: "Ask a question"
    assertions:
      # Content safety net
      - type: content_matches
        params:
          pattern: "(answer1|answer2|answer3)"
          message: "Should contain one of the expected answers"
      # Format safety net
      - type: is_valid_json
        params:
          message: "Should return valid JSON"
      - type: json_path
        params:
          jmespath_expression: "required_field"
          message: "Should have required field"

Validate incrementally through conversation:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: progressive-validation
spec:
  task_type: test
  description: "Progressive Validation"
  turns:
    # Turn 1: Establish baseline
    - role: user
      content: "Start order"
      assertions:
        - type: content_includes
          params:
            patterns: ["order", "started"]
            message: "Should indicate order started"
    # Turn 2: Validate state progression
    - role: user
      content: "Add item"
      assertions:
        - type: content_includes
          params:
            patterns: ["item", "added"]
            message: "Should confirm item added"
    # Turn 3: Validate completion
    - role: user
      content: "Checkout"
      assertions:
        - type: content_includes
          params:
            patterns: ["order", "complete", "total", "confirmation"]
            message: "Should confirm order completion"
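The same per-turn idea can be tried offline against a recorded transcript. The sketch below is a minimal illustration: the replies are invented, `failing_turns` is a hypothetical helper, and the case-insensitive comparison is an assumption (the source does not say whether `content_includes` is case-sensitive).

```python
# Hypothetical recorded assistant replies, one per user turn.
replies = [
    "Your order has been started.",
    "The item was added to your cart.",
    "Order complete. Your total is $20; confirmation #123.",
]

# Terms each reply must contain, mirroring the per-turn assertions above.
expected_terms = [
    ["order", "started"],
    ["item", "added"],
    ["order", "complete", "total", "confirmation"],
]

def failing_turns(replies, expected_terms):
    """Return (turn_index, missing_terms) for every turn that misses a term."""
    failures = []
    for index, (reply, terms) in enumerate(zip(replies, expected_terms)):
        lowered = reply.lower()  # assumption: compare case-insensitively
        missing = [t for t in terms if t not in lowered]
        if missing:
            failures.append((index, missing))
    return failures

print(failing_turns(replies, expected_terms))
```

Reporting the turn index alongside the missing terms shows where in the conversation the expected state progression broke.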

Write custom validation logic:

assertions:
  - type: custom
    validator: check_business_hours
    args:
      timezone: "America/New_York"

Implementation:

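import re
from datetime import datetime

def check_business_hours(response: str, timezone: str) -> bool:
    # Extract the first H:MM or HH:MM time from the response
    time_match = re.search(r'\d{1,2}:\d{2}', response)
    if not time_match:
        return False
    # Parse and validate; an out-of-range time such as "25:00" fails the check
    try:
        time = datetime.strptime(time_match.group(), '%H:%M')
    except ValueError:
        return False
    # The time is taken as written; `timezone` is accepted but not applied here
    return 9 <= time.hour < 17  # 9 AM - 5 PM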

Combine multiple checks (all must pass):

assertions:
  # All of these assertions must pass (implicit AND)
  - type: content_includes
    params:
      patterns: ["key_term"]
      message: "Must contain key term"
  - type: content_matches
    params:
      pattern: "^.{50,200}$"
      message: "Must be 50-200 characters"
  - type: llm_judge
    params:
      criteria: "Response has a positive tone"
      judge_provider: "openai/gpt-4o-mini"
      message: "Response should be positive"
  # For OR logic, use regex alternation:
  - type: content_matches
    params:
      pattern: "(option1|option2)"
      message: "Must contain option1 OR option2"

Validate based on context using separate scenarios:

# Note: Arena doesn't support conditional assertions.
# Instead, create separate scenarios for different contexts:
# Scenario 1: Premium users
- name: premium_user_support
  context:
    variables:
      user_tier: "premium"
  turns:
    - role: user
      content: "I need help"
    - role: assistant
      assertions:
        - type: content_includes
          params:
            patterns: ["priority support"]
            message: "Premium users should get priority support"
# Scenario 2: Standard users
- name: standard_user_support
  context:
    variables:
      user_tier: "standard"
  turns:
    - role: user
      content: "I need help"
    - role: assistant
      assertions:
        - type: content_includes
          params:
            patterns: ["standard support"]
            message: "Standard users should get standard support"

Validate across multiple runs:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: statistical-test
spec:
  task_type: test
  description: "Statistical Test"
  runs: 10 # Run 10 times
  # Note: Statistical validation would require running the scenario multiple times
  # and checking aggregate results. Arena doesn't have built-in statistical
  # validation, but you can run scenarios multiple times and analyze results.

# Start with basic validation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
# Add semantic validation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
# Add format validation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
  - type: is_valid_json
    params:
      message: "Response must be valid JSON"
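The multi-run scenario above (`runs: 10`) leaves aggregation to the reader. One simple way to analyze repeated runs is to compute a pass rate and compare it with a threshold. The sketch below assumes the outputs of each run have already been collected as strings; `pass_rate`, the sample outputs, and the 0.8 threshold are all illustrative assumptions, not Arena features.

```python
def pass_rate(outputs: list[str], check) -> float:
    """Fraction of outputs for which check(output) is truthy."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(1 for o in outputs if check(o)) / len(outputs)

# Hypothetical outputs collected from 10 runs of the same scenario.
outputs = ["The answer is 42."] * 8 + ["I am not sure.", "Could you rephrase?"]

rate = pass_rate(outputs, lambda o: "answer" in o)
MIN_PASS_RATE = 0.8  # assumed threshold; choose one that suits the use case

print(rate, rate >= MIN_PASS_RATE)
```

A threshold below 100% acknowledges that a non-deterministic model will occasionally miss, while still catching a prompt that fails most of the time.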

Run validations against known good/bad responses:

validation_tests:
  good_responses:
    - "Paris is the capital of France"
    - "France's capital city is Paris"
    - "The capital of France is Paris"
  bad_responses:
    - "London is the capital"
    - "France is a country"
    - ""
  assertions:
    - type: content_includes
      params:
        patterns: ["Paris"]

assertions:
  - type: content_includes
    params:
      patterns: ["refund policy"]
      message: "Response must include refund policy details"
  - type: content_matches
    params:
      pattern: "^(?!.*(offensive|inappropriate)).*$"
      message: "Response must not contain inappropriate language"

# High precision (few false positives) - exact pattern
assertions:
  - type: content_matches
    params:
      pattern: "^The specific answer is: [A-Z]$"
      message: "Must match exact format"
# High recall (few false negatives) - matches any option
assertions:
  - type: content_matches
    params:
      pattern: "(answer1|answer2|answer3)"
      message: "Must contain at least one answer"
# Balanced - specific but flexible
assertions:
  - type: content_matches
    params:
      pattern: "(?i)(answer|option)"
      message: "Must discuss answer or option"

assertions:
  # Validate core requirement
  - type: content_includes
    params:
      patterns: ["Paris"]
      message: "Must correctly identify capital"
  # Validate safety
  - type: content_matches
    params:
      pattern: "^(?!.*offensive).*$"
      message: "Must maintain appropriate tone"
  # Validate format
  - type: is_valid_json
    params:
      message: "Output must be parseable JSON"

# ❌ Too specific
assertions:
# ✅ Appropriately flexible
assertions:
  - type: content_includes
    params:
      patterns: ["Paris"]

# ❌ Too loose
assertions:
  - type: content_matches
    params:
      pattern: ".+"
      message: "Must not be empty"
# ✅ Adequately constrained
assertions:
  - type: content_includes
    params:
      patterns: ["Paris", "France"]
      message: "Must mention Paris and France"
  - type: content_matches
    params:
      pattern: "^.{10,}$"
      message: "Must be at least 10 characters"

# ❌ Breaks with minor changes
assertions:
  - type: content_matches
    params:
      pattern: "^The answer is"
      message: "Must start with exact phrase"
# ✅ Robust to variation
assertions:
  - type: content_includes
    params:
      patterns: ["answer"]
      message: "Must mention answer"

# ✅ Test both positive and negative
assertions:
  # Must have
  - type: content_includes
    params:
      patterns: ["correct_info"]
      message: "Must contain correct information"
  # Must not have (use negative lookahead)
  - type: content_matches
    params:
      pattern: "^(?!.*(incorrect|harmful)).*$"
      message: "Must not contain incorrect or harmful content"
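The known good and bad responses listed earlier can also be used to put numbers on the precision and recall trade-off. The sketch below treats "the check passes" as a positive prediction and scores three candidate checks against those lists. The checks are plain Python stand-ins for assertions, and `precision_recall` is a hypothetical helper.

```python
import re

good = [
    "Paris is the capital of France",
    "France's capital city is Paris",
    "The capital of France is Paris",
]
bad = ["London is the capital", "France is a country", ""]

def precision_recall(check, good, bad):
    """Score a check: passing is a positive prediction, 'good' is the true label."""
    true_pos = sum(1 for r in good if check(r))
    false_pos = sum(1 for r in bad if check(r))
    passed = true_pos + false_pos
    precision = true_pos / passed if passed else 1.0
    recall = true_pos / len(good) if good else 1.0
    return precision, recall

def strict(r):    # exact wording only
    return re.search(r"^The capital of France is Paris$", r) is not None

def flexible(r):  # key fact only
    return "Paris" in r

def loose(r):     # topic word only
    return "capital" in r

for name, check in [("strict", strict), ("flexible", flexible), ("loose", loose)]:
    print(name, precision_recall(check, good, bad))
```

On this small sample the strict check rejects two of the three good answers, the loose check lets a wrong answer through, and the flexible check does neither. A larger and more varied sample would give a more reliable picture.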

Before finalizing assertions, check:

  • Tests core requirement (correctness)
  • Allows legitimate variation (flexibility)
  • Fails on actual errors (precision)
  • Provides clear failure messages (debugging)
  • Runs efficiently (performance)
  • Works across providers (portability)
  • Validates safety/compliance (security)
  • Tests edge cases (robustness)

Effective validation:

  • Tests behavior, not exact words
  • Layers multiple validation types
  • Balances precision and flexibility
  • Fails clearly and helpfully

PromptArena provides powerful validation tools that enable robust testing while accommodating LLM variability.