Scenario Design Principles
Understanding how to design effective, maintainable LLM test scenarios.
What Makes a Good Test Scenario?
A well-designed test scenario is:
- Clear: Purpose is obvious
- Focused: Tests one thing well
- Realistic: Models actual use cases
- Maintainable: Easy to update
- Robust: Handles LLM variability
Scenario Anatomy
The Building Blocks
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: descriptive-name
  tags: [category, priority]
spec:
  task_type: support      # Links to prompt configuration
  fixtures:               # Test-specific data
    user_tier: "premium"
  turns:
    - role: user
      content: "User message"
      assertions:         # Quality criteria
        - type: content_includes
          params:
            patterns: ["expected content"]
          message: "Should include expected content"
```
Each element serves a purpose:
- apiVersion/kind: Schema compatibility and resource type
- metadata: Identifies the scenario with name and tags
- task_type: Connects to prompt configuration
- fixtures: Reusable test data and variables
- turns: Conversation exchanges
- assertions: Quality criteria with params and messages
Design Patterns
Pattern 1: Single Responsibility
Each test case should validate one specific behavior:
```yaml
# ✅ Good: Tests one thing
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: greeting-response
spec:
  turns:
    - role: user
      content: "Hello"
      assertions:
        - type: content_includes
          params:
            patterns: ["hello"]
          message: "Should greet back"
        - type: content_matches
          params:
            pattern: "(?i)(hi|hello|welcome|nice)"
          message: "Should use friendly language"
```

```yaml
# ❌ Avoid: Tests multiple unrelated things
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: everything-test
spec:
  turns:
    - role: user
      content: "Hello"
    - role: user
      content: "What's your refund policy?"
    - role: user
      content: "How do I contact support?"
    - role: user
      content: "What are your hours?"
```
Why: Single-responsibility tests are:
- Easier to debug when they fail
- More maintainable
- Better for regression testing
- Clearer in intent
Pattern 2: Arrange-Act-Assert
Structure each turn with clear phases:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: order-issue-test
spec:
  fixtures:
    order_id: "12345"
    order_status: "shipped"
  turns:
    # Arrange: Set up context (via fixtures)
    # Act: LLM responds to the user message
    # Assert: Verify behavior
    - role: user
      content: "I'm having an issue with my order"
      assertions:
        - type: content_includes
          params:
            patterns: ["12345"]
          message: "Should reference order ID"
```
Pattern 3: Progressive Complexity
Start simple, build up:
```yaml
# Level 1: Basic interaction
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: simple-greeting
spec:
  turns:
    - role: user
      content: "Hi"
```

```yaml
# Level 2: With fixtures
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: personalized-greeting
spec:
  fixtures:
    user_name: "Alice"
  turns:
    - role: user
      content: "Hi"
      assertions:
        - type: content_includes
          params:
            patterns: ["Alice"]
          message: "Should use user name"
```

```yaml
# Level 3: Multi-turn
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: greeting-conversation
spec:
  turns:
    - role: user
      content: "Hi, I'm Alice"
    - role: user
      content: "What's your name?"
    - role: user
      content: "Nice to meet you"
```
Pattern 4: Edge Case Coverage
Systematically test boundaries:
```yaml
# Happy path
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: standard-input
spec:
  turns:
    - role: user
      content: "What are your hours?"
```

```yaml
# Empty input
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: empty-message
  tags: [edge-case]
spec:
  turns:
    - role: user
      content: ""
      assertions:
        - type: error_handled
          message: "Should handle empty input"
```

```yaml
# Very long input
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: long-message
  tags: [edge-case]
spec:
  fixtures:
    long_text: "Very long message..." # 10k chars
  turns:
    - role: user
      content: "{{fixtures.long_text}}"
      assertions:
        - type: response_received
          message: "Should handle long input"
```

```yaml
# Special characters
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: special-characters
  tags: [edge-case]
spec:
  turns:
    - role: user
      content: "Hello <script>alert('test')</script>"
      assertions:
        - type: content_matches
          params:
            pattern: "^(?!.*<script>).*$"
          message: "Must not echo script tags"
```

```yaml
# Multiple languages
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: non-english-input
  tags: [edge-case, i18n]
spec:
  turns:
    - role: user
      content: "¿Cuáles son sus horas?"
      assertions:
        - type: llm_judge
          params:
            criteria: "Response is in Spanish or English"
            judge_provider: "openai/gpt-4o-mini"
          message: "Must respond in appropriate language"
```
Scenario Organization
File Structure
Organize scenarios logically:

```
scenarios/
├── smoke/               # Quick validation
│   └── basic.yaml
├── customer-support/    # Feature area
│   ├── greetings.yaml
│   ├── billing.yaml
│   └── technical.yaml
├── edge-cases/          # Special cases
│   ├── input-validation.yaml
│   └── error-handling.yaml
└── regression/          # Known issues
    └── bug-fixes.yaml
```
Naming Conventions
Files: feature-area.yaml

```
customer-support.yaml
order-management.yaml
account-settings.yaml
```

Test Cases: "Action/State - Variation"

```yaml
- name: "Greeting - First Time User"
- name: "Greeting - Returning Customer"
- name: "Refund Request - Within Policy"
- name: "Refund Request - Outside Policy"
```

Tags: [category, priority, type]

```yaml
tags: [customer-service, high-priority, multi-turn]
tags: [billing, critical, regression]
tags: [onboarding, low-priority, smoke]
```
Context Management
When to Use Context
Context provides state for the LLM:
```yaml
# User profile context
context:
  user:
    name: "Alice"
    tier: "premium"
    account_age_days: 730
  current_session:
    device: "mobile"
    location: "US-CA"

turns:
  - role: user
    content: "What benefits do I have?"
    # LLM can use context in response
```
Context vs. Conversation History
Context: Explicit state

```yaml
context:
  order_id: "12345"
  order_status: "shipped"
```

History: Implicit from previous turns

```yaml
turns:
  - role: user
    content: "My order number is 12345"
  - role: user
    content: "What's the status?" # Refers to previous turn
```

When to use each:
- Use context for: Known state, test fixtures, environment data
- Use history for: Natural conversation flow, context retention testing (see the combined sketch below)
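The two approaches can also be combined. Here is a minimal sketch (the scenario name and fixture values are illustrative) that pins known state in fixtures while still exercising context retention across turns:

```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: context-plus-history
  tags: [multi-turn]
spec:
  fixtures:
    order_id: "12345"       # Known state, explicit via fixtures
    order_status: "shipped"
  turns:
    - role: user
      content: "Hi, I'm checking on order 12345"
    - role: user
      content: "Has it shipped yet?" # Relies implicitly on the previous turn
      assertions:
        - type: content_includes
          params:
            patterns: ["shipped"]
          message: "Should combine fixture state with conversation history"
```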
Fixtures for Reusability
Define common data once:
```yaml
fixtures:
  premium_user:
    tier: "premium"
    features: ["priority_support", "advanced_analytics"]
  basic_user:
    tier: "basic"
    features: ["standard_support"]
  long_text: |
    Lorem ipsum dolor sit amet... (1000+ words)

turns:
  - name: "Premium User Support"
    context:
      user: ${fixtures.premium_user}
  - name: "Basic User Support"
    context:
      user: ${fixtures.basic_user}
```
Assertion Design
Layered Assertions
Apply multiple validation levels:
```yaml
assertions:
  # Layer 1: Structure
  - type: content_matches
    params:
      pattern: ".+"
    message: "Response must not be empty"
  - type: content_matches
    params:
      pattern: "^.{1,500}$"
    message: "Response must be under 500 characters"

  # Layer 2: Content
  - type: content_includes
    params:
      patterns: ["key information"]
    message: "Must contain key information"

  # Layer 3: Quality
  - type: llm_judge
    params:
      criteria: "Response has positive sentiment"
      judge_provider: "openai/gpt-4o-mini"
    message: "Must be positive"
  - type: llm_judge
    params:
      criteria: "Response maintains professional tone"
      judge_provider: "openai/gpt-4o-mini"
    message: "Must be professional"

  # Layer 4: Business Logic
  - type: llm_judge
    params:
      criteria: "Response complies with brand guidelines"
      judge_provider: "openai/gpt-4o-mini"
    message: "Must meet brand compliance"
```
Assertion Specificity Spectrum
Choose the right level:
```yaml
# Too loose (accepts anything)
assertions:
  - type: response_received

# Appropriate (validates behavior)
assertions:
  - type: content_includes
    params:
      patterns: ["refund", "policy"]
    message: "Should mention refund policy"

# Too strict (brittle)
assertions:
  - type: content_matches
    params:
      pattern: "^Our refund policy allows returns within 30 days\\.$"
    message: "Exact match - too rigid"
```
Negative Assertions
Test what should NOT happen:
```yaml
assertions:
  # Should not mention competitors (negative lookahead)
  - type: content_matches
    params:
      pattern: "^(?!.*(CompetitorA|CompetitorB)).*$"
    message: "Must not mention competitors"

  # Should not be negative
  - type: llm_judge
    params:
      criteria: "Response does not have negative sentiment"
      judge_provider: "openai/gpt-4o-mini"
    message: "Must not be negative"

  # Should not use inappropriate language
  - type: content_matches
    params:
      pattern: "^(?!.*(inappropriate|offensive)).*$"
    message: "Must not contain inappropriate language"
```
Multi-Turn Design
Section titled “Multi-Turn Design”Conversation Flow
Design natural progressions:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: support-ticket-resolution
  tags: [multi-turn, support]
spec:
  task_type: test
  description: "Support Ticket Resolution"
  turns:
    # 1. Problem statement
    - role: user
      content: "I can't log into my account"
      assertions:
        - type: content_includes
          params:
            patterns: ["help", "account"]

    # 2. Information gathering
    - role: user
      content: "I get an 'invalid password' error"
      assertions:
        - type: content_includes
          params:
            patterns: ["password", "reset"]
          message: "Should reference password reset"

    # 3. Solution attempt
    - role: user
      content: "I tried resetting but didn't get the email"
      assertions:
        - type: content_includes
          params:
            patterns: ["spam", "check", "resend"]

    # 4. Resolution
    - role: user
      content: "Found it in spam, thank you!"
      assertions:
        - type: llm_judge
          params:
            criteria: "Response acknowledges resolution positively"
            judge_provider: "openai/gpt-4o-mini"
          message: "Must have positive sentiment"
```
State Transitions
Test conversation state changes:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: booking-flow-state-machine
spec:
  task_type: test
  description: "Booking Flow State Machine"
  turns:
    # State: INIT → COLLECTING_DESTINATION
    - role: user
      content: "I want to book a flight"
      assertions:
        - type: content_includes
          params:
            patterns: ["destination"]

    # State: COLLECTING_DESTINATION → COLLECTING_DATE
    - role: user
      content: "To London"
      context:
        booking_state: "collecting_date"
      assertions:
        - type: content_includes
          params:
            patterns: ["London", "date", "when"]

    # State: COLLECTING_DATE → CONFIRMING
    - role: user
      content: "Next Friday"
      context:
        booking_state: "confirming"
      assertions:
        - type: content_includes
          params:
            patterns: ["confirm", "London", "Friday"]
```
Branch Testing
Test conversation branches:
```yaml
# Path A: Customer satisfied
turns:
  - role: user
    content: "Issue with order"
  - role: user
    content: "Order #12345"
  - role: user
    content: "That solved it, thanks!"
    assertions:
      - type: llm_judge
        params:
          criteria: "Response shows customer satisfaction"
          judge_provider: "openai/gpt-4o-mini"
        message: "Must have positive sentiment"
```

```yaml
# Path B: Customer needs escalation
turns:
  - role: user
    content: "Issue with order"
  - role: user
    content: "Order #12345"
  - role: user
    content: "That doesn't help, I need a manager"
    assertions:
      - type: content_includes
        params:
          patterns: ["manager", "supervisor", "escalate"]
```
Performance Considerations
Section titled “Performance Considerations”Test Execution Speed
Balance coverage with speed:
```yaml
# Fast: Smoke tests (mock provider)
smoke_tests:
  runtime: < 30 seconds
  provider: mock
  scenarios: 10

# Medium: Integration tests
integration_tests:
  runtime: < 5 minutes
  provider: gpt-4o-mini
  scenarios: 50

# Slow: Comprehensive tests
comprehensive_tests:
  runtime: < 20 minutes
  provider: [gpt-4o, claude, gemini]
  scenarios: 200
```
Cost Optimization
Design cost-effective scenarios:
```yaml
# Expensive: Multiple providers, long conversations
- name: "Full Conversation Flow"
  providers: [gpt-4o, claude-opus, gemini-pro]
  turns: [10 multi-turn exchanges]
  # Cost: ~$0.50 per run

# Optimized: Targeted testing
- name: "Critical Path Only"
  providers: [gpt-4o-mini]
  turns: [3 key exchanges]
  # Cost: ~$0.05 per run
```
Strategies:
- Use mock providers for structure validation (see the sketch after this list)
- Use cheaper models (mini/flash) for regression tests
- Reserve expensive models for critical tests
- Batch similar tests together
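To illustrate the first strategy, here is a sketch of a structure-only smoke scenario (the name and tags are illustrative): its assertions make no content or quality claims, so it passes under a mock provider and runs in seconds.

```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: smoke-structure-check
  tags: [smoke, low-priority]
spec:
  turns:
    - role: user
      content: "Hello"
      assertions:
        # Structure-only check: any non-empty response passes,
        # which is all a mock provider needs to satisfy
        - type: content_matches
          params:
            pattern: ".+"
          message: "Response must not be empty"
```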
Maintenance Patterns
Versioning Scenarios
Track scenario changes:
```yaml
# Scenario metadata
metadata:
  version: "2.1"
  last_updated: "2024-01-15"
  author: "alice@example.com"
  changelog:
    - version: "2.1"
      date: "2024-01-15"
      changes: "Added tone validation"
    - version: "2.0"
      date: "2024-01-01"
      changes: "Restructured multi-turn flow"
```
Deprecation Strategy
Handle outdated tests:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: legacy-greeting-test
spec:
  task_type: test
  description: "Legacy Greeting Test"
  deprecated: true
  deprecated_reason: "Replaced by greeting-v2.yaml"
  skip: true
```

```yaml
# The active replacement
- name: "Current Greeting Test"
  tags: [active, v2]
```
DRY (Don’t Repeat Yourself)
Use templates and inheritance:
```yaml
# Base template
templates:
  support_base: &support_base
    tags: [customer-support]
    context:
      department: "support"

  helpful_check: &helpful_check
    type: llm_judge
    params:
      criteria: "Response is helpful and supportive"
      judge_provider: "openai/gpt-4o-mini"
    message: "Must be helpful"

  positive_check: &positive_check
    type: llm_judge
    params:
      criteria: "Response has positive sentiment"
      judge_provider: "openai/gpt-4o-mini"
    message: "Must be positive"

# Inherit template
turns:
  - name: "Billing Support"
    <<: *support_base          # Merge keys (<<) apply to mappings
    turns:
      - role: user
        content: "Question about my bill"
        assertions:            # YAML cannot merge lists, so alias shared items individually
          - *helpful_check
          - *positive_check
          - type: content_includes
            params:
              patterns: ["billing"]
```
Anti-Patterns to Avoid
❌ God Scenarios
```yaml
# Too much in one scenario
turns:
  # 50+ turns testing unrelated features
```
Fix: Break into focused scenarios, as sketched below.
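For instance, the catch-all above could split into per-feature scenarios like these (shown as two YAML documents; names and assertions are illustrative):

```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: refund-policy-question
  tags: [billing]
spec:
  turns:
    - role: user
      content: "What's your refund policy?"
      assertions:
        - type: content_includes
          params:
            patterns: ["refund"]
          message: "Should explain the refund policy"
---
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: support-contact-question
  tags: [customer-support]
spec:
  turns:
    - role: user
      content: "How do I contact support?"
      assertions:
        - type: content_includes
          params:
            patterns: ["support"]
          message: "Should provide contact options"
```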
❌ Flaky Assertions
```yaml
# Unreliable tests - too rigid
assertions:
  - type: content_matches
    params:
      pattern: "^Exactly this format$"
    message: "Exact format required - LLMs vary formatting"
```
Fix: Use flexible assertions
```yaml
# Better - flexible validation
assertions:
  - type: content_includes
    params:
      patterns: ["key", "terms"]
    message: "Must contain key terms"
  - type: llm_judge
    params:
      criteria: "Response follows expected format generally"
      judge_provider: "openai/gpt-4o-mini"
    message: "Should follow general format"
```
❌ Missing Context
```yaml
# Unclear purpose
turns:
  - role: user
    content: "something"
```
Fix: Add descriptive names and tags, as in the sketch below.
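One possible rewrite (the clarifying-question criterion is an illustrative choice, not the only valid fix):

```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: vague-input-clarification
  tags: [edge-case, input-validation]
spec:
  turns:
    - role: user
      content: "something"
      assertions:
        - type: llm_judge
          params:
            criteria: "Response asks a clarifying question"
            judge_provider: "openai/gpt-4o-mini"
          message: "Should ask for clarification on vague input"
```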
❌ Hardcoded Data
```yaml
# Brittle test data
turns:
  - role: user
    content: "My order is #12345 placed on 2024-01-01 for $99.99"
```
Fix: Use fixtures and context, as sketched below.
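A sketch of the fixed version, assuming the same {{fixtures.*}} interpolation used in the long-input example earlier:

```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: order-details-question
spec:
  fixtures:
    order_id: "12345"        # Test data lives in one place
    order_date: "2024-01-01"
    order_total: "$99.99"
  turns:
    - role: user
      content: "My order is #{{fixtures.order_id}} placed on {{fixtures.order_date}} for {{fixtures.order_total}}"
```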
Best Practices Summary
- One test, one purpose: Each scenario tests a specific behavior
- Use descriptive names: Make intent clear
- Tag appropriately: Enable filtering and organization
- Layer assertions: From structure to business logic
- Test edges: Cover happy path and edge cases
- Manage context: Use fixtures for reusability
- Design for maintenance: Version, document, refactor
- Balance cost and coverage: Optimize test execution
- Think in conversations: Model real user interactions
- Embrace variability: Write robust assertions for LLM behavior
Further Reading
- Testing Philosophy - Why we test LLMs this way
- Validation Strategies - Choosing assertions
- Provider Comparison - Testing across providers
- How-To: Write Scenarios - Practical guide