Write Test Scenarios

Learn how to create and structure test scenarios for LLM testing.

Overview

Test scenarios define the conversation flows, expected behaviors, and validation criteria for your LLM applications. Each scenario is a YAML file in the PromptPack format.

Quick Start with Templates

The fastest way to create scenarios is using the project generator:

# Create a new test project with sample scenario
promptarena init my-test --quick --provider mock

# This generates scenarios/basic-test.yaml for you
cd my-test
cat scenarios/basic-test.yaml

The generated scenario includes working examples of assertions, multi-turn conversations, and best practices. Use it as a starting point and customize for your needs.

Manual Scenario Creation

If you prefer to create scenarios from scratch or need custom structures, create a file such as scenarios/basic-test.yaml:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: basic-customer-inquiry
  labels:
    category: customer-service
    priority: basic

spec:
  task_type: support  # Links to prompt configuration
  description: "Tests customer support responses"
  
  # The conversation turns
  turns:
    - role: user
      content: "What are your business hours?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Monday"]
            message: "Should mention business days"
        - type: latency  # assertion type name assumed; use your framework's response-time check
          params:
            max_seconds: 3
            message: "Should respond quickly"
    
    - role: user
      content: "Do you offer weekend support?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Saturday"]
            message: "Should mention weekend availability"
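Once parsed (for example with PyYAML), a scenario is just nested mappings, which makes structural checks straightforward. The sketch below shows a minimal validation pass a test runner might perform; the required fields are inferred from the example above, not taken from the actual PromptPack schema.

```python
# Minimal structural check for a parsed scenario document. The rules here
# are illustrative assumptions; the real schema may enforce more.

def validate_scenario(doc: dict) -> list[str]:
    """Return a list of problems found; an empty list means the doc looks valid."""
    problems = []
    if doc.get("kind") != "Scenario":
        problems.append("kind must be 'Scenario'")
    if not doc.get("metadata", {}).get("name"):
        problems.append("metadata.name is required")
    turns = doc.get("spec", {}).get("turns", [])
    if not turns:
        problems.append("spec.turns must contain at least one turn")
    for i, turn in enumerate(turns):
        if turn.get("role") not in ("user", "assistant", "system"):
            problems.append(f"turn {i}: unknown role {turn.get('role')!r}")
        for assertion in turn.get("assertions", []):
            if "type" not in assertion:
                problems.append(f"turn {i}: assertion missing 'type'")
    return problems

scenario = {
    "apiVersion": "promptkit.altairalabs.ai/v1alpha1",
    "kind": "Scenario",
    "metadata": {"name": "basic-customer-inquiry"},
    "spec": {
        "task_type": "support",
        "turns": [
            {"role": "user", "content": "What are your business hours?",
             "assertions": [{"type": "content_includes",
                             "params": {"patterns": ["Monday"]}}]},
        ],
    },
}

print(validate_scenario(scenario))  # → []
```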

Key Components

Scenario Metadata

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: support-test
  labels:
    category: support
    region: us
  annotations:
    description: "Tests customer support responses"

Spec Section

The spec section contains your test configuration:

spec:
  task_type: support
  description: "Conversation flow test"
  
  context_metadata:
    urgency: high
    topic: billing
  
  turns:
    # Conversation turns go here

Conversation Turns

Define user messages and expected responses:

spec:
  turns:
    # Basic turn
    - role: user
      content: "Hello, I need help"
    
    # Turn with assertions
    - role: user
      content: "What's the refund policy?"
      assertions:
        - type: content_includes
          params:
            patterns: ["30 days"]
            message: "Should mention refund period"
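To make the assertion semantics concrete, here is a sketch of how a runner could evaluate a content_includes assertion against a model response. This is an illustrative implementation, not the framework's actual code, and the case-insensitive match is an assumption.

```python
# Illustrative evaluation of a content_includes assertion: every pattern
# must appear in the response (case-insensitively, as an assumption here).

def check_content_includes(response: str, params: dict) -> tuple[bool, str]:
    missing = [p for p in params["patterns"] if p.lower() not in response.lower()]
    if missing:
        return False, f"{params.get('message', 'assertion failed')}: missing {missing}"
    return True, "ok"

ok, detail = check_content_includes(
    "Refunds are accepted within 30 days of purchase.",
    {"patterns": ["30 days"], "message": "Should mention refund period"},
)
print(ok, detail)  # → True ok
```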

Common Patterns

Multi-turn Conversations

Test conversation flow and context retention:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: multi-turn-support-dialog

spec:
  task_type: support
  
  turns:
    - role: user
      content: "I'm having issues with my account"
    
    - role: user
      content: "It won't let me log in"
      assertions:
        - type: content_includes
          params:
            patterns: ["password"]
            message: "Should offer password reset"
    
    - role: user
      content: "I already tried that"
      assertions:
        - type: content_includes
          params:
            patterns: ["help"]
            message: "Should provide alternative help"

Tool/Function Calling

Test tool integration:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: weather-query-with-tool

spec:
  task_type: assistant
  
  turns:
    - role: user
      content: "What's the weather in San Francisco?"
      assertions:
        - type: tools_called
          params:
            tools: ["get_weather"]
            message: "Should call weather tool"
        
        - type: content_includes
          params:
            patterns: ["temperature"]
            message: "Should mention temperature"
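A tools_called assertion compares the tools the model actually invoked against the expected list. The sketch below assumes recorded calls are available as a list of dicts with a "name" key; the real recording format may differ.

```python
# Illustrative check for a tools_called assertion. The recorded-call shape
# ({"name": ..., "arguments": ...}) is an assumption for this sketch.

def check_tools_called(recorded_calls: list[dict], params: dict) -> tuple[bool, str]:
    called = {call["name"] for call in recorded_calls}
    missing = [t for t in params["tools"] if t not in called]
    if missing:
        return False, f"expected tools not called: {missing}"
    return True, "ok"

calls = [{"name": "get_weather", "arguments": {"city": "San Francisco"}}]
print(check_tools_called(calls, {"tools": ["get_weather"]}))  # → (True, 'ok')
```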

Advanced Features

Fixtures

Reuse common data across scenarios:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: account-specific-test

spec:
  task_type: support
  
  fixtures:
    sample_user:
      name: "Jane Smith"
      account_id: "12345"
      plan: "Premium"
  
  system_template: |
    You are helping user {{fixtures.sample_user.name}} 
    (Account: {{fixtures.sample_user.account_id}}).
  
  turns:
    - role: user
      content: "What's my current plan?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Premium"]
            message: "Should reference user's Premium plan"
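The `{{fixtures.sample_user.name}}` placeholders in system_template resolve against the fixtures mapping by dotted path. The snippet below sketches one way such substitution could work; the actual templating engine used by the framework may behave differently.

```python
import re

# Sketch of {{dotted.path}} placeholder resolution: split the path on dots
# and walk the context mapping. Illustrative only, not the real engine.

def render(template: str, context: dict) -> str:
    def resolve(match: re.Match) -> str:
        value = context
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{([^}]+)\}\}", resolve, template)

context = {"fixtures": {"sample_user": {"name": "Jane Smith", "account_id": "12345"}}}
print(render("You are helping user {{fixtures.sample_user.name}} "
             "(Account: {{fixtures.sample_user.account_id}}).", context))
# → You are helping user Jane Smith (Account: 12345).
```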

Conditional Assertions

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: order-status-check

spec:
  task_type: support
  
  turns:
    - role: user
      content: "Check my order status"
      assertions:
        - type: content_includes
          params:
            patterns: ["shipped"]
            message: "Should mention shipped status"

Multiple Providers

Test the same scenario across different models:

# In arena.yaml
scenarios:
  - path: ./scenarios/cross-provider.yaml
    providers: [openai-gpt4, claude-sonnet, gemini-pro]

Best Practices

1. Use Descriptive Names

# Good
metadata:
  name: escalation-frustrated-billing-issue

# Avoid
metadata:
  name: test-1

2. Label Appropriately

metadata:
  labels:
    category: billing
    priority: high

Useful for selective test runs:

# Run only high-priority tests
promptarena run --scenario high-priority

3. Balance Specificity

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: balance-check-flexible

spec:
  task_type: support
  
  turns:
    # Too specific (brittle)
    - role: user
      content: "What's my account balance?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Your account balance is exactly $42.00"]
            message: "Exact match - too brittle"
    
    # Better (flexible)
    - role: user
      content: "What's my account balance?"
      assertions:
        - type: content_includes
          params:
            patterns: ["account balance"]
            message: "Should mention balance"
        
        - type: content_matches
          params:
            pattern: '\$\d+\.\d{2}'
            message: "Should include dollar amount"
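The content_matches assertion applies a regular expression to the response, which is what makes the flexible variant robust: any dollar amount satisfies it, not just $42.00. A quick check of the pattern from the scenario:

```python
import re

# content_matches applies a regex to the response. The pattern below is
# the one from the scenario: a dollar amount with two decimal places.

def check_content_matches(response: str, pattern: str) -> bool:
    return re.search(pattern, response) is not None

print(check_content_matches("Your account balance is $42.00.", r"\$\d+\.\d{2}"))  # → True
print(check_content_matches("Your balance is forty-two dollars.", r"\$\d+\.\d{2}"))  # → False
```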

4. Test Edge Cases

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: edge-case-testing

spec:
  task_type: support
  
  fixtures:
    long_text: "Very long text here..."  # e.g. 10,000 characters
  
  turns:
    - role: user
      content: ""
      assertions:
        - type: content_includes
          params:
            patterns: ["help"]
            message: "Should handle empty input gracefully"
    
    - role: user
      content: "{{fixtures.long_text}}"
      assertions:
        - type: content_not_empty
          params:
            message: "Should respond to very long input"

File Organization

Structure your scenarios for maintainability:

scenarios/
├── customer-support/
│   ├── basic-inquiries.yaml
│   ├── billing-issues.yaml
│   └── escalations.yaml
├── content-generation/
│   ├── blog-posts.yaml
│   └── social-media.yaml
└── tool-integration/
    ├── weather.yaml
    └── database.yaml
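With this layout, a runner (or your own tooling) can discover every scenario with a recursive glob. The sketch below builds the tree in a temporary directory so it is self-contained; in a real project you would glob the scenarios/ directory directly.

```python
import tempfile
from pathlib import Path

# Discover every scenario file under a nested scenarios/ tree.
# The directory layout mirrors the structure shown above.

with tempfile.TemporaryDirectory() as root:
    for rel in ("customer-support/basic-inquiries.yaml",
                "customer-support/billing-issues.yaml",
                "tool-integration/weather.yaml"):
        path = Path(root, "scenarios", rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("kind: Scenario\n")

    found = sorted(p.relative_to(root).as_posix()
                   for p in Path(root, "scenarios").rglob("*.yaml"))
    print(found)
```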

Examples

See complete working examples in the examples directory.

Next Steps