Write Test Scenarios

Learn how to create and structure test scenarios for LLM testing.

Test scenarios define the conversation flows, expected behaviors, and validation criteria for your LLM applications. Each scenario is a YAML file in the PromptPack format.

The fastest way to create scenarios is using the project generator:

# Create a new test project with sample scenario
promptarena init my-test --quick --provider mock
# This generates scenarios/basic-test.yaml for you
cd my-test
cat scenarios/basic-test.yaml

The generated scenario includes working examples of assertions, multi-turn conversations, and best practices. Use it as a starting point and customize for your needs.

If you prefer to create scenarios from scratch or need custom structures, create a file scenarios/basic-test.yaml:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: basic-customer-inquiry
  labels:
    category: customer-service
    priority: basic
spec:
  task_type: support  # Links to prompt configuration
  description: "Tests customer support responses"
  # The conversation turns
  turns:
    - role: user
      content: "What are your business hours?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Monday"]
            message: "Should mention business days"
        - type: response_time  # hypothetical latency assertion; adjust to your runner's assertion types
          params:
            max_seconds: 3
            message: "Should respond quickly"
    - role: user
      content: "Do you offer weekend support?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Saturday"]
            message: "Should mention weekend availability"

Every scenario has metadata and spec sections. The metadata section identifies and labels your scenario:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: support-test
  labels:
    category: support
    region: us
  annotations:
    description: "Tests customer support responses"

The spec section contains your test configuration:

spec:
  task_type: support
  description: "Conversation flow test"
  context_metadata:
    urgency: high
    topic: billing
  turns:
    # Conversation turns go here

Use required_capabilities to ensure scenarios only run against providers that support specific features:

spec:
  task_type: capability-test
  description: "Test vision with streaming"
  required_capabilities:
    - vision
    - streaming
  streaming: true
  turns:
    - role: user
      parts:
        - type: text
          text: "Describe this image"
        - type: image
          media:
            url: "https://example.com/image.png"
            mime_type: "image/png"

Available Capabilities: text, streaming, vision, tools, json, audio, video, documents, duplex

This scenario will only run against providers that have both vision AND streaming in their capabilities list.
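On the provider side, this corresponds to the provider's declared capabilities list. A hypothetical sketch of such a declaration (the provider schema is not shown on this page, so treat the field names as assumptions and consult the provider configuration reference):

# illustrative provider entry; exact schema may differ
providers:
  - name: openai-gpt4
    capabilities: [text, streaming, vision, tools, json]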

Use tool_policy to control tool calling behavior:

spec:
  task_type: test
  description: "Text-only test without tools"
  tool_policy:
    tool_choice: none  # Disable tool calling
    max_tool_calls_per_turn: 0
    max_total_tool_calls: 0
  turns:
    - role: user
      content: "Just respond with text, no tools"

Tool Choice Options:

  • none - Disable tool calling entirely
  • auto - Let the model decide (default)
  • required - Force tool use
  • <tool_name> - Force a specific tool (see example below)

Example - Require tool use:

spec:
  tool_policy:
    tool_choice: required
    max_tool_calls_per_turn: 3
    max_total_tool_calls: 10
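
To force one specific tool instead, use its name as the tool_choice value (here get_weather, the tool used in the tool-calling example later on):

spec:
  tool_policy:
    tool_choice: get_weather  # force this specific tool
    max_tool_calls_per_turn: 1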

This is useful when:

  • Testing scenarios that should NOT use tools (text, streaming, vision tests)
  • Forcing tool usage for tool-specific tests
  • Limiting tool calls to prevent runaway loops

Define user messages and expected responses:

spec:
  turns:
    # Basic turn
    - role: user
      content: "Hello, I need help"

    # Turn with assertions
    - role: user
      content: "What's the refund policy?"
      assertions:
        - type: content_includes
          params:
            patterns: ["30 days"]
            message: "Should mention refund period"

Test conversation flow and context retention:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: multi-turn-support-dialog
spec:
  task_type: support
  turns:
    - role: user
      content: "I'm having issues with my account"
    - role: user
      content: "It won't let me log in"
      assertions:
        - type: content_includes
          params:
            patterns: ["password"]
            message: "Should offer password reset"
    - role: user
      content: "I already tried that"
      assertions:
        - type: content_includes
          params:
            patterns: ["help"]
            message: "Should provide alternative help"

Test tool integration:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: weather-query-with-tool
spec:
  task_type: assistant
  turns:
    - role: user
      content: "What's the weather in San Francisco?"
      assertions:
        - type: tools_called
          params:
            tools: ["get_weather"]
            message: "Should call weather tool"
        - type: content_includes
          params:
            patterns: ["temperature"]
            message: "Should mention temperature"

Reuse common data across scenarios:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: account-specific-test
spec:
  task_type: support
  fixtures:
    sample_user:
      name: "Jane Smith"
      account_id: "12345"
      plan: "Premium"
  system_template: |
    You are helping user {{fixtures.sample_user.name}}
    (Account: {{fixtures.sample_user.account_id}}).
  turns:
    - role: user
      content: "What's my current plan?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Premium"]
            message: "Should reference user's Premium plan"

A second self-contained example:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: order-status-check
spec:
  task_type: support
  turns:
    - role: user
      content: "Check my order status"
      assertions:
        - type: content_includes
          params:
            patterns: ["shipped"]
            message: "Should mention shipped status"

Test the same scenario across different models:

# In arena.yaml
scenarios:
  - path: ./scenarios/cross-provider.yaml
    providers: [openai-gpt4, claude-sonnet, gemini-pro]

Give scenarios descriptive names:

# Good
- name: "Escalation: Frustrated customer with billing issue"

# Avoid
- name: "Test 1"

Tag scenarios so related tests are easy to find:

tags: [billing, high-priority, multi-turn, tool-calling]

Useful for selective test runs:

# Run only high-priority tests
promptarena run --scenario high-priority
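
The same flag accepts other values, for example the billing tag used above (assuming it filters by tag, as in the high-priority example):

promptarena run --scenario billing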

Prefer flexible assertions over brittle exact matches:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: balance-check-flexible
spec:
  task_type: support
  turns:
    # Too specific (brittle)
    - role: user
      content: "What's my account balance?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Your account balance is exactly $42.00"]
            message: "Exact match - too brittle"

    # Better (flexible)
    - role: user
      content: "What's my account balance?"
      assertions:
        - type: content_includes
          params:
            patterns: ["account balance"]
            message: "Should mention balance"
        - type: content_matches
          params:
            pattern: '\$\d+\.\d{2}'
            message: "Should include dollar amount"

Test edge cases such as empty and very long inputs:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: edge-case-testing
spec:
  task_type: support
  fixtures:
    long_text: "Very long text here..."  # e.g. 10,000 chars
  turns:
    - role: user
      content: ""
      assertions:
        - type: content_includes
          params:
            patterns: ["help"]
            message: "Should handle empty input gracefully"
    - role: user
      content: "{{fixtures.long_text}}"
      assertions:
        - type: content_not_empty
          params:
            message: "Should respond to very long input"

Structure your scenarios for maintainability:

scenarios/
├── customer-support/
│   ├── basic-inquiries.yaml
│   ├── billing-issues.yaml
│   └── escalations.yaml
├── content-generation/
│   ├── blog-posts.yaml
│   └── social-media.yaml
└── tool-integration/
    ├── weather.yaml
    └── database.yaml
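
Each suite can then be referenced from arena.yaml with the path key shown earlier (files listed explicitly here, since glob support is not shown on this page):

scenarios:
  - path: ./scenarios/customer-support/basic-inquiries.yaml
  - path: ./scenarios/customer-support/billing-issues.yaml
  - path: ./scenarios/tool-integration/weather.yaml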

See complete working examples in the examples directory.