Guardrails Test Example
This example demonstrates the guardrail assertion feature, which allows you to test whether validators (guardrails) trigger as expected in your prompt configurations.
Overview
The guardrail_triggered assertion type enables you to:
- Verify guardrails trigger when they should - Test that your validators catch problematic inputs
- Verify guardrails don’t trigger when they shouldn’t - Ensure clean inputs pass through without false positives
- Test in non-production mode - Use suppress_validation_exceptions: true to allow execution to continue after validation failures, so you can assert on the guardrail behavior (see the sketch below)
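As a quick illustration of the two assertion directions, a scenario's assertions block can expect the validator either to fire or to stay silent. The snippet below is a minimal sketch; the guardrail_triggered fields match the assertion format documented later in this example, while the message strings are just illustrative.

assertions:
  # Expect the banned_words validator to fire on a problematic input
  - type: guardrail_triggered
    validator: banned_words
    should_trigger: true
    message: "Guardrail should catch the banned word"

  # Expect the same validator to stay silent on a clean input
  - type: guardrail_triggered
    validator: banned_words
    should_trigger: false
    message: "Guardrail should not fire on clean input"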
Key Concept: SuppressValidationExceptions
By default, when a validator fails, it throws a ValidationError and halts execution. This is the correct behavior for production.
For testing purposes, Arena automatically enables SuppressValidationExceptions mode in the validator middleware. This allows:
- The validator to run and record its result (pass/fail)
- Execution to continue even if validation fails
- Assertions to inspect whether the guardrail triggered
Important: This suppression behavior is built into Arena’s pipeline construction, not configured in the PromptConfig. Your production prompt configurations remain unchanged - they use the same validator definitions for both production and testing.
Example Structure
guardrails-test/
├── arena.yaml # Test scenarios with guardrail_triggered assertions
├── prompts/
│ └── content-filter.yaml # Prompt with banned_words validator
└── providers/
└── openai.yaml # OpenAI provider configuration
Configuration Details
Prompt Configuration (prompts/content-filter.yaml)
The prompt includes a banned_words validator - the same configuration used in production:
validators:
- type: banned_words
params:
words:
- damn
- crap
- hell
case_sensitive: false
Note: No special test-only flags are needed in the PromptConfig. Arena’s test framework automatically enables suppression mode when running validators.
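For orientation, the validator block above sits inside the full prompt definition. A minimal sketch of prompts/content-filter.yaml follows; only the validators section is taken from this example, while the surrounding keys (name, model, template) and the {{input}} placeholder are assumptions about the prompt schema, not the actual file contents.

# Illustrative sketch only - surrounding keys are assumed, not copied from the example
name: content-filter            # prompt identifier (assumed key)
model: gpt-4o-mini              # model reference (assumed key and value)
template: |                     # prompt template (assumed key)
  You are a polite assistant. Respond to: {{input}}
validators:                     # this section matches the example above
  - type: banned_words
    params:
      words:
        - damn
        - crap
        - hell
      case_sensitive: false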
Test Scenarios (arena.yaml)
Four test scenarios demonstrate different assertion patterns:
- guardrail-should-trigger: Input contains banned words → expect validator to trigger
- guardrail-should-not-trigger: Clean input → expect validator not to trigger
- multiple-violations: Multiple banned words → expect validator to trigger
- streaming-guardrail-trigger: Tests guardrail in streaming mode → expect validator to trigger and interrupt stream
Each scenario uses the guardrail_triggered assertion:
assertions:
- type: guardrail_triggered
validator: banned_words # Name of the validator to check
should_trigger: true # Expected behavior (true = should fail, false = should pass)
message: "Descriptive message for test output"
Running the Tests
- Set up your OpenAI API key:
  export OPENAI_API_KEY="your-api-key-here"
- Run the Arena tests:
  promptarena run examples/guardrails-test/arena.yaml
Expected Results
- ✅ guardrail-should-trigger: PASS (validator triggered as expected)
- ✅ guardrail-should-not-trigger: PASS (validator did not trigger as expected)
- ✅ multiple-violations: PASS (validator triggered on multiple violations as expected)
- ✅ streaming-guardrail-trigger: PASS (validator triggered in streaming mode and interrupted stream as expected)
Streaming Mode Support
The streaming-guardrail-trigger scenario demonstrates how guardrails work with streaming responses:
- Real-time validation: Validators process each chunk as it arrives
- Stream interruption: When a validation fails, the stream is immediately interrupted
- Suppression behavior: With suppression enabled (Arena test mode), the stream interrupts but no error is thrown
- Recorded results: Validation failures are still recorded in metadata for assertions to inspect
This ensures that guardrails work correctly in both regular and streaming execution modes.
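A streaming scenario is asserted in the same way; only the streaming switch differs. In the sketch below, the stream flag name and the outer scenario keys are assumptions about the scenario schema, while the assertion block has the same guardrail_triggered shape shown earlier.

# Illustrative sketch of another scenarios entry; the stream flag name is assumed
- name: streaming-guardrail-trigger
  prompt: content-filter
  stream: true                            # assumed flag enabling streaming execution
  input: "Well, damn, that did not work."
  assertions:
    - type: guardrail_triggered
      validator: banned_words
      should_trigger: true
      message: "banned_words should trigger and interrupt the stream"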
How It Works
- Execution Phase:
  - User input is processed through the prompt
  - Validators run (Arena automatically enables suppression mode)
  - Validation results are recorded in the execution context, but errors are suppressed
  - The LLM generates a response
- Assertion Phase:
  - The guardrail_triggered assertion inspects the execution context
  - It finds the last assistant message and its validation results
  - It checks if the specified validator passed or failed
  - It compares the actual result against the should_trigger expectation
- Test Outcome:
  - If actual behavior matches expectation → Test PASS
  - If actual behavior differs from expectation → Test FAIL with descriptive error
Use Cases
This pattern is valuable for:
- Regression testing: Ensure guardrails continue to work as expected over time
- Configuration validation: Verify validator configs (banned word lists, patterns, etc.) are correct
- Coverage testing: Confirm edge cases are properly handled by your guardrails
- CI/CD integration: Automated testing of prompt safety measures (see the workflow sketch below)
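As a concrete example of the CI/CD point above, the guardrail tests can run in any pipeline that can install the promptarena CLI and export an API key. The GitHub Actions workflow below is a sketch under those assumptions: the installation step is a placeholder, and only the promptarena run command is taken from this document.

# .github/workflows/guardrail-tests.yml - illustrative sketch
name: guardrail-tests
on: [push, pull_request]

jobs:
  arena:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install promptarena
        run: |
          # Placeholder: install the promptarena CLI however your project distributes it
          echo "install promptarena here"
      - name: Run guardrail assertions
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: promptarena run examples/guardrails-test/arena.yaml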
Production vs Test Mode
Production Mode (SDK/Conversation API - default):
// Production code uses DynamicValidatorMiddleware with default behavior
middleware.DynamicValidatorMiddleware(registry)
- Validation failures throw errors immediately
- Execution halts on first validation failure
- Appropriate for live user-facing systems
Test Mode (Arena test framework):
// Arena automatically uses suppression mode
middleware.DynamicValidatorMiddlewareWithSuppression(registry, true)
- Validation failures are logged but don’t throw errors
- Execution continues so assertions can inspect results
- Appropriate for automated testing and development
The same PromptConfig works in both modes - no test-specific configuration needed!
Related Documentation
- GUARDRAIL_ASSERTION_PROPOSAL.md - Original proposal
- GitHub Issue #25 - Implementation tracking
- Arena User Guide - General Arena testing documentation