Tutorial 5: CI/CD Integration
Learn how to integrate PromptArena testing into your CI/CD pipeline for automated quality gates.
What You’ll Learn
- Set up Arena in GitHub Actions
- Configure quality gates
- Generate CI-friendly reports
- Handle API keys securely
- Optimize test execution for CI
- Create deployment gates
Prerequisites
- Completed Tutorials 1-4
- GitHub repository
- Basic CI/CD knowledge
Why Automate LLM Testing?
Manual testing doesn’t scale. Automated testing in CI/CD:
- Catches regressions before deployment
- Enforces quality standards automatically
- Validates changes on every commit
- Provides confidence for releases
- Documents behavior over time
Step 1: Prepare Your Tests
Organize tests for CI execution:
```
my-project/
├── .github/
│   └── workflows/
│       └── llm-tests.yml
├── tests/
│   ├── arena.yaml
│   ├── prompts/
│   ├── providers/
│   └── scenarios/
│       ├── critical/      # Must-pass tests
│       │   └── core.yaml
│       ├── integration/   # Full suite
│       │   └── full.yaml
│       └── smoke/         # Quick validation
│           └── basic.yaml
```
Step 2: Choose Installation Method
For CI/CD, we recommend using npm for PromptArena installation:
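In GitHub Actions, this is a single step (the same install step reused in every workflow in this tutorial):

```yaml
- name: Install PromptArena
  run: npm install -g @altairalabs/promptarena
```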
Why npm for CI?
- ✅ No Go toolchain required (faster setup)
- ✅ Works on all platforms (Linux, macOS, Windows)
- ✅ Simpler workflow (one command)
- ✅ Node.js typically already in CI environments
- ✅ Faster installation time
Alternative: Go Install (if you prefer):
```yaml
- name: Set up Go
  uses: actions/setup-go@v5
  with:
    go-version: '1.23'

- name: Install PromptArena
  run: go install github.com/altairalabs/promptkit/tools/arena@latest
```
For this tutorial, we’ll use npm as it’s simpler and faster.
Step 3: Create GitHub Actions Workflow
Create .github/workflows/llm-tests.yml:
```yaml
name: LLM Quality Tests

on:
  # Run on every push to main and PRs
  push:
    branches: [main]
  pull_request:
    branches: [main]

  # Allow manual trigger
  workflow_dispatch:

jobs:
  # Fast smoke tests (< 1 minute)
  smoke-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install PromptArena
        run: npm install -g @altairalabs/promptarena

      - name: Run smoke tests with mock provider
        working-directory: tests
        run: |
          promptarena run \
            --scenario smoke \
            --mock-provider \
            --ci \
            --format junit,json

      - name: Upload smoke test results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: smoke-test-results
          path: tests/out/

  # Comprehensive tests (with real LLMs)
  integration-tests:
    runs-on: ubuntu-latest
    needs: smoke-tests  # Only run if smoke tests pass
    timeout-minutes: 15

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install PromptArena
        run: npm install -g @altairalabs/promptarena

      - name: Run integration tests
        working-directory: tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          promptarena run \
            --scenario integration \
            --ci \
            --concurrency 2 \
            --format junit,json,html

      - name: Publish test results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: LLM Integration Tests
          path: tests/out/junit.xml
          reporter: java-junit

      - name: Upload test artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: integration-test-results
          path: tests/out/

      - name: Check quality gate
        run: |
          PASS_RATE=$(jq '.summary.pass_rate' tests/out/results.json)
          echo "Pass rate: $PASS_RATE"

          if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
            echo "❌ Quality gate failed: Pass rate $PASS_RATE < 95%"
            exit 1
          fi

          echo "✅ Quality gate passed: $PASS_RATE >= 95%"
```
Step 4: Configure API Keys as Secrets
In your GitHub repository:
- Go to Settings → Secrets and variables → Actions
- Click New repository secret
- Add your API keys:
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- GOOGLE_API_KEY
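If you prefer the terminal, the GitHub CLI can create the same secrets. A quick sketch, assuming gh is installed and authenticated (each command prompts for the secret value):

```bash
# Each command stores one repository secret
gh secret set OPENAI_API_KEY
gh secret set ANTHROPIC_API_KEY
gh secret set GOOGLE_API_KEY
```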
Step 5: Create Test Suites for CI
Smoke Tests (Fast Validation)
scenarios/smoke/basic.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: basic-response-test
  labels:
    category: smoke
    speed: fast

spec:
  task_type: support

  turns:
    - role: user
      content: "Hello"
      assertions:
        - type: content_length
          params:
            max: 200
          message: "Response should be brief"
```
Critical Path Tests
scenarios/critical/core.yaml:
```yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: core-functionality
  labels:
    category: critical
    priority: must-pass

spec:
  task_type: support

  turns:
    - role: user
      content: "What are your business hours?"
      assertions:
        - type: content_includes
          params:
            patterns: ["Monday"]
          message: "Should mention business days"
        - type: latency
          params:
            max_seconds: 3
          message: "Should respond quickly"

    - role: user
      content: "How do I contact support?"
      assertions:
        - type: content_includes
          params:
            patterns: ["email"]
          message: "Should provide contact methods"
```
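Before wiring these suites into CI, it’s worth running them locally with the same flags the workflow will use, for example:

```bash
# Run from the tests/ directory, matching the workflow's working-directory
cd tests
promptarena run --scenario smoke --mock-provider --ci
promptarena run --scenario critical --ci
```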
Step 6: Add Quality Gates
Create tests/quality-gates.sh:
```bash
#!/bin/bash
set -e

RESULTS_FILE="out/results.json"

# Extract metrics
PASS_RATE=$(jq '.summary.pass_rate' $RESULTS_FILE)
TOTAL=$(jq '.summary.total' $RESULTS_FILE)
FAILED=$(jq '.summary.failed' $RESULTS_FILE)

echo "📊 Test Results:"
echo "  Total: $TOTAL"
echo "  Pass Rate: $PASS_RATE"
echo "  Failed: $FAILED"

# Quality gates
MIN_PASS_RATE=0.95
MAX_FAILURES=5

echo ""
echo "🚦 Quality Gates:"

# Check pass rate
if (( $(echo "$PASS_RATE < $MIN_PASS_RATE" | bc -l) )); then
  echo "❌ Pass rate $PASS_RATE < $MIN_PASS_RATE"
  exit 1
fi
echo "✅ Pass rate: $PASS_RATE >= $MIN_PASS_RATE"

# Check failure count
if [ "$FAILED" -gt "$MAX_FAILURES" ]; then
  echo "❌ Too many failures: $FAILED > $MAX_FAILURES"
  exit 1
fi
echo "✅ Failure count: $FAILED <= $MAX_FAILURES"

echo ""
echo "✅ All quality gates passed!"
```
Use in workflow:
```yaml
- name: Run tests
  run: promptarena run --ci --format json

- name: Check quality gates
  run: bash tests/quality-gates.sh
```
Step 7: Optimize for CI Performance
Use Concurrency Control
```bash
# Respect rate limits
promptarena run --concurrency 2 --ci
```
Cache Test Results
```yaml
- name: Cache test dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.cache/go-build
      ~/go/pkg/mod
    key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
```
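The cache above covers Go build artifacts, which only helps if you chose the Go install route. If you install via npm as recommended, caching npm’s cache directory is the analogous setup; a sketch (the cache key is illustrative):

```yaml
# Cache npm's download cache to speed up repeated global installs
- name: Cache npm
  uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-npm-promptarena
```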
Selective Testing
```bash
# Only run changed scenarios
if [ "$GITHUB_EVENT_NAME" = "pull_request" ]; then
  promptarena run --scenario critical --ci
else
  promptarena run --ci  # Full suite on main
fi
```
Step 8: PR Comments with Results
Add test results to PR comments:
```yaml
- name: Comment PR with results
  if: github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const results = JSON.parse(fs.readFileSync('tests/out/results.json', 'utf8'));

      const summary = results.summary;
      const passRate = (summary.pass_rate * 100).toFixed(1);

      const comment = `## 🤖 LLM Test Results

      | Metric | Value |
      |--------|-------|
      | Total Tests | ${summary.total} |
      | Passed | ${summary.passed} ✅ |
      | Failed | ${summary.failed} ❌ |
      | Pass Rate | ${passRate}% |

      ${passRate >= 95 ? '✅ Quality gate: PASSED' : '❌ Quality gate: FAILED'}

      [View detailed report](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})`;

      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });
```
Step 9: Multi-Environment Testing
Test across dev, staging, production:
```yaml
strategy:
  matrix:
    environment: [dev, staging, prod]

steps:
  - name: Run tests in ${{ matrix.environment }}
    env:
      OPENAI_API_KEY: ${{ secrets[format('{0}_OPENAI_API_KEY', matrix.environment)] }}
    run: |
      promptarena run \
        --config arena-${{ matrix.environment }}.yaml \
        --ci \
        --out out/${{ matrix.environment }}
```
Step 10: Scheduled Testing
Run tests on a schedule:
```yaml
on:
  schedule:
    # Every 6 hours
    - cron: '0 */6 * * *'

jobs:
  scheduled-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install PromptArena
        run: npm install -g @altairalabs/promptarena

      - name: Run full test suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptarena run --ci --format json,html

      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "🚨 Scheduled LLM tests failed",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Scheduled LLM Tests Failed*\n<https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Details>"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
Step 11: Deployment Gates
Block deployments on test failures:
```yaml
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install PromptArena
        run: npm install -g @altairalabs/promptarena

      - name: Run LLM tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptarena run --scenario critical --ci

  deploy:
    needs: test  # Only deploy if tests pass
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: |
          echo "Deploying..."
          # Your deployment commands
```
Complete Example Workflow
Here’s a production-ready workflow:
```yaml
name: LLM Quality Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 */6 * * *'

env:
  GO_VERSION: '1.23'

jobs:
  smoke:
    name: Smoke Tests (Mock)
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4

      - name: Install Arena
        run: npm install -g @altairalabs/promptarena

      - name: Smoke tests
        working-directory: tests
        run: promptarena run --scenario smoke --mock-provider --ci --format junit

      - name: Publish results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Smoke Tests
          path: tests/out/junit.xml
          reporter: java-junit

  critical:
    name: Critical Path Tests
    needs: smoke
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - name: Install Arena
        run: npm install -g @altairalabs/promptarena

      - name: Critical tests
        working-directory: tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptarena run \
            --scenario critical \
            --ci \
            --concurrency 2 \
            --format junit,json,html

      - name: Quality gate
        working-directory: tests
        run: |
          PASS_RATE=$(jq '.summary.pass_rate' out/results.json)
          if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
            echo "❌ Quality gate failed: $PASS_RATE < 95%"
            exit 1
          fi

      - name: Upload results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: critical-test-results
          path: tests/out/

  integration:
    name: Integration Tests
    needs: critical
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4

      - name: Install Arena
        run: npm install -g @altairalabs/promptarena

      - name: Full test suite
        working-directory: tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
        run: |
          promptarena run \
            --ci \
            --concurrency 3 \
            --format junit,json,html

      - name: Upload results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: integration-test-results
          path: tests/out/
```
Best Practices
1. Fast Feedback Loop
```
# Stage 1: Mock tests (< 1 min)
# Stage 2: Critical tests (< 5 min)
# Stage 3: Full suite (< 20 min)
```
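The stages are chained with needs:, exactly as in the complete example above; a minimal skeleton (checkout and install steps omitted for brevity):

```yaml
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - run: promptarena run --scenario smoke --mock-provider --ci
  critical:
    needs: smoke        # skipped if smoke fails
    runs-on: ubuntu-latest
    steps:
      - run: promptarena run --scenario critical --ci
  full:
    needs: critical     # skipped if critical fails
    runs-on: ubuntu-latest
    steps:
      - run: promptarena run --ci
```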
2. Fail Fast
```
timeout-minutes: 10   # Kill hung tests
--concurrency 2       # Respect rate limits
--ci                  # Optimized output
```
3. Secure Secrets
```yaml
# ✅ Use GitHub Secrets
env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
```yaml
# ❌ Never commit keys
env:
  OPENAI_API_KEY: "sk-..."  # WRONG!
```
4. Informative Reports
```
# Generate multiple formats
--format junit,json,html
```
```yaml
# Upload for review
uses: actions/upload-artifact@v4
```
Troubleshooting
Tests Timeout
```yaml
# Increase timeout
timeout-minutes: 20
```
```
# Reduce concurrency
--concurrency 1
```
Rate Limiting
```
# Lower concurrency
--concurrency 1
```
```
# Use mock providers for structure validation
--mock-provider
```
Secrets Not Working
```bash
# Check the secret is set
echo "Key length: ${#OPENAI_API_KEY}"
# Should print a nonzero length; 0 means the secret is missing
```
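To catch this before the test run itself fails, you can add a guard step to the workflow; a sketch, with an illustrative step name:

```yaml
- name: Verify API keys are configured
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    # An empty value means the repository secret was never set
    if [ -z "$OPENAI_API_KEY" ]; then
      echo "❌ OPENAI_API_KEY is not set"
      exit 1
    fi
```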
Next Steps
Congratulations! You’ve completed all Arena tutorials.
What’s next:
- Arena How-To Guides - Practical recipes
- Arena Reference - Complete API docs
- SDK Tutorials - Integrate Arena with your app
Advanced topics:
- Set up trend analysis
- Create custom quality metrics
- Build deployment pipelines
- Implement A/B testing
You’re now ready to build production-grade LLM testing pipelines!