Gate model migrations on a regression suite
This how-to walks through examples/model-migration/ — a small regression suite that runs the same scenarios against two mock providers simulating a model upgrade. The pattern works for any “did this swap break anything” question: GPT-4o → 4o-mini, Claude 3.5 → 4, OpenAI → Anthropic.
What it proves
Migration testing has a sneaky failure mode: most prompts still work after a model swap, so you trust the change. The 5% that broke produce subtly wrong outputs that pass visual review and only get caught in production.
PromptArena makes migration a real gate:
- One scenario set, multiple providers registered, one report.
- Identical assertions per scenario; each must pass on every registered model.
- If any model fails an assertion the others passed, the report flags it and run --ci exits non-zero.
- Pin the suite in CI before the migration PR can merge.
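The gate described in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not PromptArena's implementation: the `results` dictionary shape and the function names `find_regressions` and `gate` are assumptions made for this example, and the real JSON report may be structured differently.

```python
# Hypothetical in-memory results: results[provider][scenario][assertion] -> passed?
# This shape is an assumption for illustration, not PromptArena's report schema.
results = {
    "v1": {
        "billing-inquiry": {"content_includes": True, "max_length": True},
        "tech-inquiry": {"content_includes": True, "max_length": True},
    },
    "v2": {
        "billing-inquiry": {"content_includes": True, "max_length": True},
        "tech-inquiry": {"content_includes": True, "max_length": False},
    },
}


def find_regressions(results):
    """Return (provider, scenario, assertion) triples that failed on one
    provider while passing on at least one other provider."""
    regressions = []
    for provider, scenarios in results.items():
        for scenario, assertions in scenarios.items():
            for name, passed in assertions.items():
                if passed:
                    continue
                passed_elsewhere = any(
                    other[scenario].get(name, False)
                    for other_name, other in results.items()
                    if other_name != provider and scenario in other
                )
                if passed_elsewhere:
                    regressions.append((provider, scenario, name))
    return regressions


def gate(results):
    """CI exit code: 0 when every assertion passed on every provider, else 1."""
    all_passed = all(
        passed
        for scenarios in results.values()
        for assertions in scenarios.values()
        for passed in assertions.values()
    )
    return 0 if all_passed else 1


for provider, scenario, name in find_regressions(results):
    print(f"regression: {provider} / {scenario}: {name}")
exit_code = gate(results)
```

The two functions are kept separate on purpose: `find_regressions` answers "what did the new model break that the old one handled?", while `gate` is the stricter rule that any failure on any provider blocks the merge.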
Run it
    cd examples/model-migration
    promptarena serve

The web UI groups runs by provider; expand a scenario to see each model’s output and assertion results.
Headless / CI:
    promptarena run --ci --formats html,json
    open out/report.html

Keyless: both providers are mock. The report shows side-by-side results.
What a regression looks like
The default config has both mock providers passing all assertions. To see how a regression surfaces, edit mock-responses-v2.yaml to make the new model break the one-word format:

    # Was: "technical"
    # Now: model adds explanation despite the prompt's instruction
    tech-inquiry:
      turns:
        1: "This sounds like a technical issue with the application."

Re-run: the max_length assertion fires on v2 but not v1, so the regression is caught:
    v1 / billing-inquiry: ✓ content_includes(billing) ✓ max_length(<30)
    v1 / tech-inquiry: ✓ content_includes(technical) ✓ max_length(<30)
    v2 / billing-inquiry: ✓ content_includes(billing) ✓ max_length(<30)
    v2 / tech-inquiry: ✓ content_includes(technical) ✗ max_length(60 > 30)
    Error: execution failed: 1 runs had errors

The CI snippet below relies on fail-fast shell behaviour (GitHub Actions runs each run step under bash -e by default), so the non-zero exit fails the migration PR until the prompt is reworked or the new model is swapped out.
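To make the two assertions concrete, here is a minimal Python sketch of what checks like content_includes and max_length could look like when applied to the v1 and v2 mock responses. The matching rules (case-insensitive substring, length counted in characters, strictly below the limit) are assumptions for illustration; PromptArena's own assertions may count or match differently.

```python
MAX_LENGTH = 30  # threshold taken from the report: max_length(<30)


def content_includes(response: str, needle: str) -> bool:
    # Assumption: case-insensitive substring match.
    return needle.lower() in response.lower()


def max_length(response: str, limit: int = MAX_LENGTH) -> bool:
    # Assumption: length is counted in characters and must stay below the limit.
    return len(response) < limit


v1_response = "technical"
v2_response = "This sounds like a technical issue with the application."

for label, response in (("v1", v1_response), ("v2", v2_response)):
    checks = {
        "content_includes(technical)": content_includes(response, "technical"),
        "max_length(<30)": max_length(response),
    }
    for name, passed in checks.items():
        print(f"{label} / tech-inquiry: {'PASS' if passed else 'FAIL'} {name}")
```

Both responses contain the word "technical", so a content-only check would let the v2 answer through. The length bound is what encodes the one-word format and catches the regression.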
Adding real models
Swap the mock providers in config.arena.yaml:

    providers:
      - file: providers/openai-gpt4o.provider.yaml            # incumbent
      - file: providers/openai-gpt4o-mini.provider.yaml       # candidate
      - file: providers/anthropic-claude-haiku.provider.yaml  # alternate candidate

Same scenarios, same assertions. The report fans out across every provider; CI gates on all of them passing.
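The fan-out can be pictured as a cross product of providers and scenarios. The Python sketch below is illustrative only: the provider labels are derived from the file names above, the scenario names come from the mock example, and `plan_runs` is a hypothetical helper, not a PromptArena function.

```python
from itertools import product

# Labels derived from the provider file names; used here for illustration only.
providers = ["openai-gpt4o", "openai-gpt4o-mini", "anthropic-claude-haiku"]
scenarios = ["billing-inquiry", "tech-inquiry"]


def plan_runs(providers, scenarios):
    """Every scenario is run once against every registered provider."""
    return list(product(providers, scenarios))


runs = plan_runs(providers, scenarios)
for provider, scenario in runs:
    print(f"{provider} / {scenario}")

# The gate only opens when every (provider, scenario) pair passes.
outcomes = {run: True for run in runs}  # placeholder results for the sketch
gate_open = all(outcomes.values())
```

Each additional provider adds one run per scenario, so the cost of the suite grows linearly with the number of candidates being compared.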
CI gate
    # .github/workflows/model-migration.yml
    name: Model migration regression

    on:
      pull_request:
        paths:
          - 'examples/model-migration/**'

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-go@v5
            with:
              go-version: '1.26'
          - run: make build-arena
          - name: Run migration regression suite
            working-directory: examples/model-migration
            run: ../../bin/promptarena run --ci --formats html,json
          - name: Upload report
            if: always()
            uses: actions/upload-artifact@v4
            with:
              name: model-migration-report
              path: examples/model-migration/out/

Uploading the report as an artifact lets reviewers eyeball the per-model output on the PR, which is useful when the question is “did this prompt regress on the new model?” rather than just “did anything fail?”
Extending it
- Add a third model: drop in providers/anthropic-claude-haiku.provider.yaml, register it, run. The fan-out scales automatically.
- Behavior-equivalent assertions: outcome_equivalent lets you assert that the agent’s tool-call pattern (or workflow state, or content hash) matches an expected outcome, which is useful for migrating between models without the prompt changing.
- Per-model thresholds: if a new model has known stricter / looser behaviour on certain assertions, use when: clauses to scope thresholds per provider (see the bake-off how-to for the pattern).
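As a rough illustration of the behavior-equivalence idea in the list above, the Python sketch below compares the ordered tool-call pattern of a run against an expected pattern. It is a conceptual sketch only, not how outcome_equivalent is implemented: the tool names, the call record shape, and the helper functions are all hypothetical.

```python
def tool_pattern(calls):
    """Reduce a run to the ordered list of tool names it invoked."""
    return [call["tool"] for call in calls]


def matches_expected_outcome(expected_pattern, calls):
    """True when the run invoked the expected tools in the expected order.
    Arguments and wording are ignored on purpose."""
    return tool_pattern(calls) == expected_pattern


# Hypothetical expected behaviour for a refund scenario.
expected = ["lookup_invoice", "issue_refund"]

# Hypothetical runs from two models on the same scenario.
incumbent_calls = [
    {"tool": "lookup_invoice", "args": {"id": "INV-1"}},
    {"tool": "issue_refund", "args": {"id": "INV-1", "reason": "duplicate"}},
]
candidate_calls = [
    {"tool": "issue_refund", "args": {"id": "INV-1", "reason": "dup charge"}},
]

print("incumbent equivalent:", matches_expected_outcome(expected, incumbent_calls))
print("candidate equivalent:", matches_expected_outcome(expected, candidate_calls))
```

Comparing behaviour rather than text lets two models phrase their answers differently and still count as equivalent, while a skipped or reordered tool call still shows up as a difference.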