Assert per-turn latency budgets
This how-to walks through examples/voice-latency-budget/ — a small scenario that asserts every turn’s provider latency against an explicit max_ms budget. Useful for gating deploys on regressions, comparing providers on the same scenario, and proving an agent stays inside a real-time budget.
What it proves
LLM-driven systems silently slow down: a small prompt change adds a hidden retrieval step, a provider quietly degrades, a tool call gets retried, an agent loops one extra round. Pure single-turn eval misses this: the response looks right, it’s just slow. PromptArena makes latency a first-class signal:
- The provider stage records LatencyMs on every assistant message (LLM round-trip including any in-turn tool-call rounds).
- Arena bridges LatencyMs into the eval context metadata as latency_ms so the standard latency_budget assertion reads it without any custom plumbing.
- Scenarios gate each turn: max_ms: 1000 fails any reply slower than a second.
The assertion shape
```yaml
turns:
  - role: user
    content: "Hi, I need help with my account."
    assertions:
      - type: latency_budget
        params:
          max_ms: 1000
        message: "Turn must respond within 1000ms"
```
latency_budget returns a score normalised to the budget: min(1.0, max_ms / latency_ms). A reply within budget scores 1.0; a reply at 2× the budget scores 0.5. The HTML report shows the exact latency_ms vs budget_ms per turn.
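To make the scoring rule concrete, here is the formula worked by hand for a 1000 ms budget. The four latencies are made-up illustrative values, not measurements from any provider.

```latex
\[
\text{score} = \min\!\left(1.0,\ \frac{\text{max\_ms}}{\text{latency\_ms}}\right),
\qquad \text{max\_ms} = 1000
\]
\begin{align*}
\text{latency\_ms} = 400:  &\quad \min(1.0,\ 1000/400)  = \min(1.0,\ 2.5) = 1.0 \\
\text{latency\_ms} = 1000: &\quad \min(1.0,\ 1000/1000) = 1.0 \\
\text{latency\_ms} = 2000: &\quad \min(1.0,\ 1000/2000) = 0.5 \\
\text{latency\_ms} = 4000: &\quad \min(1.0,\ 1000/4000) = 0.25
\end{align*}
```

The score is capped at 1.0, so it does not distinguish a 400 ms reply from a 1000 ms one. Below 1.0 it falls off in inverse proportion to the overshoot.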
Run it
```shell
cd examples/voice-latency-budget
promptarena serve
```
serve loads the scenario into the web UI; the timeline view shows the latency assertion alongside the conversation. Headless:
```shell
promptarena run --ci --formats html,json
open out/report.html
```
The default config runs against a text mock provider with sub-millisecond responses, so the budget passes trivially. The interesting signal comes from real providers.
Comparing providers
Add multiple provider files and re-register them in config.arena.yaml. Arena fans out the scenario across every registered provider; the HTML report shows the per-provider latency_ms distribution side by side. Use this for migration testing (“does Claude Haiku stay inside our 800ms budget on this prompt?”) or for cross-provider bake-offs.
What’s measured today
latency_budget checks the total provider-call duration per turn: LLM round-trip time, including any tool-call rounds that happen within that turn. It’s a coarse “is this turn fast enough” signal, well suited to gating regressions.
What richer voice testing wants — and what’s coming next — is per-metric capture:
- TTFB — time to first token / first audio frame
- First-audio — time from user-input-end to first audio-out (duplex providers)
- End-of-turn delta — silence-detection latency between generation-complete and turn-complete
Provider stages capture some of these timings internally; they just don’t yet flow into the eval context as named metadata keys. When they do, latency_budget will accept per-metric thresholds (max_ttfb_ms, max_first_audio_ms, …); the existing single-metric usage continues to work.
CI gate
The mock-provider path runs keyless, so the demo fits a fork-safe CI job:
```yaml
# .github/workflows/voice-latency-budget.yml
name: Voice latency budget

on:
  pull_request:
    paths:
      - 'examples/voice-latency-budget/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.26'
      - run: make build-arena
      - name: Run latency-budget scenarios
        working-directory: examples/voice-latency-budget
        run: ../../bin/promptarena run --ci --formats json
```
For real-provider runs, gate on the same step with provider keys via secrets:, and bump max_ms to a value that reflects your production budget.
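A real-provider variant of the run step might look like the sketch below. The secret name and the environment variable OPENAI_API_KEY are assumptions for illustration; use whatever variable your provider configuration actually reads.

```yaml
      # Sketch of a real-provider step; replaces the mock-provider run step.
      - name: Run latency-budget scenarios (real provider)
        working-directory: examples/voice-latency-budget
        env:
          # Assumption: the configured provider reads its key from this
          # variable, and a repository secret of the same name exists.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: ../../bin/promptarena run --ci --formats json
```

GitHub does not pass repository secrets to pull_request workflows triggered from forks, so a step like this fits a job that runs on pushes to branches in the main repository, with the keyless mock job left as the fork-safe check.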
Extending it
- Tighter budgets per turn: vary max_ms per turn; the greeting might be fast, the tool-using turn slower.
- Mix with content assertions: combine latency_budget with content_includes / llm_judge to test both correctness and speed.
- Per-provider thresholds: when running across providers, the assertion config can include when: clauses to apply different budgets per provider.
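As a sketch of the first extension, per-turn budgets reuse the assertion shape shown earlier with a different max_ms on each turn. The second user message and both budget values are hypothetical, chosen only to contrast a quick greeting with a slower tool-using turn.

```yaml
turns:
  - role: user
    content: "Hi, I need help with my account."
    assertions:
      - type: latency_budget
        params:
          max_ms: 800      # illustrative: tight budget for a simple greeting
        message: "Greeting must respond within 800ms"
  - role: user
    content: "Can you look up my most recent invoice?"   # hypothetical tool-using turn
    assertions:
      - type: latency_budget
        params:
          max_ms: 3000     # illustrative: looser budget to allow tool-call rounds
        message: "Tool-using turn must respond within 3000ms"
```

Because the measured latency includes any in-turn tool-call rounds, the looser budget belongs on the turn expected to call tools, not spread across the whole scenario.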