Duplex Streaming Architecture
Understanding how PromptArena handles bidirectional audio streaming for voice assistant testing.
What is Duplex Streaming?
Duplex streaming enables real-time bidirectional communication between your test scenario and an LLM provider. Unlike traditional request-response patterns, duplex streaming:
- Sends audio in small chunks as it’s being “spoken”
- Receives responses while still sending input
- Handles dynamic turn detection (knowing when someone stops speaking)
- Maintains a persistent WebSocket connection
This mirrors how real voice assistants work, making it essential for testing voice interfaces.
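Conceptually, the send and receive paths run concurrently over a single connection. The sketch below illustrates that pattern with Python's asyncio and the third-party websockets package; the URL, message framing, and 20ms pacing are illustrative assumptions, not PromptArena's actual client code.

```python
import asyncio
import websockets  # third-party: pip install websockets


async def send_audio(ws, chunks):
    # Keep streaming small audio chunks while the receive task runs in parallel.
    for chunk in chunks:
        await ws.send(chunk)
        await asyncio.sleep(0.02)  # roughly real-time pacing (20ms per chunk)


async def receive_responses(ws):
    # Responses arrive on the same connection while input is still being sent.
    async for message in ws:
        kind = "audio bytes" if isinstance(message, bytes) else "text event"
        print(f"received {kind}")


async def run_duplex(url, chunks):
    async with websockets.connect(url) as ws:
        await asyncio.gather(send_audio(ws, chunks), receive_responses(ws))
```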
Traditional vs Duplex Audio Testing
Traditional Audio Testing
┌──────────────────────────────────────┐
│ 1. Load entire audio file            │
│ 2. Send as single blob to provider   │
│ 3. Wait for complete transcription   │
│ 4. Get text response                 │
│ 5. Move to next turn                 │
└──────────────────────────────────────┘

Limitations:
- No real-time interaction
- Can’t test interruption handling
- Doesn’t reflect actual voice UX
- Turn boundaries are artificial
Duplex Audio Testing
┌────────────────────────────────────────────────┐
│ 1. Open WebSocket session                      │
│ 2. Stream audio chunks (640 bytes = 20ms)      │
│ 3. Provider detects speech/silence boundaries  │
│ 4. Receive audio/text response in real-time    │
│ 5. Continue streaming more input               │
└────────────────────────────────────────────────┘

Benefits:
- Tests real-time voice interaction
- Validates turn detection behavior
- Can test interruption scenarios
- Mirrors production voice assistant UX
Pipeline Architecture
Duplex testing uses the same pipeline architecture as non-duplex, with specialized stages:
┌─────────────────────────────────────────────────────────────┐
│                     Streaming Pipeline                       │
├─────────────────────────────────────────────────────────────┤
│  AudioTurnStage ──► PromptAssemblyStage ──► DuplexProvider   │
│      (VAD)            (system prompt)           Stage        │
│                                                   │          │
│                                (WebSocket to Gemini)         │
│                                                   │          │
│  MediaExternalizer ◄── ValidationStage ◄──────────┘          │
│       Stage            (guardrails)                          │
│         │                                                    │
│         ▼                                                    │
│  ArenaStateStoreSaveStage ──► Results                        │
└─────────────────────────────────────────────────────────────┘

Key Pipeline Stages
| Stage | Purpose |
|---|---|
| AudioTurnStage | Optional client-side VAD for turn detection |
| PromptAssemblyStage | Loads prompt config, adds system instruction to metadata |
| DuplexProviderStage | Creates WebSocket session, handles bidirectional I/O |
| MediaExternalizerStage | Saves audio responses to files |
| ValidationStage | Runs assertions and guardrails on responses |
| ArenaStateStoreSaveStage | Persists messages for reporting |
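To make the data flow concrete, the sketch below shows one way such a chain could be expressed. The Stage protocol, the process signature, and the string list of class names are illustrative assumptions, not PromptArena's internal API.

```python
from typing import Any, Protocol


class Stage(Protocol):
    # Hypothetical stage interface: each stage receives an element plus shared
    # pipeline metadata and hands its (possibly enriched) result downstream.
    async def process(self, element: Any, metadata: dict) -> Any: ...


# Illustrative ordering that mirrors the diagram above; the real stage classes
# and their constructor arguments are defined by PromptArena and may differ.
PIPELINE_ORDER = [
    "AudioTurnStage",            # optional client-side VAD
    "PromptAssemblyStage",       # attaches the system prompt to metadata
    "DuplexProviderStage",       # owns the WebSocket session to the provider
    "ValidationStage",           # assertions and guardrails on responses
    "MediaExternalizerStage",    # writes audio responses to files
    "ArenaStateStoreSaveStage",  # persists messages for reporting
]
```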
Session Lifecycle
Session Creation
Unlike traditional pipelines where each turn creates a new request, duplex maintains a persistent session:
First Audio Chunk Arrives
               │
               ▼
┌─────────────────────────────┐
│ Extract system_prompt from  │
│ element metadata            │
└─────────────────────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Create WebSocket session    │
│ with system instruction     │
└─────────────────────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Process audio chunks        │
│ in real-time loop           │
└─────────────────────────────┘

The session is created lazily when the first element arrives, using configuration from the pipeline metadata.
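A minimal sketch of the lazy-creation pattern, assuming a create_session factory and a system_prompt metadata key; these names are placeholders rather than PromptArena's real internals.

```python
class LazyDuplexStage:
    """Sketch of lazy session creation inside a duplex provider stage."""

    def __init__(self, create_session):
        self._create_session = create_session  # e.g. opens the WebSocket
        self._session = None

    async def process(self, element, metadata: dict):
        # The session is opened only when the first audio element arrives, so
        # the system prompt assembled by earlier stages is already in metadata.
        if self._session is None:
            self._session = await self._create_session(
                system_prompt=metadata.get("system_prompt", "")
            )
        return await self._session.send_audio(element)
```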
Turn Detection
Two modes are available for detecting when a speaker has finished: ASM (provider-native) and VAD (client-side).
ASM Mode (Provider-Native)
The provider (e.g., Gemini Live API) handles turn detection internally:
duplex:
  turn_detection:
    mode: asm

- Provider signals when user stops speaking
- Simpler configuration
- Provider-specific behavior
VAD Mode (Voice Activity Detection)
Client-side VAD with configurable thresholds:
duplex:
  turn_detection:
    mode: vad
    vad:
      silence_threshold_ms: 600
      min_speech_ms: 200

- Precise control over turn boundaries
- Consistent across providers
- Requires threshold tuning
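A toy illustration of how silence_threshold_ms and min_speech_ms are typically applied over fixed-size frames; this is not PromptArena's VAD implementation, and is_speech stands in for any per-frame speech classifier.

```python
def turn_has_ended(frames, is_speech, frame_ms=20,
                   silence_threshold_ms=600, min_speech_ms=200):
    """Return True once enough speech has been followed by enough silence."""
    speech_ms = 0
    silence_ms = 0
    for frame in frames:
        if is_speech(frame):
            speech_ms += frame_ms
            silence_ms = 0                 # any speech resets the silence run
        else:
            silence_ms += frame_ms
        # End of turn: the speaker said something substantial, then went quiet.
        if speech_ms >= min_speech_ms and silence_ms >= silence_threshold_ms:
            return True
    return False
```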
Audio Processing
Input Audio Format
Audio must be in raw PCM format:
| Parameter | Value | Reason |
|---|---|---|
| Format | Raw PCM (no headers) | Direct streaming |
| Sample Rate | 16000 Hz | Gemini requirement |
| Bit Depth | 16-bit | Standard voice quality |
| Channels | Mono | Voice doesn’t need stereo |
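If your source recordings are WAV files, a sketch like the one below (Python standard-library wave module) can strip the container header and sanity-check the format; resampling to 16000 Hz is not covered here and would need a separate tool or library.

```python
import wave


def wav_to_raw_pcm(path: str) -> bytes:
    """Extract headerless PCM bytes from a WAV file, verifying the expected format."""
    with wave.open(path, "rb") as wav:
        assert wav.getframerate() == 16000, "expected 16 kHz sample rate"
        assert wav.getsampwidth() == 2, "expected 16-bit samples"
        assert wav.getnchannels() == 1, "expected mono audio"
        return wav.readframes(wav.getnframes())
```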
Chunk Streaming
Audio is sent in small chunks to enable real-time processing:
Audio File (10 seconds)
           │
           ▼
┌───────────────────────────────────────┐
│ Chunk 1 │ Chunk 2 │  ...  │ Chunk N   │
│ (20ms)  │ (20ms)  │       │ (20ms)    │
│ 640 B   │ 640 B   │       │ 640 B     │
└───────────────────────────────────────┘
           │
           ▼
Streamed to provider via WebSocket

Chunk size calculation:
- 16000 samples/second × 2 bytes/sample × 0.02 seconds = 640 bytes per 20ms chunk
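As a quick sketch, the same arithmetic and the corresponding slicing of a PCM buffer look like this (an illustrative helper, not an Arena API):

```python
SAMPLE_RATE = 16000       # Hz, 16-bit mono PCM
BYTES_PER_SAMPLE = 2
CHUNK_MS = 20

CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 640 bytes


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield fixed-size 20ms chunks from a raw PCM buffer."""
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]  # the final chunk may be shorter
```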
Burst Mode vs Real-time Mode
Burst Mode (Default for Testing)
Sends all audio as fast as possible:
[Chunk 1][Chunk 2][Chunk 3]...[Chunk N] ──► Provider
                                               │
                                               ▼
                                           Response

Best for: Pre-recorded audio, avoiding false turn detections from natural pauses.
Real-time Mode
Paces audio to match actual speech timing:
[Chunk 1] ─(20ms)─► [Chunk 2] ─(20ms)─► [Chunk 3] ...

Best for: Testing real-time interaction, interruption handling.
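The difference between the two modes amounts to whether the sender sleeps between chunks. A minimal sketch of both, assuming a session object with a send_audio coroutine (a placeholder, not the real interface):

```python
import asyncio


async def send_burst(session, chunks):
    # Burst mode: push every chunk immediately, so the provider sees one
    # uninterrupted utterance with no artificial pauses mid-speech.
    for chunk in chunks:
        await session.send_audio(chunk)


async def send_realtime(session, chunks, chunk_ms=20):
    # Real-time mode: pace chunks to wall-clock speech timing so that turn
    # detection and interruption behaviour match a live conversation.
    for chunk in chunks:
        await session.send_audio(chunk)
        await asyncio.sleep(chunk_ms / 1000)
```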
Self-Play with TTS
For fully automated testing, self-play mode uses TTS to generate audio dynamically:
┌─────────────────────────────────────────────────────────────┐
│                     Self-Play Turn Flow                      │
├─────────────────────────────────────────────────────────────┤
│ 1. Collect conversation history from state store            │
│              │                                               │
│              ▼                                               │
│ 2. Send history to self-play LLM with persona prompt        │
│              │                                               │
│              ▼                                               │
│ 3. LLM generates next user message (text)                    │
│              │                                               │
│              ▼                                               │
│ 4. TTS converts text to audio                                │
│              │                                               │
│              ▼                                               │
│ 5. Stream audio to duplex session                            │
│              │                                               │
│              ▼                                               │
│ 6. Capture and validate response                             │
└─────────────────────────────────────────────────────────────┘

This enables testing multi-turn voice conversations without pre-recording audio files.
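A sketch of one self-play turn; every helper here (generate_reply, synthesize_speech, send_turn, and the state-store methods) is a stand-in for whatever your setup provides, not an actual PromptArena API.

```python
async def self_play_turn(session, state_store, persona_prompt,
                         generate_reply, synthesize_speech):
    """Drive one simulated user turn against an open duplex session."""
    history = state_store.get_messages()                        # 1. conversation so far
    user_text = await generate_reply(persona_prompt, history)   # 2-3. LLM plays the user
    pcm_audio = await synthesize_speech(user_text)              # 4. TTS to 16 kHz PCM
    response = await session.send_turn(pcm_audio)               # 5. stream into the session
    state_store.append(user_text, response)                     # 6. persist for validation
    return response
```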
Error Handling and Resilience
Voice sessions are inherently less stable than text sessions due to:
- Network latency variations
- Provider-side connection limits
- Audio processing delays
- Turn detection edge cases
Resilience Configuration
Section titled “Resilience Configuration”duplex: resilience: max_retries: 2 # Retry failed sessions retry_delay_ms: 2000 # Wait between retries inter_turn_delay_ms: 500 # Pause between turns partial_success_min_turns: 2 # Accept if N turns succeed ignore_last_turn_session_end: truePartial Success
Partial Success
Not all tests need to complete every turn. For exploratory testing:
resilience:
  partial_success_min_turns: 3   # Success if 3+ turns complete

This allows testing to continue even when sessions end early.
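Interpreted as a check at the end of a run, the setting amounts to something like the following (an illustrative reading of the semantics, not Arena's actual evaluation logic):

```python
def meets_partial_success(completed_turns: int, planned_turns: int,
                          partial_success_min_turns: int = 3) -> bool:
    # A run that ends early still passes if it completed at least the
    # configured minimum number of turns (or finished every planned turn).
    return (completed_turns >= partial_success_min_turns
            or completed_turns == planned_turns)
```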
Comparison with SDK Duplex
Both Arena and the SDK use the same underlying runtime for duplex streaming:
| Aspect | Arena | SDK |
|---|---|---|
| Pipeline builder | Internal | Configurable |
| Session lifecycle | Managed by executor | Managed by application |
| State storage | Arena state store | Application-provided |
| Use case | Automated testing | Production applications |
The runtime/streaming package provides shared utilities for both.
Design Decisions
Why Pipeline-First Architecture?
The pipeline runs before session creation because:
- Consistency: Same pattern as non-duplex pipelines
- Flexibility: Prompt assembly can vary per scenario
- Validation: Guardrails apply to all response types
- Debugging: Each stage can be inspected independently
Why Lazy Session Creation?
Sessions are created when the first audio arrives because:
- Configuration: System prompt comes from pipeline metadata
- Resource efficiency: Don’t create sessions that won’t be used
- Error handling: Pipeline errors are caught before any session cost is incurred
Why Burst Mode for Pre-recorded Audio?
Provider turn detection can trigger mid-utterance on natural speech pauses. Burst mode sends all audio before any turn detection occurs, preventing “user interrupted” false positives.
See Also
- Tutorial: Duplex Voice Testing - Hands-on guide
- Duplex Configuration Reference - Full config options
- Testing Philosophy - Core testing principles