Duplex Streaming Architecture

Understanding how PromptArena handles bidirectional audio streaming for voice assistant testing.

Duplex streaming enables real-time bidirectional communication between your test scenario and an LLM provider. Unlike traditional request-response patterns, duplex streaming:

  • Sends audio in small chunks as it’s being “spoken”
  • Receives responses while still sending input
  • Handles dynamic turn detection (knowing when someone stops speaking)
  • Maintains a persistent WebSocket connection

This mirrors how real voice assistants work, making it essential for testing voice interfaces.

The traditional (non-duplex) approach:

┌──────────────────────────────────────────────────┐
│ 1. Load entire audio file                        │
│ 2. Send as single blob to provider               │
│ 3. Wait for complete transcription               │
│ 4. Get text response                             │
│ 5. Move to next turn                             │
└──────────────────────────────────────────────────┘

Limitations:

  • No real-time interaction
  • Can’t test interruption handling
  • Doesn’t reflect actual voice UX
  • Turn boundaries are artificial

The duplex approach:

┌──────────────────────────────────────────────────┐
│ 1. Open WebSocket session                        │
│ 2. Stream audio chunks (640 bytes = 20 ms)       │
│ 3. Provider detects speech/silence boundaries    │
│ 4. Receive audio/text response in real time      │
│ 5. Continue streaming more input                 │
└──────────────────────────────────────────────────┘

Benefits:

  • Tests real-time voice interaction
  • Validates turn detection behavior
  • Can test interruption scenarios
  • Mirrors production voice assistant UX
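
The send and receive sides of a duplex session run concurrently rather than in request-response lockstep. A minimal sketch of that shape, where a pair of asyncio queues stands in for the real WebSocket (the actual provider endpoint is out of scope here):

```python
import asyncio

async def send_audio(outbound: asyncio.Queue, chunks: list[bytes]) -> None:
    """Stream audio chunks to the provider, then signal end of input."""
    for chunk in chunks:
        await outbound.put(chunk)
    await outbound.put(None)  # sentinel: no more input

async def receive_responses(inbound: asyncio.Queue, out: list) -> None:
    """Collect provider responses until the session closes."""
    while (msg := await inbound.get()) is not None:
        out.append(msg)

async def fake_provider(outbound: asyncio.Queue, inbound: asyncio.Queue) -> None:
    """Stand-in for the provider: replies once per chunk received."""
    n = 0
    while (chunk := await outbound.get()) is not None:
        n += 1
        await inbound.put(f"response-{n}")
    await inbound.put(None)

async def run_session(chunks: list[bytes]) -> list[str]:
    outbound, inbound = asyncio.Queue(), asyncio.Queue()
    responses: list[str] = []
    # Sending and receiving run at the same time -- the essence of duplex.
    await asyncio.gather(
        send_audio(outbound, chunks),
        fake_provider(outbound, inbound),
        receive_responses(inbound, responses),
    )
    return responses

replies = asyncio.run(run_session([b"\x00" * 640] * 3))
```

Responses arrive while input is still being queued, which is what makes interruption testing possible at all.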

Duplex testing uses the same pipeline architecture as non-duplex, with specialized stages:

┌─────────────────────────────────────────────────────────────┐
│                     Streaming Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  AudioTurnStage ──► PromptAssemblyStage ──► DuplexProvider  │
│      (VAD)            (system prompt)          Stage        │
│                                                  │          │
│                                     (WebSocket to Gemini)   │
│                                                  │          │
│  MediaExternalizer ◄── ValidationStage ◄─────────┘          │
│       Stage             (guardrails)                        │
│         │                                                   │
│         ▼                                                   │
│  ArenaStateStoreSaveStage ──► Results                       │
└─────────────────────────────────────────────────────────────┘
Stage                     Purpose
AudioTurnStage            Optional client-side VAD for turn detection
PromptAssemblyStage       Loads prompt config, adds system instruction to metadata
DuplexProviderStage       Creates WebSocket session, handles bidirectional I/O
MediaExternalizerStage    Saves audio responses to files
ValidationStage           Runs assertions and guardrails on responses
ArenaStateStoreSaveStage  Persists messages for reporting
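
The table above implies a simple composition model: each element flows through the stages in order. A minimal sketch of that idea, with the stage interface (plain dict elements, synchronous calls) assumed for illustration rather than taken from the real implementation:

```python
from typing import Any, Callable

# A stage maps an element to an element; the pipeline threads each
# element through every stage in order. The real stages also carry
# metadata, state stores, and async I/O, which are omitted here.
Stage = Callable[[dict[str, Any]], dict[str, Any]]

def build_pipeline(stages: list[Stage]) -> Stage:
    def run(element: dict[str, Any]) -> dict[str, Any]:
        for stage in stages:
            element = stage(element)
        return element
    return run

# Toy stand-ins for two of the stages in the table.
def prompt_assembly_stage(element):
    element.setdefault("metadata", {})["system_prompt"] = "You are a voice assistant."
    return element

def validation_stage(element):
    element["validated"] = True
    return element

pipeline = build_pipeline([prompt_assembly_stage, validation_stage])
result = pipeline({"audio": b"\x00" * 640})
```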

Unlike traditional pipelines where each turn creates a new request, duplex maintains a persistent session:

First Audio Chunk Arrives
             │
             ▼
┌─────────────────────────────┐
│ Extract system_prompt from  │
│ element metadata            │
└─────────────────────────────┘
             │
             ▼
┌─────────────────────────────┐
│ Create WebSocket session    │
│ with system instruction     │
└─────────────────────────────┘
             │
             ▼
┌─────────────────────────────┐
│ Process audio chunks        │
│ in real-time loop           │
└─────────────────────────────┘

The session is created lazily when the first element arrives, using configuration from the pipeline metadata.
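
Lazy creation amounts to a guard on the first element. A sketch of that pattern, where `open_session`, the stub session, and the metadata keys are illustrative stand-ins rather than the actual API:

```python
class DuplexProviderStage:
    """Creates the provider session only when the first element arrives."""

    def __init__(self, open_session):
        self._open_session = open_session  # injected factory, e.g. a WebSocket opener
        self._session = None

    def process(self, element: dict) -> dict:
        if self._session is None:
            # Configuration (system prompt etc.) comes from pipeline
            # metadata attached to the element by earlier stages.
            system_prompt = element.get("metadata", {}).get("system_prompt", "")
            self._session = self._open_session(system_prompt)
        self._session.send(element["audio"])
        return element

# Stub session for demonstration.
class StubSession:
    def __init__(self, system_prompt):
        self.system_prompt = system_prompt
        self.sent = []

    def send(self, chunk):
        self.sent.append(chunk)

stage = DuplexProviderStage(StubSession)
stage.process({"metadata": {"system_prompt": "be brief"}, "audio": b"a"})
stage.process({"audio": b"b"})  # same session reused; no re-creation
```

If the pipeline fails before any audio arrives, no session is ever opened, which is the resource-efficiency argument made later in this page.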

Two modes are available for detecting when a speaker has finished: ASM (provider-native) and VAD (client-side).

In ASM mode, the provider (e.g., the Gemini Live API) handles turn detection internally:

duplex:
  turn_detection:
    mode: asm

  • Provider signals when user stops speaking
  • Simpler configuration
  • Provider-specific behavior

In VAD mode, the client runs voice activity detection with configurable thresholds:

duplex:
  turn_detection:
    mode: vad
    vad:
      silence_threshold_ms: 600
      min_speech_ms: 200

  • Precise control over turn boundaries
  • Consistent across providers
  • Requires threshold tuning
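
A client-side VAD of this shape can be sketched as an energy gate over 20 ms frames. The `silence_threshold_ms` and `min_speech_ms` parameters mirror the config above, while the RMS energy cutoff is an assumed tuning knob:

```python
def detect_turn_end(frames, silence_threshold_ms=600, min_speech_ms=200,
                    frame_ms=20, energy_cutoff=500.0):
    """Return True once enough speech is followed by enough silence.

    `frames` is a sequence of per-frame RMS energies (floats); real code
    would compute these from the raw 16-bit PCM samples of each chunk.
    """
    speech_ms = 0
    silence_ms = 0
    for energy in frames:
        if energy >= energy_cutoff:   # frame contains speech
            speech_ms += frame_ms
            silence_ms = 0            # a silence run resets on speech
        else:
            silence_ms += frame_ms
            if speech_ms >= min_speech_ms and silence_ms >= silence_threshold_ms:
                return True           # speaker has finished their turn
    return False

# 300 ms of speech followed by 700 ms of silence -> turn end detected
frames = [900.0] * 15 + [10.0] * 35
```

This is why threshold tuning matters: a `silence_threshold_ms` shorter than a speaker's natural pauses will split one utterance into two turns.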

Audio must be in raw PCM format:

Parameter     Value                 Reason
Format        Raw PCM (no headers)  Direct streaming
Sample rate   16000 Hz              Gemini requirement
Bit depth     16-bit                Standard voice quality
Channels      Mono                  Voice doesn’t need stereo
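
Stripping a WAV container down to raw PCM that matches this table can be done with Python's standard library; a minimal sketch:

```python
import io
import wave

def wav_to_raw_pcm(wav_bytes: bytes) -> bytes:
    """Validate a WAV payload against the required format and return raw PCM."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        assert wav.getframerate() == 16000, "16 kHz required"
        assert wav.getsampwidth() == 2, "16-bit samples required"
        assert wav.getnchannels() == 1, "mono required"
        return wav.readframes(wav.getnframes())  # sample data only, no header

# Build a 1-second silent WAV in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

pcm = wav_to_raw_pcm(buf.getvalue())
```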

Audio is sent in small chunks to enable real-time processing:

Audio File (10 seconds)
┌───────────────────────────────────────┐
│ Chunk 1 │ Chunk 2 │ ... │ Chunk N │
│ (20ms) │ (20ms) │ │ (20ms) │
│ 640 B │ 640 B │ │ 640 B │
└───────────────────────────────────────┘
Streamed to provider via WebSocket

Chunk size calculation:

  • 16000 samples/second × 2 bytes/sample × 0.02 seconds = 640 bytes per 20ms chunk
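
That arithmetic maps directly to a chunking helper; a minimal sketch:

```python
SAMPLE_RATE = 16000    # Hz, per the format table
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 20

def chunk_size_bytes(chunk_ms: int = CHUNK_MS) -> int:
    # 16000 samples/s * 2 bytes/sample * 0.02 s = 640 bytes
    return SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000

def chunk_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list[bytes]:
    """Split raw PCM into fixed-size chunks for streaming."""
    size = chunk_size_bytes(chunk_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]

# 10 seconds of silence -> 500 chunks of 640 bytes each
chunks = chunk_audio(b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE * 10))
```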

Burst mode sends all audio as fast as possible:

[Chunk 1][Chunk 2][Chunk 3]...[Chunk N] ──► Provider
Response

Best for: Pre-recorded audio, avoiding false turn detections from natural pauses.

The alternative, paced streaming, matches actual speech timing:

[Chunk 1] ─(20ms)─► [Chunk 2] ─(20ms)─► [Chunk 3] ...

Best for: Testing real-time interaction, interruption handling.
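
The two strategies differ only in whether a real-time delay follows each chunk. A sketch using asyncio, with `send` standing in for any awaitable sender:

```python
import asyncio

async def stream_chunks(chunks, send, paced: bool, chunk_ms: int = 20) -> None:
    """Send chunks back-to-back (burst) or at real-time speed (paced)."""
    for chunk in chunks:
        await send(chunk)
        if paced:
            # Sleep for the chunk's audio duration so the provider
            # receives it at the speed it was spoken.
            await asyncio.sleep(chunk_ms / 1000)

async def demo() -> list[bytes]:
    sent: list[bytes] = []

    async def send(chunk: bytes) -> None:
        sent.append(chunk)

    await stream_chunks([b"x"] * 3, send, paced=False)  # burst: no delays
    await stream_chunks([b"y"] * 2, send, paced=True)   # paced: ~20 ms apart
    return sent

sent = asyncio.run(demo())
```

Paced streaming leaves real gaps on the wire, which is exactly what lets provider-side turn detection fire mid-stream; burst streaming removes those gaps.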

For fully automated testing, self-play mode uses TTS to generate audio dynamically:

┌─────────────────────────────────────────────────────────────┐
│ Self-Play Turn Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Collect conversation history from state store │
│ │ │
│ ▼ │
│ 2. Send history to self-play LLM with persona prompt │
│ │ │
│ ▼ │
│ 3. LLM generates next user message (text) │
│ │ │
│ ▼ │
│ 4. TTS converts text to audio │
│ │ │
│ ▼ │
│ 5. Stream audio to duplex session │
│ │ │
│ ▼ │
│ 6. Capture and validate response │
│ │
└─────────────────────────────────────────────────────────────┘

This enables testing multi-turn voice conversations without pre-recording audio files.
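
The six steps reduce to a loop over three pluggable components. In the sketch below, the LLM, TTS engine, and duplex session are placeholder stubs rather than real APIs:

```python
def self_play_turns(history, next_user_message, text_to_speech, session, max_turns=3):
    """Drive a multi-turn voice conversation with generated user audio.

    `next_user_message` stands in for the self-play LLM (with persona
    prompt baked in), `text_to_speech` for the TTS engine, and
    `session` for the duplex session.
    """
    for _ in range(max_turns):
        text = next_user_message(history)     # steps 1-3: generate next user turn
        audio = text_to_speech(text)          # step 4: synthesize audio
        reply = session.stream(audio)         # step 5: stream to duplex session
        history.append({"user": text, "assistant": reply})  # step 6: capture
    return history

# Minimal stubs to show the shape of the loop.
class EchoSession:
    def stream(self, audio: bytes) -> str:
        return f"heard {len(audio)} bytes"

history = self_play_turns(
    history=[],
    next_user_message=lambda h: f"turn {len(h) + 1}",
    text_to_speech=lambda text: text.encode(),
    session=EchoSession(),
)
```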

Voice sessions are inherently less stable than text sessions due to:

  • Network latency variations
  • Provider-side connection limits
  • Audio processing delays
  • Turn detection edge cases

These resilience options compensate for that instability:

duplex:
  resilience:
    max_retries: 2                   # Retry failed sessions
    retry_delay_ms: 2000             # Wait between retries
    inter_turn_delay_ms: 500         # Pause between turns
    partial_success_min_turns: 2     # Accept if N turns succeed
    ignore_last_turn_session_end: true
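
A retry loop of the shape these options configure might look like the following; the parameter names mirror the config, and `run_session` is a hypothetical session runner:

```python
import time

def run_with_retries(run_session, max_retries=2, retry_delay_ms=2000):
    """Retry a flaky voice session a bounded number of times."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return run_session()
        except ConnectionError as err:
            last_error = err
            if attempt < max_retries:
                time.sleep(retry_delay_ms / 1000)  # wait before the next attempt
    raise last_error

# Demonstrate with a session that fails once, then succeeds.
attempts = {"n": 0}

def flaky_session():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("session dropped")
    return "ok"

result = run_with_retries(flaky_session, max_retries=2, retry_delay_ms=1)
```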

Not all tests need to complete every turn. For exploratory testing:

resilience:
  partial_success_min_turns: 3   # Success if 3+ turns complete

This allows testing to continue even when sessions end early.
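
The check itself is simple; a sketch, with field names following the config above:

```python
def session_outcome(completed_turns: int, planned_turns: int,
                    partial_success_min_turns: int) -> str:
    """Classify a duplex session that may have ended early."""
    if completed_turns >= planned_turns:
        return "success"
    if completed_turns >= partial_success_min_turns:
        return "partial_success"  # enough turns completed to keep the run
    return "failure"

# 4 of 6 planned turns completed, threshold of 3 -> partial_success
outcome = session_outcome(completed_turns=4, planned_turns=6,
                          partial_success_min_turns=3)
```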

Both Arena and the SDK use the same underlying runtime for duplex streaming:

Aspect              Arena                SDK
Pipeline builder    Internal             Configurable
Session lifecycle   Managed by executor  Managed by application
State storage       Arena state store    Application-provided
Use case            Automated testing    Production applications

The runtime/streaming package provides shared utilities for both.

The pipeline runs before session creation because:

  1. Consistency: Same pattern as non-duplex pipelines
  2. Flexibility: Prompt assembly can vary per scenario
  3. Validation: Guardrails apply to all response types
  4. Debugging: Each stage can be inspected independently

Sessions are created when the first audio arrives because:

  1. Configuration: System prompt comes from pipeline metadata
  2. Resource efficiency: Don’t create sessions that won’t be used
  3. Error handling: Pipeline errors caught before session cost

Provider turn detection can trigger mid-utterance with natural speech pauses. Burst mode sends all audio before any turn detection occurs, preventing “user interrupted” false positives.