Duplex Streaming Architecture

Understanding how PromptArena handles bidirectional audio streaming for voice assistant testing.

What is Duplex Streaming?

Duplex streaming enables real-time bidirectional communication between your test scenario and an LLM provider. Unlike traditional request-response patterns, duplex streaming:

  - Streams audio input in small chunks rather than sending a complete file
  - Receives audio and text responses while input is still being sent
  - Maintains a single persistent session across conversation turns

This mirrors how real voice assistants work, making it essential for testing voice interfaces.

Traditional vs Duplex Audio Testing

Traditional Audio Testing

┌──────────────────────────────────────────────────┐
│ 1. Load entire audio file                        │
│ 2. Send as single blob to provider               │
│ 3. Wait for complete transcription               │
│ 4. Get text response                             │
│ 5. Move to next turn                             │
└──────────────────────────────────────────────────┘

Limitations:

  - The full audio file must exist before the test can run
  - Provider turn detection and interruption handling are never exercised
  - Latency behavior does not reflect real voice interactions

Duplex Audio Testing

┌──────────────────────────────────────────────────┐
│ 1. Open WebSocket session                        │
│ 2. Stream audio chunks (640 bytes = 20ms)        │
│ 3. Provider detects speech/silence boundaries    │
│ 4. Receive audio/text response in real-time      │
│ 5. Continue streaming more input                 │
└──────────────────────────────────────────────────┘

Benefits:

  - Mirrors how production voice assistants stream audio
  - Exercises provider-side speech/silence turn detection
  - Supports multi-turn conversations over one persistent session

Pipeline Architecture

Duplex testing uses the same pipeline architecture as non-duplex, with specialized stages:

┌─────────────────────────────────────────────────────────────┐
│                     Streaming Pipeline                       │
├─────────────────────────────────────────────────────────────┤
│  AudioTurnStage ──► PromptAssemblyStage ──► DuplexProvider  │
│       (VAD)              (system prompt)         Stage       │
│                                                    │         │
│                                              ◄─────┘         │
│                                        (WebSocket to Gemini) │
│                                              │               │
│  MediaExternalizer ◄── ValidationStage ◄─────┘              │
│       Stage              (guardrails)                        │
│         │                                                    │
│         ▼                                                    │
│  ArenaStateStoreSaveStage ──► Results                       │
└─────────────────────────────────────────────────────────────┘

Key Pipeline Stages

Stage                         Purpose
AudioTurnStage                Optional client-side VAD for turn detection
PromptAssemblyStage           Loads prompt config, adds system instruction to metadata
DuplexProviderStage           Creates WebSocket session, handles bidirectional I/O
MediaExternalizerStage        Saves audio responses to files
ValidationStage               Runs assertions and guardrails on responses
ArenaStateStoreSaveStage      Persists messages for reporting

Session Lifecycle

Session Creation

Unlike traditional pipelines where each turn creates a new request, duplex maintains a persistent session:

First Audio Chunk Arrives

┌─────────────────────────────┐
│ Extract system_prompt from  │
│ element metadata            │
└─────────────────────────────┘
              │
              ▼
┌─────────────────────────────┐
│ Create WebSocket session    │
│ with system instruction     │
└─────────────────────────────┘
              │
              ▼
┌─────────────────────────────┐
│ Process audio chunks        │
│ in real-time loop           │
└─────────────────────────────┘

The session is created lazily when the first element arrives, using configuration from the pipeline metadata.
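The lazy-creation pattern can be sketched as follows. This is a minimal illustration, not PromptArena's actual code; `connect`, the `metadata` dict, and the session object are hypothetical stand-ins:

```python
import asyncio


class LazyDuplexSession:
    """Sketch: the WebSocket session is opened only when the first audio
    chunk arrives, using metadata produced by earlier pipeline stages."""

    def __init__(self, connect, metadata):
        self._connect = connect      # coroutine that opens a provider session
        self._metadata = metadata    # pipeline metadata (system prompt, etc.)
        self._session = None

    async def send_chunk(self, chunk: bytes):
        if self._session is None:    # first chunk: create the session now
            system_prompt = self._metadata.get("system_prompt", "")
            self._session = await self._connect(system_prompt)
        await self._session.send(chunk)
```

Because creation happens inside the first `send_chunk`, a pipeline that fails before producing any audio never opens a provider connection.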

Turn Detection

Two modes are available for detecting when a speaker has finished:

ASM Mode (Provider-Native)

The provider (e.g., Gemini Live API) handles turn detection internally:

duplex:
  turn_detection:
    mode: asm

VAD Mode (Voice Activity Detection)

Client-side VAD with configurable thresholds:

duplex:
  turn_detection:
    mode: vad
    vad:
      silence_threshold_ms: 600
      min_speech_ms: 200

Audio Processing

Input Audio Format

Audio must be in raw PCM format:

Parameter     Value                  Reason
Format        Raw PCM (no headers)   Direct streaming
Sample Rate   16000 Hz               Gemini requirement
Bit Depth     16-bit                 Standard voice quality
Channels      Mono                   Voice doesn't need stereo

Chunk Streaming

Audio is sent in small chunks to enable real-time processing:

Audio File (10 seconds)
              │
              ▼
┌───────────────────────────────────────┐
│  Chunk 1  │  Chunk 2  │ ... │ Chunk N │
│  (20ms)   │  (20ms)   │     │  (20ms) │
│  640 B    │  640 B    │     │  640 B  │
└───────────────────────────────────────┘
              │
              ▼
Streamed to provider via WebSocket

Chunk size calculation:

  16,000 samples/s × 2 bytes/sample × 0.020 s = 640 bytes per chunk

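The chunking itself can be sketched as a simple generator over the raw PCM buffer (illustrative, not PromptArena's internal code):

```python
CHUNK_BYTES = 640  # 16000 Hz * 2 bytes/sample * 0.020 s


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield successive 20 ms chunks of raw PCM for streaming.
    The final chunk may be shorter than chunk_bytes."""
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]
```

For the 10-second file above (320,000 bytes of PCM), this yields 500 chunks of 640 bytes each.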
Burst Mode vs Real-time Mode

Burst Mode (Default for Testing)

Sends all audio as fast as possible:

[Chunk 1][Chunk 2][Chunk 3]...[Chunk N] ──► Provider
                                               │
                                               ▼
                                           Response

Best for: Pre-recorded audio, avoiding false turn detections from natural pauses.

Real-time Mode

Paces audio to match actual speech timing:

[Chunk 1] ─(20ms)─► [Chunk 2] ─(20ms)─► [Chunk 3] ...

Best for: Testing real-time interaction, interruption handling.
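The two modes differ only in pacing, which a sender sketch makes concrete (`stream_audio` and its `send` callable are hypothetical names for illustration):

```python
import time


def stream_audio(chunks, send, realtime: bool = False, frame_ms: int = 20):
    """Send PCM chunks either as fast as possible (burst mode) or paced
    to real speech timing (real-time mode). `send` is any callable that
    delivers one chunk to the provider session."""
    for chunk in chunks:
        send(chunk)
        if realtime:
            time.sleep(frame_ms / 1000)  # pace chunks to playback speed
```

In burst mode the loop finishes in microseconds; in real-time mode a 10-second file takes roughly 10 seconds to send.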

Self-Play with TTS

For fully automated testing, self-play mode generates audio dynamically:

┌─────────────────────────────────────────────────────────────┐
│                    Self-Play Turn Flow                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Collect conversation history from state store           │
│                          │                                   │
│                          ▼                                   │
│  2. Send history to self-play LLM with persona prompt        │
│                          │                                   │
│                          ▼                                   │
│  3. LLM generates next user message (text)                  │
│                          │                                   │
│                          ▼                                   │
│  4. TTS converts text to audio                              │
│                          │                                   │
│                          ▼                                   │
│  5. Stream audio to duplex session                          │
│                          │                                   │
│                          ▼                                   │
│  6. Capture and validate response                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

This enables testing multi-turn voice conversations without pre-recording audio files.
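The six steps above can be sketched as one turn function. All four callables are injection points with hypothetical names (an LLM client, a TTS engine, the duplex session, and a response collector); this is a shape sketch, not PromptArena's API:

```python
def self_play_turn(history, generate_reply, tts, session_send, capture_response):
    """Run one self-play turn: LLM writes the next user message, TTS
    voices it, the audio is streamed, and the reply is recorded."""
    user_text = generate_reply(history)   # steps 1-3: next user message (text)
    audio = tts(user_text)                # step 4: synthesize speech
    session_send(audio)                   # step 5: stream into duplex session
    response = capture_response()         # step 6: collect assistant reply
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": response})
    return response
```

Running this in a loop yields a multi-turn voice conversation with no pre-recorded audio.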

Error Handling and Resilience

Voice sessions are inherently less stable than text sessions due to:

  - Long-lived WebSocket connections that can drop mid-conversation
  - Provider-side turn detection that can misfire on natural pauses
  - Session limits that can end conversations early

Resilience Configuration

duplex:
  resilience:
    max_retries: 2              # Retry failed sessions
    retry_delay_ms: 2000        # Wait between retries
    inter_turn_delay_ms: 500    # Pause between turns
    partial_success_min_turns: 2 # Accept if N turns succeed
    ignore_last_turn_session_end: true

Partial Success

Not all tests need to complete every turn. For exploratory testing:

resilience:
  partial_success_min_turns: 3  # Success if 3+ turns complete

This allows testing to continue even when sessions end early.
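The interaction between retries and partial success can be sketched as follows (a simplified model under the assumption that an early session end surfaces as an exception; `run_with_resilience` is an illustrative name, and `inter_turn_delay_ms` is omitted for brevity):

```python
def run_with_resilience(turns, run_turn, max_retries=2,
                        partial_success_min_turns=2):
    """Retry the session up to max_retries times; treat a run as
    successful if all turns finish, or if at least
    partial_success_min_turns turns completed before the session ended."""
    for attempt in range(max_retries + 1):
        completed = 0
        try:
            for turn in turns:
                run_turn(turn)
                completed += 1
            return True                       # every turn completed
        except RuntimeError:                  # session ended early
            if completed >= partial_success_min_turns:
                return True                   # partial success accepted
    return False                              # all attempts fell short
```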

Comparison with SDK Duplex

Both Arena and the SDK use the same underlying runtime for duplex streaming:

Aspect               Arena                  SDK
Pipeline builder     Internal               Configurable
Session lifecycle    Managed by executor    Managed by application
State storage        Arena state store      Application-provided
Use case             Automated testing      Production applications

The runtime/streaming package provides shared utilities for both.

Design Decisions

Why Pipeline-First Architecture?

The pipeline runs before session creation because:

  1. Consistency: Same pattern as non-duplex pipelines
  2. Flexibility: Prompt assembly can vary per scenario
  3. Validation: Guardrails apply to all response types
  4. Debugging: Each stage can be inspected independently

Why Lazy Session Creation?

Sessions are created when the first audio arrives because:

  1. Configuration: System prompt comes from pipeline metadata
  2. Resource efficiency: Don’t create sessions that won’t be used
  3. Error handling: Pipeline errors caught before session cost

Why Burst Mode for Pre-recorded Audio?

Provider turn detection can trigger mid-utterance with natural speech pauses. Burst mode sends all audio before any turn detection occurs, preventing “user interrupted” false positives.

See Also