Duplex Configuration Reference

Complete reference for configuring duplex (bidirectional) streaming scenarios in PromptArena.

Overview

Duplex mode enables real-time bidirectional audio streaming for testing voice assistants and conversational AI. When enabled, audio is streamed in chunks and turn boundaries are detected dynamically.

Requires: Gemini Live API (provider type: gemini, model: gemini-2.0-flash-exp or similar)


Scenario Configuration

Enable duplex mode by adding the duplex field to your scenario spec:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: voice-assistant-test
spec:
  id: voice-assistant-test
  task_type: voice-assistant
  streaming: true  # Required for duplex

  duplex:
    timeout: "5m"
    turn_detection:
      mode: asm
    resilience:
      max_retries: 2
      partial_success_min_turns: 2

DuplexConfig

The main duplex configuration object.

| Field | Type | Default | Description |
|---|---|---|---|
| timeout | string | "10m" | Maximum session duration (Go duration format) |
| turn_detection | TurnDetectionConfig | mode: asm | Turn boundary detection settings |
| resilience | DuplexResilienceConfig | See below | Error handling and retry behavior |
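
The timeout field uses Go's duration syntax, which can be unfamiliar outside Go. As a rough illustration, the Python sketch below converts such a string to seconds. `parse_go_duration` is a hypothetical helper written for this page, not part of PromptArena, and it handles only the common unsigned forms such as "5m30s".

```python
import re

# Seconds per unit, for the units Go duration strings use.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# One number followed by one unit; "ms" is listed before "m" and "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_go_duration(text):
    """Convert a string such as "5m30s" to seconds (hypothetical helper)."""
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total
```

Under this reading, the default "10m" is 600 seconds and "5m30s" is 330 seconds. A bare number such as "10" is rejected because it carries no unit.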

Example

duplex:
  timeout: "5m30s"
  turn_detection:
    mode: vad
    vad:
      silence_threshold_ms: 600
      min_speech_ms: 200
  resilience:
    max_retries: 2
    inter_turn_delay_ms: 500

TurnDetectionConfig

Configures how turn boundaries are detected during duplex streaming.

| Field | Type | Default | Description |
|---|---|---|---|
| mode | string | "asm" | Detection mode: "vad" or "asm" |
| vad | VADConfig | - | Voice activity detection settings (when mode is vad) |

Turn Detection Modes

| Mode | Name | Description |
|---|---|---|
| asm | Provider-Native | The provider (Gemini) handles turn detection internally using its automatic speech detection |
| vad | Voice Activity Detection | Client-side VAD with configurable silence thresholds |

ASM Mode (Provider-Native)

duplex:
  turn_detection:
    mode: asm

Best for: Simple tests, trusting provider behavior, less configuration.

How it works: The Gemini Live API automatically detects when the speaker stops talking and triggers a response.

VAD Mode (Client-Side)

duplex:
  turn_detection:
    mode: vad
    vad:
      silence_threshold_ms: 600
      min_speech_ms: 200
      max_turn_duration_s: 60

Best for: Precise control over turn boundaries, testing interruption handling, consistent behavior across providers.


VADConfig

Voice Activity Detection configuration (used when turn_detection.mode is "vad").

| Field | Type | Default | Description |
|---|---|---|---|
| silence_threshold_ms | int | 500 | Silence duration (ms) to trigger turn end |
| min_speech_ms | int | 1000 | Minimum speech duration before silence counts |
| max_turn_duration_s | int | 60 | Force turn end after this duration (seconds) |
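
One way to read the three fields together is as a small state machine over fixed-length audio frames. The sketch below is an illustrative interpretation only, not PromptArena's implementation: the `find_turn_end` function, the 20 ms frame size, and the idea of a per-frame speech flag are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VADConfig:
    # Field names and defaults mirror the VADConfig table.
    silence_threshold_ms: int = 500
    min_speech_ms: int = 1000
    max_turn_duration_s: int = 60


def find_turn_end(frames, cfg, frame_ms=20):
    """Return (end_ms, reason) for the first turn boundary, or None.

    frames is an iterable of booleans; True means the frame contains speech.
    """
    speech_ms = 0    # total speech heard so far in this turn
    silence_ms = 0   # length of the current run of silence
    elapsed_ms = 0
    for is_speech in frames:
        elapsed_ms += frame_ms
        if is_speech:
            speech_ms += frame_ms
            silence_ms = 0
        else:
            silence_ms += frame_ms
        # Hard cap: end the turn regardless of speech activity.
        if elapsed_ms >= cfg.max_turn_duration_s * 1000:
            return elapsed_ms, "max_turn_duration"
        # Silence only counts once enough speech has been heard.
        if speech_ms >= cfg.min_speech_ms and silence_ms >= cfg.silence_threshold_ms:
            return elapsed_ms, "silence"
    return None
```

With `silence_threshold_ms: 600` and `min_speech_ms: 200`, 300 ms of speech followed by silence ends the turn at the 900 ms mark. An utterance of only 100 ms never arms the silence check, so in this sketch that turn would end only at the `max_turn_duration_s` cap.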

Example

duplex:
  turn_detection:
    mode: vad
    vad:
      silence_threshold_ms: 800   # Longer silence for natural speech pauses
      min_speech_ms: 300          # Short utterances still count
      max_turn_duration_s: 30     # Limit long turns

Tuning Guidelines

| Scenario | silence_threshold_ms | min_speech_ms |
|---|---|---|
| Quick responses | 400-500 | 150-200 |
| Natural conversation | 600-800 | 200-300 |
| TTS with pauses | 1000-1500 | 500-800 |
| Slow/deliberate speech | 1200-2000 | 800-1000 |

DuplexResilienceConfig

Error handling and retry behavior for duplex sessions.

| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | int | 0 | Retry attempts for failed turns |
| retry_delay_ms | int | 1000 | Delay between retries (ms) |
| inter_turn_delay_ms | int | 500 | Delay between turns (ms) |
| selfplay_inter_turn_delay_ms | int | 1000 | Delay after self-play turns (ms) |
| partial_success_min_turns | int | 1 | Minimum completed turns for partial success |
| ignore_last_turn_session_end | bool | true | Treat session end on final turn as success |
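
To make the interaction between max_retries and retry_delay_ms concrete, here is a minimal sketch of a retry loop. It assumes that max_retries counts attempts made after the first one and that the delay is fixed rather than growing; `run_turn_with_retries` and the `run_turn` callable are hypothetical names, not PromptArena APIs.

```python
import time


def run_turn_with_retries(run_turn, max_retries=0, retry_delay_ms=1000, sleep=time.sleep):
    """Call run_turn() until it succeeds or the retry budget is spent.

    run_turn is a hypothetical callable that raises when the turn fails.
    Returns the number of attempts used.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            run_turn()
            return attempts
        except Exception:
            if attempts > max_retries:
                raise  # budget exhausted: surface the last failure
            sleep(retry_delay_ms / 1000)  # fixed delay before the next attempt
```

Under these assumptions, `max_retries: 2` allows up to three attempts per turn, and the default of 0 means a failed turn is not retried.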

Example

duplex:
  resilience:
    max_retries: 2
    retry_delay_ms: 2000
    inter_turn_delay_ms: 500
    selfplay_inter_turn_delay_ms: 1500
    partial_success_min_turns: 3
    ignore_last_turn_session_end: true

Partial Success

When partial_success_min_turns is set, sessions that end unexpectedly after completing at least that many turns are treated as successful:

resilience:
  partial_success_min_turns: 2  # Accept if 2+ turns complete

This is useful for exploratory testing where completing all turns isn't critical.

Session End Handling

By default, if the session ends on the final expected turn, it's treated as success:

resilience:
  ignore_last_turn_session_end: true   # Default

Set to false if you need the final turn to complete normally without session termination.
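
The two settings above can be read as a small decision procedure applied when a session stops early. The following sketch is one plausible interpretation, not the actual PromptArena logic: the `classify_session` function, its arguments, and the outcome labels are invented for illustration.

```python
def classify_session(completed_turns, expected_turns, ended_early,
                     partial_success_min_turns=1,
                     ignore_last_turn_session_end=True):
    """Label a duplex session outcome (labels are illustrative only).

    Assumes ended_early is False only when every expected turn completed.
    """
    if not ended_early:
        return "success"
    # The session dropped while turn number completed_turns + 1 was in progress.
    on_final_turn = completed_turns + 1 == expected_turns
    if on_final_turn and ignore_last_turn_session_end:
        return "success"
    if completed_turns >= partial_success_min_turns:
        return "partial_success"
    return "failure"
```

For a five-turn scenario with `partial_success_min_turns: 3`, a session that drops after two completed turns is a failure in this model, while one that drops after three is accepted. A drop during the fifth turn is accepted outright unless `ignore_last_turn_session_end` is false.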


TTSConfig

Text-to-speech configuration for self-play audio generation.

| Field | Type | Required | Description |
|---|---|---|---|
| provider | string | Yes | TTS provider: "openai", "elevenlabs", "cartesia", "mock" |
| voice | string | Yes* | Voice ID for synthesis (*optional for mock with audio_files) |
| audio_files | []string | No | PCM audio files for mock provider (rotated through) |
| sample_rate | int | No | Output sample rate in Hz (default: 24000) |

Example: OpenAI TTS

turns:
  - role: selfplay-user
    persona: curious-customer
    turns: 3
    tts:
      provider: openai
      voice: alloy

Example: Mock TTS with Pre-recorded Audio

turns:
  - role: selfplay-user
    persona: test-persona
    turns: 3
    tts:
      provider: mock
      audio_files:
        - audio/question1.pcm
        - audio/question2.pcm
        - audio/question3.pcm
      sample_rate: 16000  # Match your file sample rate
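
The table describes audio_files as "rotated through". Assuming that means round-robin selection in list order, which this page does not state explicitly, the mapping from self-play turns to files can be sketched as follows. `assign_audio_files` is a hypothetical helper.

```python
from itertools import cycle, islice


def assign_audio_files(audio_files, turns):
    """Pick one file per self-play turn, wrapping around when the list runs out."""
    if not audio_files:
        raise ValueError("audio_files must not be empty")
    return list(islice(cycle(audio_files), turns))
```

With three files and `turns: 3`, each file is used once. If turns were raised to 5 under this assumption, the fourth and fifth turns would reuse the first two files.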

Available OpenAI Voices

| Voice | Description |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, engaging |
| fable | Expressive, dynamic |
| onyx | Deep, authoritative |
| nova | Friendly, conversational |
| shimmer | Clear, professional |

Audio Turn Parts

In duplex scenarios, user turns contain audio parts instead of text:

turns:
  - role: user
    parts:
      - type: audio
        media:
          file_path: audio/greeting.pcm
          mime_type: audio/L16

Audio Requirements

| Parameter | Value | Description |
|---|---|---|
| Format | Raw PCM | No headers (not WAV) |
| Sample Rate | 16000 Hz | Required by Gemini Live API |
| Bit Depth | 16-bit | Signed integer |
| Channels | Mono | Single channel |
| MIME Type | audio/L16 | Linear PCM |
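
Because raw PCM carries no header, these parameters cannot be read back from the file. What can be checked is arithmetic: at 16000 Hz, 16-bit, mono, one second of audio is 16000 × 2 × 1 = 32000 bytes. The sketch below uses that to estimate duration and to catch a WAV file that was renamed rather than converted. Both helper names are hypothetical.

```python
import os

SAMPLE_RATE_HZ = 16000   # from the requirements table
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHANNELS = 1             # mono


def pcm_duration_seconds(path):
    """Estimate the duration of a headerless PCM file from its size alone."""
    size = os.path.getsize(path)
    if size % (BYTES_PER_SAMPLE * CHANNELS) != 0:
        raise ValueError(f"{path}: {size} bytes is not a whole number of 16-bit samples")
    return size / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def looks_like_wav(path):
    """Detect a RIFF/WAVE header, which a raw PCM file must not have."""
    with open(path, "rb") as fh:
        header = fh.read(12)
    return header[:4] == b"RIFF" and header[8:12] == b"WAVE"
```

A duration that is far from what you expect (for example, roughly a third of the real length) usually points to a file recorded at a different sample rate.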

Converting Audio Files

# WAV to PCM
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 output.pcm

# MP3 to PCM
ffmpeg -i input.mp3 -f s16le -ar 16000 -ac 1 output.pcm

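# Inspect the result (raw PCM has no header, so the input format must be given explicitly)
ffprobe -f s16le -sample_rate 16000 -show_format -show_streams output.pcm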

Provider Configuration

Duplex requires a Gemini provider with streaming enabled:

# providers/gemini-live.provider.yaml
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: gemini-live

spec:
  id: gemini-live
  type: gemini
  model: gemini-2.0-flash-exp

  defaults:
    temperature: 0.7
    max_tokens: 1000

  # Gemini-specific configuration
  additional_config:
    audio_enabled: true
    response_modalities:
      - AUDIO   # Returns audio + text transcription

Response Modalities

| Modality | Description |
|---|---|
| AUDIO | Returns audio response with text transcription |
| TEXT | Returns text-only response (no audio) |

Note: Gemini Live API supports only ONE modality at a time. AUDIO mode includes text transcription via outputAudioTranscription.


Complete Scenario Example

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: voice-assistant-comprehensive

spec:
  id: voice-assistant-comprehensive
  task_type: voice-assistant
  description: "Full duplex voice assistant test with self-play"
  streaming: true

  duplex:
    timeout: "5m"
    turn_detection:
      mode: vad
      vad:
        silence_threshold_ms: 800
        min_speech_ms: 250
        max_turn_duration_s: 45
    resilience:
      max_retries: 2
      retry_delay_ms: 2000
      inter_turn_delay_ms: 500
      selfplay_inter_turn_delay_ms: 1200
      partial_success_min_turns: 3
      ignore_last_turn_session_end: true

  turns:
    # Initial audio greeting
    - role: user
      parts:
        - type: audio
          media:
            file_path: audio/greeting.pcm
            mime_type: audio/L16
      assertions:
        - type: content_matches
          params:
            pattern: "(?i)(hello|hi|welcome)"

    # Self-play generates follow-up questions
    - role: selfplay-user
      persona: curious-customer
      turns: 3
      tts:
        provider: openai
        voice: nova
      assertions:
        - type: content_matches
          params:
            pattern: ".{20,}"  # At least 20 chars

  conversation_assertions:
    - type: content_includes_any
      params:
        patterns:
          - "help"
          - "assist"
          - "support"
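
The assertion patterns in this scenario are regular expressions. Assuming content_matches performs an unanchored search over the response text, which this page does not spell out, the Python sketch below shows how the two patterns behave. `content_matches` here is a stand-in written for illustration, not the PromptArena assertion itself.

```python
import re


def content_matches(response_text, pattern):
    """Return True if pattern is found anywhere in response_text."""
    return re.search(pattern, response_text) is not None


GREETING = r"(?i)(hello|hi|welcome)"
MIN_LENGTH = r".{20,}"  # at least 20 consecutive characters on one line
```

Under that assumption, `GREETING` also matches text such as "Nothing to add to this", because "hi" occurs inside "Nothing" and "this". Adding word boundaries, as in `(?i)\b(hello|hi|welcome)\b`, would restrict it to whole words.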

Validation Errors

Common configuration errors and solutions:

| Error | Cause | Solution |
|---|---|---|
| invalid duplex timeout format | Timeout not in Go duration format | Use a format like "5m", "30s", "1h30m" |
| invalid turn detection mode | Mode not vad or asm | Use mode: vad or mode: asm |
| silence_threshold_ms must be non-negative | Negative VAD threshold | Use zero or a positive value |
| tts provider is required | Missing TTS provider | Add provider: openai or similar |
| tts voice is required | Missing voice ID | Add voice: alloy or similar |
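
These checks can be approximated before a run with a small pre-flight script. The sketch below is a hypothetical validator over already-parsed YAML (plain dicts). It reuses the error strings from the table, but the function name, argument shape, and the exact rules are assumptions and may differ from what PromptArena enforces.

```python
import re

# Simplified Go duration check: one or more number+unit pairs, e.g. "1h30m".
_GO_DURATION = re.compile(r"(?:\d+(?:\.\d+)?(?:ns|us|µs|ms|s|m|h))+")
_MODES = {"vad", "asm"}


def validate_duplex(duplex, tts=None):
    """Return a list of error strings for a duplex mapping (hypothetical check)."""
    errors = []
    timeout = duplex.get("timeout", "10m")
    if not _GO_DURATION.fullmatch(str(timeout)):
        errors.append("invalid duplex timeout format")
    detection = duplex.get("turn_detection", {})
    if detection.get("mode", "asm") not in _MODES:
        errors.append("invalid turn detection mode")
    vad = detection.get("vad", {})
    if vad.get("silence_threshold_ms", 500) < 0:
        errors.append("silence_threshold_ms must be non-negative")
    if tts is not None:
        if not tts.get("provider"):
            errors.append("tts provider is required")
        # voice is optional only for the mock provider with audio_files
        mock_with_files = tts.get("provider") == "mock" and tts.get("audio_files")
        if not tts.get("voice") and not mock_with_files:
            errors.append("tts voice is required")
    return errors
```

An empty list means none of the five errors in the table would be expected under these assumptions.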

See Also