Tutorial 7: Audio Sessions

Build voice-enabled conversations with real-time audio streaming.

What You’ll Learn

Prerequisites

Two Modes for Voice

PromptKit supports two modes for voice conversations:

| Mode | Description | Use Case |
|------|-------------|----------|
| VAD | Voice Activity Detection pipeline | Standard LLMs (GPT-4, Claude) with separate STT/TTS |
| ASM | Audio Streaming Model | Native multimodal LLMs (Gemini Live) |

VAD Mode: Voice with Text LLMs

VAD mode enables voice conversations with any text-based LLM by adding STT (speech-to-text) and TTS (text-to-speech) stages to the pipeline.

Pipeline Flow

Audio Input → [AudioTurn] → [STT] → [LLM] → [TTS] → Audio Output

Step 1: Create the Pack

Create voice-assistant.pack.json:

{
  "id": "voice-assistant",
  "name": "Voice Assistant",
  "version": "1.0.0",
  "template_engine": {
    "version": "v1",
    "syntax": "{{variable}}"
  },
  "prompts": {
    "assistant": {
      "id": "assistant",
      "name": "Voice Assistant",
      "version": "1.0.0",
      "system_template": "You are a helpful voice assistant. Keep responses concise and natural for spoken conversation. The user's name is {{user_name}}.",
      "parameters": {
        "temperature": 0.7,
        "max_tokens": 150
      }
    }
  }
}

Step 2: Configure VAD Mode

package main

import (
    "context"
    "log"
    "os"

    "github.com/AltairaLabs/PromptKit/sdk"
    "github.com/AltairaLabs/PromptKit/runtime/stt"
    "github.com/AltairaLabs/PromptKit/runtime/tts"
)

func main() {
    // Create STT and TTS services
    sttService := stt.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
    ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))

    // Open duplex conversation with VAD mode
    conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
        sdk.WithVADMode(sttService, ttsService, sdk.DefaultVADModeConfig()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conv.Close()

    conv.SetVar("user_name", "Alice")

    // Start audio processing
    ctx := context.Background()
    audioIn, audioOut, err := conv.StartAudio(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Feed audio from the microphone and play the output
    // (see Handling Audio Streams below and the complete example)
    _, _ = audioIn, audioOut // placeholder so the snippet compiles
}

Step 3: Customize VAD Configuration

// Custom VAD settings for different environments
vadConfig := &sdk.VADModeConfig{
    SilenceDuration:   500 * time.Millisecond, // Shorter pause detection
    MinSpeechDuration: 100 * time.Millisecond, // Faster response
    MaxTurnDuration:   20 * time.Second,       // Shorter max turn
    SampleRate:        16000,                  // 16kHz audio
    Language:          "en",                   // English
    Voice:             "nova",                 // TTS voice
    Speed:             1.1,                    // Slightly faster speech
}

conv, _ := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithVADMode(sttService, ttsService, vadConfig),
)

VAD Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| SilenceDuration | 800ms | Silence required to detect turn end |
| MinSpeechDuration | 200ms | Minimum speech before a turn can complete |
| MaxTurnDuration | 30s | Maximum turn length |
| SampleRate | 16000 | Audio sample rate in Hz |
| Language | "en" | Language hint for STT |
| Voice | "alloy" | TTS voice ID |
| Speed | 1.0 | TTS speech rate (0.5-2.0) |
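
For noisy environments, go the other way from Step 3: a longer silence threshold and minimum speech duration make it less likely that background noise ends or starts a turn. The values below are illustrative, not recommendations:

// Tuned for a noisy room: demand more silence before ending a turn
// and more sustained speech before one can begin.
noisyConfig := &sdk.VADModeConfig{
    SilenceDuration:   1200 * time.Millisecond,
    MinSpeechDuration: 300 * time.Millisecond,
    MaxTurnDuration:   30 * time.Second,
    SampleRate:        16000,
    Language:          "en",
    Voice:             "alloy",
    Speed:             1.0,
}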

ASM Mode: Native Audio LLMs

ASM (Audio Streaming Model) mode is for LLMs with native bidirectional audio support, like Gemini Live API. Audio streams directly to and from the model without separate STT/TTS stages.

Pipeline Flow

Audio/Text Input → [DuplexProvider] → Audio/Text Output

Step 1: Configure ASM Mode

package main

import (
    "context"
    "log"
    "os"

    "github.com/AltairaLabs/PromptKit/sdk"
    "github.com/AltairaLabs/PromptKit/runtime/providers"
    "github.com/AltairaLabs/PromptKit/runtime/providers/gemini"
    "github.com/AltairaLabs/PromptKit/runtime/types"
)

func main() {
    ctx := context.Background()

    // Create Gemini streaming session
    session, err := gemini.NewStreamSession(ctx,
        "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent",
        os.Getenv("GEMINI_API_KEY"),
        &providers.StreamingInputConfig{
            Type:       types.ContentTypeAudio,
            SampleRate: 16000,
            Channels:   1,
        },
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = session // placeholder so the snippet compiles; wire the session into the conversation in real code

    // Open duplex conversation with ASM mode
    conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
        sdk.WithStreamingConfig(&providers.StreamingInputConfig{
            Type:       types.ContentTypeAudio,
            SampleRate: 16000,
            Channels:   1,
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conv.Close()

    // Start streaming
    audioIn, audioOut, err := conv.StartAudio(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Stream audio bidirectionally (see Handling Audio Streams below)
    _, _ = audioIn, audioOut // placeholder so the snippet compiles
}

Handling Audio Streams

Sending Audio

// Send audio chunks from microphone
go func() {
    for {
        chunk := readFromMicrophone() // Your audio capture code
        select {
        case audioIn <- chunk:
        case <-ctx.Done():
            return
        }
    }
}()
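
If you need a concrete capture loop, one option is the github.com/gordonklaus/portaudio bindings. The sketch below assumes audioIn carries raw 16kHz mono 16-bit PCM as []byte; check the SDK's actual chunk type before adapting it:

import (
    "context"
    "encoding/binary"

    "github.com/gordonklaus/portaudio"
)

// captureAudio reads 16-bit mono PCM from the default input device and
// forwards it as little-endian bytes. Assumes audioIn accepts raw PCM.
func captureAudio(ctx context.Context, audioIn chan<- []byte) error {
    if err := portaudio.Initialize(); err != nil {
        return err
    }
    defer portaudio.Terminate()

    buf := make([]int16, 512) // small buffer keeps latency low
    stream, err := portaudio.OpenDefaultStream(1, 0, 16000, len(buf), buf)
    if err != nil {
        return err
    }
    defer stream.Close()

    if err := stream.Start(); err != nil {
        return err
    }
    defer stream.Stop()

    for {
        if err := stream.Read(); err != nil { // blocking read into buf
            return err
        }
        chunk := make([]byte, len(buf)*2)
        for i, s := range buf {
            binary.LittleEndian.PutUint16(chunk[i*2:], uint16(s))
        }
        select {
        case audioIn <- chunk:
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}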

Receiving Audio

// Play audio output
go func() {
    for chunk := range audioOut {
        if chunk.Error != nil {
            log.Printf("Audio error: %v", chunk.Error)
            continue
        }
        playAudio(chunk.Data) // Your audio playback code
    }
}()
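
Playback is the mirror image. A sketch with the same portaudio bindings and imports as the capture sketch, again assuming chunk.Data is 16kHz mono 16-bit little-endian PCM (a real player would keep one output stream open instead of opening one per chunk):

// playAudio writes one chunk of 16-bit PCM to the default output device.
func playAudio(data []byte) {
    buf := make([]int16, len(data)/2)
    for i := range buf {
        buf[i] = int16(binary.LittleEndian.Uint16(data[i*2:]))
    }
    stream, err := portaudio.OpenDefaultStream(0, 1, 16000, len(buf), buf)
    if err != nil {
        log.Printf("open output: %v", err)
        return
    }
    defer stream.Close()
    if err := stream.Start(); err != nil {
        log.Printf("start output: %v", err)
        return
    }
    defer stream.Stop()
    if err := stream.Write(); err != nil { // writes the buffer contents
        log.Printf("write audio: %v", err)
    }
}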

Turn Detection

VAD mode uses turn detection to determine when the user has finished speaking.

Default Behavior

The default turn detector uses silence duration: once the user has spoken for at least MinSpeechDuration (200ms by default), a pause of SilenceDuration (800ms by default) ends the turn, and MaxTurnDuration (30s by default) caps the turn length regardless of silence.

Custom Turn Detector

import "github.com/AltairaLabs/PromptKit/runtime/audio"

// Create custom turn detector
detector := audio.NewSilenceDetector(audio.SilenceConfig{
    SilenceThreshold:  500 * time.Millisecond,
    MinSpeechDuration: 100 * time.Millisecond,
})

conv, _ := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithTurnDetector(detector),
    sdk.WithVADMode(sttService, ttsService, nil),
)

Interruption Handling

Users can interrupt the assistant while it’s speaking (barge-in).

The TTSStageWithInterruption stage handles this automatically. When speech is detected during TTS output:

  1. TTS playback stops
  2. The new user speech is processed
  3. The assistant responds to the interruption

Complete Example: Voice Interview

See the sdk/examples/voice-interview/ directory for a complete working example, including the audio capture and playback code omitted above.

Choosing Between VAD and ASM

| Consideration | VAD Mode | ASM Mode |
|---------------|----------|----------|
| LLM Support | Any text LLM | Gemini Live only |
| Latency | Higher (STT + TTS overhead) | Lower (native audio) |
| Flexibility | More control over STT/TTS | Less customization |
| Cost | Separate STT/TTS costs | Single API cost |
| Quality | Depends on STT/TTS providers | Native quality |
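
In code, the choice comes down to a single option passed to sdk.OpenDuplex; both calls below repeat configuration shown earlier in this tutorial:

// VAD mode: any text LLM, with explicit STT/TTS services.
conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithVADMode(sttService, ttsService, sdk.DefaultVADModeConfig()),
)

// ASM mode: audio streams natively to and from the model.
conv, err = sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithStreamingConfig(&providers.StreamingInputConfig{
        Type:       types.ContentTypeAudio,
        SampleRate: 16000,
        Channels:   1,
    }),
)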

Best Practices

  1. Sample Rate: Use 16kHz for best compatibility
  2. Buffer Size: Keep audio buffers small for low latency
  3. Error Handling: Always handle audio errors gracefully (see the sketch after this list)
  4. Cleanup: Close conversations properly to release resources
  5. Testing: Test with various audio inputs and network conditions
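
A minimal sketch of points 3 and 4, using only calls shown earlier in this tutorial (playAudio is the placeholder from the receiving example):

ctx, cancel := context.WithCancel(context.Background())
defer cancel() // stops capture and playback goroutines

conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithVADMode(sttService, ttsService, sdk.DefaultVADModeConfig()),
)
if err != nil {
    log.Fatal(err)
}
defer conv.Close() // always release pipeline resources

audioIn, audioOut, err := conv.StartAudio(ctx)
if err != nil {
    log.Fatal(err)
}

go func() {
    for chunk := range audioOut {
        if chunk.Error != nil {
            log.Printf("audio error: %v", chunk.Error) // log and continue; don't tear down the session
            continue
        }
        playAudio(chunk.Data)
    }
}()
_ = audioIn // feed microphone audio here, as in the sending example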

What’s Next

See Also