
Tutorial 7: Audio Sessions

Build voice-enabled conversations with real-time audio streaming. This tutorial covers:

  • Two modes for voice conversations: VAD and ASM
  • Setting up VAD mode with STT and TTS services
  • Setting up ASM mode for native audio LLMs
  • Handling audio input and output streams
  • Turn detection and interruption handling

PromptKit supports two modes for voice conversations:

Mode  Description                         Use Case
VAD   Voice Activity Detection pipeline   Standard LLMs (GPT-4, Claude) with separate STT/TTS
ASM   Audio Streaming Model               Native multimodal LLMs (Gemini Live)

VAD mode enables voice conversations with any text-based LLM by adding STT (speech-to-text) and TTS (text-to-speech) stages to the pipeline.

Audio Input → [AudioTurn] → [STT] → [LLM] → [TTS] → Audio Output

Create voice-assistant.pack.json:

{
  "id": "voice-assistant",
  "name": "Voice Assistant",
  "version": "1.0.0",
  "template_engine": {
    "version": "v1",
    "syntax": "{{variable}}"
  },
  "prompts": {
    "assistant": {
      "id": "assistant",
      "name": "Voice Assistant",
      "version": "1.0.0",
      "system_template": "You are a helpful voice assistant. Keep responses concise and natural for spoken conversation. The user's name is {{user_name}}.",
      "parameters": {
        "temperature": 0.7,
        "max_tokens": 150
      }
    }
  }
}
Then wire it up in Go:

package main

import (
    "context"
    "log"
    "os"

    "github.com/AltairaLabs/PromptKit/sdk"
    "github.com/AltairaLabs/PromptKit/runtime/stt"
    "github.com/AltairaLabs/PromptKit/runtime/tts"
)

func main() {
    // Create STT and TTS services
    sttService := stt.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
    ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))

    // Open duplex conversation with VAD mode
    conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
        sdk.WithVADMode(sttService, ttsService, sdk.DefaultVADModeConfig()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conv.Close()

    conv.SetVar("user_name", "Alice")

    // Start audio processing
    ctx := context.Background()
    audioIn, audioOut, err := conv.StartAudio(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Feed audio from microphone into audioIn and play audioOut
    // (see the sketch below and the complete example for audio I/O)
    _, _ = audioIn, audioOut // silence "declared and not used" until wired up
}
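Before wiring a real microphone, you can smoke-test the pipeline by feeding a pre-recorded raw PCM file into audioIn. A minimal sketch, assuming the audioIn channel carries raw PCM16 byte slices (check the channel's element type in the SDK; this is an assumption) and a hypothetical test recording sample-16khz.raw:

// Hypothetical test input: headerless 16 kHz mono PCM16.
f, err := os.Open("sample-16khz.raw")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

buf := make([]byte, 640) // 20 ms of 16 kHz mono PCM16 per chunk
for {
    n, readErr := f.Read(buf)
    if n > 0 {
        // Copy before sending: the SDK may hold the chunk after buf is reused.
        audioIn <- append([]byte(nil), buf[:n]...)
    }
    if readErr != nil {
        break // io.EOF or a real error; stop feeding either way
    }
}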
// Custom VAD settings for different environments
vadConfig := &sdk.VADModeConfig{
    SilenceDuration:   500 * time.Millisecond, // shorter pause detection
    MinSpeechDuration: 100 * time.Millisecond, // faster response
    MaxTurnDuration:   20 * time.Second,       // shorter max turn
    SampleRate:        16000,                  // 16 kHz audio
    Language:          "en",                   // English
    Voice:             "nova",                 // TTS voice
    Speed:             1.1,                    // slightly faster speech
}

conv, _ := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithVADMode(sttService, ttsService, vadConfig),
)
Option             Default  Description
SilenceDuration    800ms    Silence required to detect turn end
MinSpeechDuration  200ms    Minimum speech before a turn can complete
MaxTurnDuration    30s      Maximum turn length
SampleRate         16000    Audio sample rate in Hz
Language           "en"     Language hint for STT
Voice              "alloy"  TTS voice ID
Speed              1.0      TTS speech rate (0.5-2.0)
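If you only need to change one or two options, it may be simpler to start from the defaults and override individual fields. A sketch, assuming DefaultVADModeConfig() returns a mutable *sdk.VADModeConfig (consistent with how it is passed above):

cfg := sdk.DefaultVADModeConfig()
cfg.Voice = "nova" // keep all defaults, swap only the TTS voice

conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
    sdk.WithVADMode(sttService, ttsService, cfg),
)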

ASM (Audio Streaming Model) mode is for LLMs with native bidirectional audio support, like Gemini Live API. Audio streams directly to and from the model without separate STT/TTS stages.

Audio/Text Input → [DuplexProvider] → Audio/Text Output

Set up the session and conversation:

package main

import (
    "context"
    "log"
    "os"

    "github.com/AltairaLabs/PromptKit/sdk"
    "github.com/AltairaLabs/PromptKit/runtime/providers"
    "github.com/AltairaLabs/PromptKit/runtime/providers/gemini"
    "github.com/AltairaLabs/PromptKit/runtime/types"
)

func main() {
    ctx := context.Background()

    // Create Gemini streaming session
    session, err := gemini.NewStreamSession(ctx,
        "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent",
        os.Getenv("GEMINI_API_KEY"),
        &providers.StreamingInputConfig{
            Type:       types.ContentTypeAudio,
            SampleRate: 16000,
            Channels:   1,
        },
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = session // wired into the conversation in the complete example

    // Open duplex conversation with ASM mode
    conv, err := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
        sdk.WithStreamingConfig(&providers.StreamingInputConfig{
            Type:       types.ContentTypeAudio,
            SampleRate: 16000,
            Channels:   1,
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conv.Close()

    // Start streaming
    audioIn, audioOut, err := conv.StartAudio(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Stream audio bidirectionally using the goroutines shown below
    _, _ = audioIn, audioOut
}
// Send audio chunks from microphone
go func() {
    for {
        chunk := readFromMicrophone() // your audio capture code
        select {
        case audioIn <- chunk:
        case <-ctx.Done():
            return
        }
    }
}()

// Play audio output
go func() {
    for chunk := range audioOut {
        if chunk.Error != nil {
            log.Printf("Audio error: %v", chunk.Error)
            continue
        }
        playAudio(chunk.Data) // your audio playback code
    }
}()
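
Both loops run until the context is cancelled, so a clean shutdown is cancellation plus waiting for playback to drain. A sketch using only standard context and sync primitives; the assumption that conv.Close() ends the session and closes audioOut is ours, not stated above:

// Create a cancellable context before StartAudio so capture can be stopped.
ctx, cancel := context.WithCancel(context.Background())
audioIn, audioOut, err := conv.StartAudio(ctx)
if err != nil {
    log.Fatal(err)
}
_ = audioIn // fed by the microphone goroutine shown above

var wg sync.WaitGroup
wg.Add(1)
go func() {
    defer wg.Done()
    for chunk := range audioOut { // exits once audioOut is closed
        if chunk.Error != nil {
            log.Printf("Audio error: %v", chunk.Error)
            continue
        }
        playAudio(chunk.Data)
    }
}()

// ... conversation runs ...

cancel()     // stops the microphone goroutine via ctx.Done()
conv.Close() // assumption: ends the session and closes audioOut
wg.Wait()    // let buffered output finish playing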

VAD mode uses turn detection to determine when the user has finished speaking.

The default turn detector uses silence duration:

  • Waits for SilenceDuration of quiet
  • Requires at least MinSpeechDuration of speech first
  • Forces completion after MaxTurnDuration
import "github.com/AltairaLabs/PromptKit/runtime/audio"
// Create custom turn detector
detector := audio.NewSilenceDetector(audio.SilenceConfig{
SilenceThreshold: 500 * time.Millisecond,
MinSpeechDuration: 100 * time.Millisecond,
})
conv, _ := sdk.OpenDuplex("./voice-assistant.pack.json", "assistant",
sdk.WithTurnDetector(detector),
sdk.WithVADMode(sttService, ttsService, nil),
)

Users can interrupt the assistant while it’s speaking (barge-in).

// The TTSStageWithInterruption stage handles this automatically
// When speech is detected during TTS output:
// 1. TTS playback stops
// 2. New user speech is processed
// 3. Assistant responds to the interruption
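
The pipeline side is handled for you, but your playback code still holds audio it has already received. One way to honor a barge-in locally is to flush that buffer when new speech starts. A sketch, where player (with Play/Flush) and the interrupted signal are hypothetical stand-ins for your own audio stack, not SDK APIs:

// interrupted fires when your mic-level VAD hears the user start speaking.
go func() {
    for {
        select {
        case <-interrupted:
            player.Flush() // drop queued TTS audio so the assistant stops mid-sentence
        case chunk, ok := <-audioOut:
            if !ok {
                return
            }
            if chunk.Error != nil {
                continue
            }
            player.Play(chunk.Data)
        }
    }
}()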

See the sdk/examples/voice-interview/ directory for a complete working example that includes:

  • Audio capture from microphone
  • Real-time speech processing
  • TTS audio playback
  • Turn management
  • Interruption handling
Consideration  VAD Mode                      ASM Mode
LLM Support    Any text LLM                  Gemini Live only
Latency        Higher (STT + TTS overhead)   Lower (native audio)
Flexibility    More control over STT/TTS     Less customization
Cost           Separate STT/TTS costs        Single API cost
Quality        Depends on STT/TTS providers  Native quality
  1. Sample Rate: Use 16kHz for best compatibility
  2. Buffer Size: Keep audio buffers small for low latency (see the sizing sketch below)
  3. Error Handling: Always handle audio errors gracefully
  4. Cleanup: Close conversations properly to release resources
  5. Testing: Test with various audio inputs and network conditions
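
Buffer sizing follows directly from the sample rate. A quick sketch of the arithmetic for 16 kHz mono PCM16; the 20 ms frame length is a common low-latency choice, not an SDK requirement:

const (
    sampleRate     = 16000                 // samples per second (16 kHz)
    bytesPerSample = 2                     // PCM16 mono
    frameDuration  = 20 * time.Millisecond // common low-latency frame length
)

// 16000 samples/s * 0.020 s = 320 samples; 320 * 2 bytes = 640 bytes per frame.
samplesPerFrame := int(sampleRate * frameDuration / time.Second)
bufSize := samplesPerFrame * bytesPerSample // 640
buf := make([]byte, bufSize)                // reuse this buffer in your capture loop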