Voice Activity Detection (VAD) Demo

This example demonstrates Voice Activity Detection using PromptKit’s audio package.

Features

SimpleVAD: Basic voice activity detection using RMS energy analysis
State Tracking: Monitor transitions between quiet/starting/speaking/stopping
Configurable Parameters: Tune sensitivity for different environments
Event Notifications: React to state changes in real-time

Running

cd sdk/examples/vad-demo
go run .

This example runs with simulated audio data - no microphone required.

VAD States

State	Description
`quiet`	No voice activity detected
`starting`	Voice beginning (within start threshold)
`speaking`	Active speech detected
`stopping`	Voice ending (within stop threshold)

Configuration

Default Parameters

params := audio.DefaultVADParams()
// Confidence: 0.5
// StartSecs: 0.2
// StopSecs: 0.8
// MinVolume: 0.01
// SampleRate: 16000

Strict VAD (noisy environments)

params := audio.VADParams{
    Confidence: 0.7,   // Higher confidence required
    StartSecs:  0.3,   // Longer speech to trigger
    StopSecs:   1.2,   // Allow longer pauses
    MinVolume:  0.02,  // Higher volume threshold
    SampleRate: 16000,
}

Sensitive VAD (quiet environments)

params := audio.VADParams{
    Confidence: 0.3,   // More sensitive
    StartSecs:  0.1,   // Quick start detection
    StopSecs:   0.5,   // Quick end detection
    MinVolume:  0.005, // Detect quiet speech
    SampleRate: 16000,
}

State Change Events

vad, _ := audio.NewSimpleVAD(audio.DefaultVADParams())
stateChanges := vad.OnStateChange()

go func() {
    for event := range stateChanges {
        fmt.Printf("State: %s -> %s (confidence: %.2f)\n",
            event.PrevState, event.State, event.Confidence)
    }
}()

Integration with SDK

VAD is typically used with audio sessions:

conv, _ := sdk.Open("./pack.json", "assistant")

// Create audio session with VAD
session, _ := conv.OpenAudioSession(ctx,
    sdk.WithSessionVAD(audio.NewSimpleVAD(audio.DefaultVADParams())),
)

// VAD automatically processes audio chunks
session.SendChunk(ctx, audioChunk)

Notes

VAD is energy-based (RMS volume analysis)
Works with 16-bit PCM audio at configurable sample rates
Default sample rate is 16kHz (common for speech recognition)
Transition thresholds prevent false positives from brief sounds