Tutorial 8: TTS Integration

Add text-to-speech capabilities to your conversations.

What You’ll Learn

Setting up TTS providers (OpenAI, ElevenLabs, Cartesia)
Configuring voice, speed, and audio formats
Streaming vs single-shot synthesis
Integrating TTS with conversations

Prerequisites

Completed Tutorial 7: Audio Sessions
API key for a TTS provider

TTS Service Interface

All TTS providers implement the same interface:

type Service interface {
    Name() string
    Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
    SupportedVoices() []Voice
    SupportedFormats() []AudioFormat
}
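Because every provider satisfies the same interface, application code can depend on the interface alone and be exercised with a stub. The sketch below is one way to do that. It declares trimmed local stand-ins for `Service` and `SynthesisConfig` so it compiles without PromptKit; `fakeTTS` and `speak` are hypothetical names, not part of the library.

```go
package main

import (
	"context"
	"io"
	"strings"
)

// SynthesisConfig is a trimmed local stand-in for tts.SynthesisConfig.
type SynthesisConfig struct {
	Voice string
}

// Service is a trimmed local stand-in for the tts.Service interface.
type Service interface {
	Name() string
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
}

// fakeTTS (hypothetical) returns the input text as its "audio",
// which keeps tests deterministic and offline.
type fakeTTS struct{}

func (fakeTTS) Name() string { return "fake" }

func (fakeTTS) Synthesize(_ context.Context, text string, _ SynthesisConfig) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(text)), nil
}

// speak depends only on the interface, so any provider can be passed in.
func speak(ctx context.Context, svc Service, text string) ([]byte, error) {
	r, err := svc.Synthesize(ctx, text, SynthesisConfig{Voice: "alloy"})
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}
```

In application code, the same `speak` call would receive a real provider value instead of `fakeTTS{}`.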

Setting Up TTS Providers

OpenAI TTS

import "github.com/AltairaLabs/PromptKit/runtime/tts"

// Create OpenAI TTS service
ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))

// Available voices: alloy, echo, fable, onyx, nova, shimmer
// Available models: tts-1 (fast), tts-1-hd (high quality)

ElevenLabs TTS

import "github.com/AltairaLabs/PromptKit/runtime/tts"

// Create ElevenLabs TTS service
ttsService := tts.NewElevenLabs(os.Getenv("ELEVENLABS_API_KEY"))

// Wide variety of voices available
// Check SupportedVoices() for options

Cartesia TTS

import "github.com/AltairaLabs/PromptKit/runtime/tts"

// Create Cartesia TTS service
ttsService := tts.NewCartesia(os.Getenv("CARTESIA_API_KEY"))

// Supports interactive streaming mode for low latency

Synthesis Configuration

config := tts.SynthesisConfig{
    Voice:    "nova",      // Voice ID
    Format:   tts.FormatMP3, // Output format
    Speed:    1.0,         // Speech rate (0.25-4.0)
    Pitch:    0,           // Pitch adjustment (-20 to 20)
    Language: "en-US",     // Language code
    Model:    "tts-1-hd",  // Model (provider-specific)
}
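Checking the ranges locally avoids a round trip to the provider for an obviously invalid request. The sketch below validates the speed and pitch ranges quoted in the comments above. It uses a local stand-in struct, and the `float64` field types are an assumption, since the real field types are not shown here; `validate` is a hypothetical helper.

```go
package main

import "fmt"

// SynthesisConfig is a local stand-in with a subset of the fields shown
// above. The float64 types are assumed for this sketch.
type SynthesisConfig struct {
	Voice string
	Speed float64
	Pitch float64
}

// validate checks the ranges quoted in the tutorial:
// speed 0.25 to 4.0 and pitch -20 to 20.
func validate(c SynthesisConfig) error {
	if c.Voice == "" {
		return fmt.Errorf("voice must not be empty")
	}
	if c.Speed < 0.25 || c.Speed > 4.0 {
		return fmt.Errorf("speed %.2f is outside 0.25-4.0", c.Speed)
	}
	if c.Pitch < -20 || c.Pitch > 20 {
		return fmt.Errorf("pitch %.1f is outside -20 to 20", c.Pitch)
	}
	return nil
}
```

Individual providers may accept narrower ranges, so treat a passing check as necessary rather than sufficient.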

Available Formats

Format	Constant	Use Case
MP3	`tts.FormatMP3`	Most compatible
Opus	`tts.FormatOpus`	Best for streaming
AAC	`tts.FormatAAC`	Apple devices
FLAC	`tts.FormatFLAC`	Lossless quality
PCM	`tts.FormatPCM16`	Raw audio processing
WAV	`tts.FormatWAV`	PCM with header
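The table describes WAV as PCM with a header. To illustrate what that means, the sketch below wraps raw 16-bit little-endian PCM in the canonical 44-byte RIFF/WAVE header. `wrapPCM16AsWAV` is a hypothetical helper, not part of PromptKit, and the sample rate and channel count are inputs you must know in advance: check your provider's documentation for the PCM layout it returns.

```go
package main

import "encoding/binary"

// wrapPCM16AsWAV prepends a 44-byte RIFF/WAVE header to raw
// little-endian 16-bit PCM samples and returns the combined bytes.
func wrapPCM16AsWAV(pcm []byte, sampleRate, channels int) []byte {
	const bitsPerSample = 16
	le := binary.LittleEndian
	blockAlign := channels * bitsPerSample / 8 // bytes per sample frame
	byteRate := sampleRate * blockAlign        // bytes per second

	h := make([]byte, 44, 44+len(pcm))
	copy(h[0:], "RIFF")
	le.PutUint32(h[4:], uint32(36+len(pcm))) // size of everything after this field
	copy(h[8:], "WAVE")
	copy(h[12:], "fmt ")
	le.PutUint32(h[16:], 16) // size of the fmt chunk body
	le.PutUint16(h[20:], 1)  // audio format 1 = uncompressed PCM
	le.PutUint16(h[22:], uint16(channels))
	le.PutUint32(h[24:], uint32(sampleRate))
	le.PutUint32(h[28:], uint32(byteRate))
	le.PutUint16(h[32:], uint16(blockAlign))
	le.PutUint16(h[34:], bitsPerSample)
	copy(h[36:], "data")
	le.PutUint32(h[40:], uint32(len(pcm)))
	return append(h, pcm...)
}
```

For example, `wrapPCM16AsWAV(pcm, 24000, 1)` would describe mono audio at 24 kHz; both values are assumptions here.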

Basic Synthesis

Single-Shot Synthesis

ctx := context.Background()

config := tts.SynthesisConfig{
    Voice:  "alloy",
    Format: tts.FormatMP3,
    Speed:  1.0,
}

// Synthesize text to audio
reader, err := ttsService.Synthesize(ctx, "Hello, how can I help you?", config)
if err != nil {
    log.Fatal(err)
}
defer reader.Close()

// Read audio data
audioData, err := io.ReadAll(reader)
if err != nil {
    log.Fatal(err)
}
// Play or save audioData...

Streaming Synthesis

For lower latency, use streaming synthesis (if supported):

// Check if provider supports streaming
streamingService, ok := ttsService.(tts.StreamingService)
if !ok {
    log.Fatal("Provider doesn't support streaming")
}

// Start streaming synthesis
chunks, err := streamingService.SynthesizeStream(ctx, "Hello, how can I help you?", config)
if err != nil {
    log.Fatal(err)
}

// Process chunks as they arrive
for chunk := range chunks {
    if chunk.Error != nil {
        log.Printf("Error: %v", chunk.Error)
        break
    }
    playAudioChunk(chunk.Data)
    if chunk.Final {
        break
    }
}
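The loop above can be factored into a reusable function that is easy to test without a provider. The sketch below assumes a chunk type with the three fields used in the loop (`Data`, `Error`, `Final`) and declares a local stand-in for it. `playStream` and the `play` callback are hypothetical names.

```go
package main

import "fmt"

// AudioChunk is a local stand-in with the fields used in the loop above.
type AudioChunk struct {
	Data  []byte
	Error error
	Final bool
}

// playStream hands each chunk to play as soon as it arrives and returns
// the number of bytes handed over. It stops on the first chunk error,
// on a playback error, after the chunk marked Final, or when the
// channel is closed.
func playStream(chunks <-chan AudioChunk, play func([]byte) error) (int, error) {
	total := 0
	for chunk := range chunks {
		if chunk.Error != nil {
			return total, fmt.Errorf("synthesis stream: %w", chunk.Error)
		}
		if err := play(chunk.Data); err != nil {
			return total, fmt.Errorf("playback: %w", err)
		}
		total += len(chunk.Data)
		if chunk.Final {
			break
		}
	}
	return total, nil
}
```

Passing the player as a callback keeps the streaming logic independent of any particular audio output library.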

Integrating with Conversations

VAD Mode with TTS

sttService := stt.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))

// Configure VAD mode with custom TTS settings
vadConfig := &sdk.VADModeConfig{
    Voice: "nova",  // Use Nova voice
    Speed: 1.1,     // Slightly faster
}

conv, err := sdk.OpenDuplex("./assistant.pack.json", "voice",
    sdk.WithVADMode(sttService, ttsService, vadConfig),
)
if err != nil {
    log.Fatal(err)
}

Manual TTS in Text Mode

// Open text conversation
conv, err := sdk.Open("./assistant.pack.json", "chat")
if err != nil {
    log.Fatal(err)
}

// Send message and get response
resp, err := conv.Send(ctx, "Tell me a joke")
if err != nil {
    log.Fatal(err)
}

// Synthesize the response
ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
reader, err := ttsService.Synthesize(ctx, resp.Text(), tts.DefaultSynthesisConfig())
if err != nil {
    log.Fatal(err) // reader is nil on error and must not be closed
}
defer reader.Close()

// Play the audio
audioData, err := io.ReadAll(reader)
if err != nil {
    log.Fatal(err)
}
playAudio(audioData)
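Model responses can run to several paragraphs, and synthesizing them in one call delays the first audio. One possible approach is to split the response into sentences and synthesize them one at a time. The splitter below is a simple heuristic sketch: `splitSentences` is a hypothetical helper, and it will mis-split abbreviations such as "e.g. this".

```go
package main

import (
	"strings"
	"unicode"
)

// splitSentences breaks text after '.', '!' or '?' when the mark is
// followed by whitespace or ends the input. Decimal points such as the
// one in "3.5" are left alone because no whitespace follows them.
func splitSentences(text string) []string {
	var out []string
	runes := []rune(text)
	start := 0
	for i, r := range runes {
		if r != '.' && r != '!' && r != '?' {
			continue
		}
		atEnd := i == len(runes)-1
		if atEnd || unicode.IsSpace(runes[i+1]) {
			if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
				out = append(out, s)
			}
			start = i + 1
		}
	}
	// Keep any trailing text that has no closing punctuation.
	if s := strings.TrimSpace(string(runes[start:])); s != "" {
		out = append(out, s)
	}
	return out
}
```

Each returned sentence can then be passed to `Synthesize` in order, so playback of the first sentence can begin while later ones are still being generated.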

Voice Selection

Listing Available Voices

voices := ttsService.SupportedVoices()
for _, voice := range voices {
    fmt.Printf("%s: %s (%s, %s)\n",
        voice.ID,
        voice.Name,
        voice.Language,
        voice.Gender,
    )
}
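When a provider offers many voices, filtering the list by language narrows the choice. The sketch below assumes `Language` holds a tag such as "en-US" and uses a local stand-in for the voice type with the four fields printed above. `voicesForLanguage` is a hypothetical helper.

```go
package main

import "strings"

// Voice is a local stand-in with the four fields printed above.
// Plain string fields are assumed for this sketch.
type Voice struct {
	ID, Name, Language, Gender string
}

// voicesForLanguage returns the voices whose Language starts with prefix,
// compared case-insensitively, so "en" matches both "en-US" and "en-GB".
func voicesForLanguage(voices []Voice, prefix string) []Voice {
	var out []Voice
	p := strings.ToLower(prefix)
	for _, v := range voices {
		if strings.HasPrefix(strings.ToLower(v.Language), p) {
			out = append(out, v)
		}
	}
	return out
}
```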

Voice Characteristics (OpenAI)

Voice	Character
alloy	Neutral, versatile
echo	Warm, smooth
fable	Expressive, British
onyx	Deep, authoritative
nova	Friendly, youthful
shimmer	Clear, professional

Error Handling

reader, err := ttsService.Synthesize(ctx, text, config)
if err != nil {
    switch {
    case errors.Is(err, tts.ErrInvalidVoice):
        log.Printf("Voice '%s' not supported", config.Voice)
    case errors.Is(err, tts.ErrRateLimited):
        log.Printf("Rate limited, retrying...")
        time.Sleep(time.Second)
        // Retry...
    case errors.Is(err, tts.ErrTextTooLong):
        log.Printf("Text exceeds maximum length")
    default:
        log.Printf("Synthesis failed: %v", err)
    }
    return
}
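The rate-limit branch above leaves the retry itself as a comment. A minimal sketch of retrying with exponential backoff follows. `errRateLimited` is a local stand-in for `tts.ErrRateLimited`, and `retryRateLimited` is a hypothetical helper; a production version would likely also honor context cancellation while waiting.

```go
package main

import (
	"errors"
	"time"
)

// errRateLimited is a local stand-in for tts.ErrRateLimited.
var errRateLimited = errors.New("rate limited")

// retryRateLimited calls fn up to attempts times. After each rate-limit
// error it sleeps, doubling the delay each time, starting from base.
// Any other error, or success, is returned immediately.
func retryRateLimited(attempts int, base time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errRateLimited) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}
```

The function passed as `fn` would wrap the `Synthesize` call and store the reader in a variable of the enclosing scope.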

Performance Optimization

Caching

For repeated phrases, cache synthesized audio:

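var (
    cacheMu sync.Mutex // guards cache; maps are not safe for concurrent use
    cache   = make(map[string][]byte)
)

// cacheKey joins the fields with a NUL separator so that distinct
// text/voice/format combinations cannot produce the same key.
// Extend it if you also vary Speed, Model, or other settings.
func cacheKey(text string, config tts.SynthesisConfig) string {
    return strings.Join([]string{text, config.Voice, config.Format.Name}, "\x00")
}

func synthesizeWithCache(ctx context.Context, text string, config tts.SynthesisConfig) ([]byte, error) {
    key := cacheKey(text, config)

    cacheMu.Lock()
    cached, ok := cache[key]
    cacheMu.Unlock()
    if ok {
        return cached, nil
    }

    reader, err := ttsService.Synthesize(ctx, text, config)
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    data, err := io.ReadAll(reader)
    if err != nil {
        return nil, err
    }

    cacheMu.Lock()
    cache[key] = data
    cacheMu.Unlock()
    return data, nil
}

Pre-synthesis

For common responses, synthesize in advance:

greetings := []string{
    "Hello! How can I help you today?",
    "I'm sorry, I didn't catch that.",
    "Is there anything else I can help with?",
}

// Warm the cache through synthesizeWithCache so the entries are stored
// under the same keys that later lookups compute.
for _, text := range greetings {
    if _, err := synthesizeWithCache(ctx, text, config); err != nil {
        log.Printf("Pre-synthesis failed for %q: %v", text, err)
    }
}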
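A plain map grows without bound, which matters for long-running services that synthesize many distinct phrases. The sketch below shows one way to cap memory use: a concurrency-safe cache that evicts the oldest entries once a byte budget is exceeded. `audioCache` is a hypothetical type, and first-in-first-out eviction is chosen for brevity; an LRU policy may suit real traffic better.

```go
package main

import "sync"

// audioCache stores synthesized audio up to maxBytes in total and
// evicts the oldest entries first (FIFO, not LRU).
type audioCache struct {
	mu       sync.Mutex
	maxBytes int
	size     int
	order    []string // keys in insertion order
	items    map[string][]byte
}

func newAudioCache(maxBytes int) *audioCache {
	return &audioCache{maxBytes: maxBytes, items: make(map[string][]byte)}
}

// Get returns the cached audio for key, if present.
func (c *audioCache) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	data, ok := c.items[key]
	return data, ok
}

// Put stores data under key. Entries larger than the whole budget and
// keys that are already present are ignored.
func (c *audioCache) Put(key string, data []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(data) > c.maxBytes {
		return
	}
	if _, exists := c.items[key]; exists {
		return
	}
	c.items[key] = data
	c.order = append(c.order, key)
	c.size += len(data)
	// Evict from the front until the budget is respected again.
	for c.size > c.maxBytes {
		oldest := c.order[0]
		c.order = c.order[1:]
		c.size -= len(c.items[oldest])
		delete(c.items, oldest)
	}
}
```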

Best Practices

Voice Consistency: Use the same voice throughout a conversation
Speed Adjustment: Use a slower rate for complex information and a faster one for casual chat
Format Selection: Use Opus for streaming, MP3 for storage
Error Handling: Handle synthesis failures gracefully, for example by falling back to the text response
Resource Cleanup: Always close readers when done

Cost Considerations

Provider	Pricing Model
OpenAI	Per character
ElevenLabs	Per character (tiers)
Cartesia	Per character

All three providers bill by character count, so estimate your expected character volume and the resulting cost before deploying to production.
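Since pricing is per character, a rough estimate is a matter of multiplying character volume by the rate. The sketch below shows the arithmetic. The function names are hypothetical, the rate is whatever your provider's price list states, and counting Unicode code points is an assumption: providers may count characters differently.

```go
package main

import "unicode/utf8"

// estimateCostUSD returns the cost of synthesizing text at a rate
// expressed in USD per one million characters. Characters are counted
// as Unicode code points, which is an assumption of this sketch.
func estimateCostUSD(text string, usdPerMillionChars float64) float64 {
	chars := utf8.RuneCountInString(text)
	return float64(chars) * usdPerMillionChars / 1_000_000
}

// monthlyCostUSD projects a 30-day bill from average usage figures.
func monthlyCostUSD(responsesPerDay, avgCharsPerResponse int, usdPerMillionChars float64) float64 {
	chars := responsesPerDay * avgCharsPerResponse * 30
	return float64(chars) * usdPerMillionChars / 1_000_000
}
```

For example, 1,000 responses per day averaging 200 characters amount to 6 million characters per 30 days; at a hypothetical rate of 15 USD per million characters, that comes to 90 USD.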

What’s Next

Tutorial 9: Variable Providers - Dynamic context injection