Tutorial 8: TTS Integration
Add text-to-speech capabilities to your conversations.
What You’ll Learn
- Setting up TTS providers (OpenAI, ElevenLabs, Cartesia)
- Configuring voice, speed, and audio formats
- Streaming vs single-shot synthesis
- Integrating TTS with conversations
Prerequisites
- Completed Tutorial 7: Audio Sessions
- API key for a TTS provider
TTS Service Interface
All TTS providers implement the same interface:

    type Service interface {
        Name() string
        Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
        SupportedVoices() []Voice
        SupportedFormats() []AudioFormat
    }

Setting Up TTS Providers
OpenAI TTS

    import "github.com/AltairaLabs/PromptKit/runtime/tts"

    // Create OpenAI TTS service
    ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))

    // Available voices: alloy, echo, fable, onyx, nova, shimmer
    // Available models: tts-1 (fast), tts-1-hd (high quality)

ElevenLabs TTS

    import "github.com/AltairaLabs/PromptKit/runtime/tts"

    // Create ElevenLabs TTS service
    ttsService := tts.NewElevenLabs(os.Getenv("ELEVENLABS_API_KEY"))

    // Wide variety of voices available
    // Check SupportedVoices() for options

Cartesia TTS

    import "github.com/AltairaLabs/PromptKit/runtime/tts"

    // Create Cartesia TTS service
    ttsService := tts.NewCartesia(os.Getenv("CARTESIA_API_KEY"))

    // Supports interactive streaming mode for low latency

Synthesis Configuration

    config := tts.SynthesisConfig{
        Voice:    "nova",        // Voice ID
        Format:   tts.FormatMP3, // Output format
        Speed:    1.0,           // Speech rate (0.25-4.0)
        Pitch:    0,             // Pitch adjustment (-20 to 20)
        Language: "en-US",       // Language code
        Model:    "tts-1-hd",    // Model (provider-specific)
    }

Available Formats

| Format | Constant | Use Case |
|---|---|---|
| MP3 | tts.FormatMP3 | Most compatible |
| Opus | tts.FormatOpus | Best for streaming |
| AAC | tts.FormatAAC | Apple devices |
| FLAC | tts.FormatFLAC | Lossless quality |
| PCM | tts.FormatPCM16 | Raw audio processing |
| WAV | tts.FormatWAV | PCM with header |
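
The documented ranges for `Speed` and `Pitch` and the formats in the table can be checked locally before a request is sent, which avoids a billed round trip for an obviously invalid configuration. The sketch below assumes a simplified local `config` struct with a string format name standing in for the `tts.Format*` constants; `validate` and `fileExt` are hypothetical helpers, and the extensions are common conventions rather than anything the library prescribes. It also treats a zero `Speed` as invalid, whereas the real library may interpret zero as "use the default".

```go
package main

import "fmt"

// config is a local stand-in for the relevant tts.SynthesisConfig fields.
type config struct {
	Format string  // stand-in for the tts.Format* constants, e.g. "mp3"
	Speed  float64 // documented range: 0.25-4.0
	Pitch  float64 // documented range: -20 to 20
}

// fileExt maps each format in the table to a commonly used file extension.
var fileExt = map[string]string{
	"mp3":  ".mp3",
	"opus": ".opus",
	"aac":  ".aac",
	"flac": ".flac",
	"pcm":  ".pcm",
	"wav":  ".wav",
}

// validate reports the first problem found, or nil if the config is usable.
func validate(c config) error {
	if _, ok := fileExt[c.Format]; !ok {
		return fmt.Errorf("unknown format %q", c.Format)
	}
	if c.Speed < 0.25 || c.Speed > 4.0 {
		return fmt.Errorf("speed %v outside 0.25-4.0", c.Speed)
	}
	if c.Pitch < -20 || c.Pitch > 20 {
		return fmt.Errorf("pitch %v outside -20 to 20", c.Pitch)
	}
	return nil
}

func main() {
	c := config{Format: "opus", Speed: 1.1}
	if err := validate(c); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("save as greeting" + fileExt[c.Format])
}
```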
Basic Synthesis
Single-Shot Synthesis

    ctx := context.Background()

    config := tts.SynthesisConfig{
        Voice:  "alloy",
        Format: tts.FormatMP3,
        Speed:  1.0,
    }

    // Synthesize text to audio
    reader, err := ttsService.Synthesize(ctx, "Hello, how can I help you?", config)
    if err != nil {
        log.Fatal(err)
    }
    defer reader.Close()

    // Read audio data
    audioData, err := io.ReadAll(reader)
    if err != nil {
        log.Fatal(err)
    }
    // Play or save audioData...

Streaming Synthesis
For lower latency, use streaming synthesis (if supported):

    // Check if provider supports streaming
    streamingService, ok := ttsService.(tts.StreamingService)
    if !ok {
        log.Fatal("Provider doesn't support streaming")
    }

    // Start streaming synthesis
    chunks, err := streamingService.SynthesizeStream(ctx, "Hello, how can I help you?", config)
    if err != nil {
        log.Fatal(err)
    }

    // Process chunks as they arrive
    for chunk := range chunks {
        if chunk.Error != nil {
            log.Printf("Error: %v", chunk.Error)
            break
        }
        playAudioChunk(chunk.Data)
        if chunk.Final {
            break
        }
    }

Integrating with Conversations
VAD Mode with TTS

    sttService := stt.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
    ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))

    // Configure VAD mode with custom TTS settings
    vadConfig := &sdk.VADModeConfig{
        Voice: "nova", // Use Nova voice
        Speed: 1.1,    // Slightly faster
    }

    conv, err := sdk.OpenDuplex("./assistant.pack.json", "voice",
        sdk.WithVADMode(sttService, ttsService, vadConfig),
    )
    if err != nil {
        log.Fatal(err)
    }

Manual TTS in Text Mode

    // Open text conversation
    conv, err := sdk.Open("./assistant.pack.json", "chat")
    if err != nil {
        log.Fatal(err)
    }

    // Send message and get response
    resp, err := conv.Send(ctx, "Tell me a joke")
    if err != nil {
        log.Fatal(err)
    }

    // Synthesize the response
    ttsService := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
    reader, err := ttsService.Synthesize(ctx, resp.Text(), tts.DefaultSynthesisConfig())
    if err != nil {
        log.Fatal(err) // reader is nil on error, so check before deferring Close
    }
    defer reader.Close()

    // Play the audio
    audioData, err := io.ReadAll(reader)
    if err != nil {
        log.Fatal(err)
    }
    playAudio(audioData)

Voice Selection
Listing Available Voices

    voices := ttsService.SupportedVoices()
    for _, voice := range voices {
        fmt.Printf("%s: %s (%s, %s)\n",
            voice.ID,
            voice.Name,
            voice.Language,
            voice.Gender,
        )
    }

Voice Characteristics (OpenAI)

| Voice | Character |
|---|---|
| alloy | Neutral, versatile |
| echo | Warm, smooth |
| fable | Expressive, British |
| onyx | Deep, authoritative |
| nova | Friendly, youthful |
| shimmer | Clear, professional |
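
The fields printed by the listing loop above (`ID`, `Name`, `Language`, `Gender`) can also drive voice selection in code, for example choosing a voice that matches the user's language and falling back to a known default. This is a rough sketch: the `Voice` struct is a local stand-in built from those four fields, `pickVoice` is a hypothetical helper, and the sample voices are made-up data rather than real provider output.

```go
package main

import (
	"fmt"
	"strings"
)

// Voice is a local stand-in with the fields used in the listing example.
type Voice struct{ ID, Name, Language, Gender string }

// pickVoice returns the ID of the first voice whose language starts with
// lang (case-insensitive, so "en" matches "en-US"). It returns fallback
// when lang is empty or nothing matches.
func pickVoice(voices []Voice, lang, fallback string) string {
	want := strings.ToLower(lang)
	if want == "" {
		return fallback
	}
	for _, v := range voices {
		if strings.HasPrefix(strings.ToLower(v.Language), want) {
			return v.ID
		}
	}
	return fallback
}

func main() {
	// Made-up sample data; in practice this would come from SupportedVoices().
	voices := []Voice{
		{ID: "v1", Name: "Anna", Language: "de-DE", Gender: "female"},
		{ID: "v2", Name: "Sam", Language: "en-US", Gender: "male"},
	}
	fmt.Println(pickVoice(voices, "en", "alloy"))
	fmt.Println(pickVoice(voices, "fr", "alloy"))
}
```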
Error Handling

    reader, err := ttsService.Synthesize(ctx, text, config)
    if err != nil {
        switch {
        case errors.Is(err, tts.ErrInvalidVoice):
            log.Printf("Voice '%s' not supported", config.Voice)
        case errors.Is(err, tts.ErrRateLimited):
            log.Printf("Rate limited, retrying...")
            time.Sleep(time.Second)
            // Retry...
        case errors.Is(err, tts.ErrTextTooLong):
            log.Printf("Text exceeds maximum length")
        default:
            log.Printf("Synthesis failed: %v", err)
        }
        return
    }

Performance Optimization
Caching

For repeated phrases, cache synthesized audio:

    // cache maps a voice/format/text key to synthesized audio.
    // A plain map is not safe for concurrent use; guard it with a sync.Mutex
    // if several goroutines synthesize at once.
    var cache = make(map[string][]byte)

    // cacheKey separates the fields so adjacent values cannot run together.
    // Speed and other settings are not part of the key.
    func cacheKey(text string, config tts.SynthesisConfig) string {
        return config.Voice + "|" + config.Format.Name + "|" + text
    }

    func synthesizeWithCache(ctx context.Context, text string, config tts.SynthesisConfig) ([]byte, error) {
        key := cacheKey(text, config)
        if cached, ok := cache[key]; ok {
            return cached, nil
        }

        reader, err := ttsService.Synthesize(ctx, text, config)
        if err != nil {
            return nil, err
        }
        defer reader.Close()

        data, err := io.ReadAll(reader)
        if err != nil {
            return nil, err
        }

        cache[key] = data
        return data, nil
    }

Pre-synthesis

For common responses, synthesize in advance:

    greetings := []string{
        "Hello! How can I help you today?",
        "I'm sorry, I didn't catch that.",
        "Is there anything else I can help with?",
    }

    for _, text := range greetings {
        // Go through synthesizeWithCache so entries are stored under the
        // same key that later lookups use.
        if _, err := synthesizeWithCache(ctx, text, config); err != nil {
            log.Printf("Pre-synthesis failed for %q: %v", text, err)
        }
    }

Best Practices
- Voice Consistency: Use the same voice throughout a conversation
- Speed Adjustment: Slower for complex info, faster for casual chat
- Format Selection: Use Opus for streaming, MP3 for storage
- Error Handling: Gracefully handle synthesis failures
- Resource Cleanup: Always close readers when done
Cost Considerations
| Provider | Pricing Model |
|---|---|
| OpenAI | Per character |
| ElevenLabs | Per character (tiers) |
| Cartesia | Per character |
Estimate costs before production deployment.
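
Since all three providers bill per character, a rough estimate is the expected character count multiplied by the provider's rate. The sketch below assumes a flat price per million characters; the rate in `main` is a placeholder rather than a real price, `estimateCost` is a hypothetical helper, and providers may count characters differently (or apply tiers), so check the current pricing pages before relying on the result.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateCost returns the cost in USD of synthesizing texts at a flat
// per-million-character rate. Characters are counted as Unicode code points.
func estimateCost(texts []string, usdPerMillionChars float64) float64 {
	chars := 0
	for _, t := range texts {
		chars += utf8.RuneCountInString(t)
	}
	return float64(chars) * usdPerMillionChars / 1_000_000
}

func main() {
	responses := []string{
		"Hello! How can I help you today?",
		"Is there anything else I can help with?",
	}
	const placeholderRate = 10.0 // USD per million characters; not a real price
	fmt.Printf("estimated cost: $%.6f\n", estimateCost(responses, placeholderRate))
}
```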
What’s Next
- Tutorial 9: Variable Providers - Dynamic context injection
See Also
- TTS API Reference - Complete API documentation
- Audio Sessions Tutorial - Full voice integration