Multimodal SDK Example
This example demonstrates multimodal (vision) capabilities using the PromptKit SDK with streaming responses.
Features
- Image Analysis: Send images with text prompts for visual analysis
- Streaming Responses: Get real-time streaming output as the model analyzes images
- Conversation Context: Follow-up questions maintain context about previously analyzed images
- Multiple Input Methods: Support for image URLs, file paths, and raw image data
Prerequisites
- A Google Gemini API key (for vision capabilities)
- Go 1.21 or later

    export GEMINI_API_KEY=your-gemini-api-key

Running the Example

    cd sdk/examples/multimodal
    go run .

How It Works
Opening a Multimodal Conversation

    conv, err := sdk.Open("./multimodal.pack.json", "vision-analyst")
    if err != nil {
        log.Fatalf("Failed to open pack: %v", err)
    }
    defer conv.Close()

Streaming Image Analysis

    for chunk := range conv.Stream(ctx, "What do you see in this image?",
        sdk.WithImageURL("https://example.com/image.jpg"),
    ) {
        if chunk.Error != nil {
            log.Printf("Error: %v", chunk.Error)
            break
        }
        if chunk.Type == sdk.ChunkDone {
            break
        }
        fmt.Print(chunk.Text)
    }

Non-Streaming Image Analysis

    resp, err := conv.Send(ctx, "Describe this image",
        sdk.WithImageURL("https://example.com/image.jpg"),
    )
    if err != nil {
        log.Fatalf("Error: %v", err)
    }
    fmt.Println(resp.Text())

Image Input Options
The SDK supports multiple ways to provide images:
From URL

    sdk.WithImageURL("https://example.com/image.jpg")

From File

    sdk.WithImageFile("/path/to/local/image.png")

From Raw Data

    sdk.WithImageData(imageBytes, "image/png")

Supported Providers
Multimodal capabilities require a provider that supports vision:
- Gemini (recommended): Full multimodal support with streaming
- OpenAI GPT-4V: Vision capabilities with GPT-4 Vision models
- Claude: Vision support with Claude 3 models
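
Since each provider in the list above needs its own API key, it can help to check which keys are present before opening a conversation. The following is a minimal sketch using only the Go standard library. Only `GEMINI_API_KEY` comes from this example; `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are assumed names, and the helper `configuredProviders` is hypothetical and not part of the SDK.

```go
package main

import (
	"fmt"
	"os"
)

// providerKeys pairs each vision-capable provider with the environment
// variable expected to hold its API key. Only GEMINI_API_KEY appears in
// this example; the other two names are assumptions.
var providerKeys = []struct{ Provider, EnvVar string }{
	{"Gemini", "GEMINI_API_KEY"},
	{"OpenAI", "OPENAI_API_KEY"},    // assumed variable name
	{"Claude", "ANTHROPIC_API_KEY"}, // assumed variable name
}

// configuredProviders returns the providers whose key variable is non-empty.
// The lookup function is injected so the check can be exercised without
// touching the real environment.
func configuredProviders(getenv func(string) string) []string {
	var found []string
	for _, pk := range providerKeys {
		if getenv(pk.EnvVar) != "" {
			found = append(found, pk.Provider)
		}
	}
	return found
}

func main() {
	found := configuredProviders(os.Getenv)
	if len(found) == 0 {
		fmt.Println("no vision provider API key found in the environment")
		return
	}
	fmt.Println("API keys found for:", found)
}
```

Running this before `go run .` gives an early, readable failure instead of a provider error at the first request.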
Pack Configuration
The pack file configures the vision analyst prompt:

    {
      "prompts": {
        "vision-analyst": {
          "id": "vision-analyst",
          "name": "Vision Analyst",
          "system_template": "You are an expert visual analyst...",
          "parameters": {
            "temperature": 0.7,
            "max_tokens": 1024
          }
        }
      }
    }

- Image analysis typically requires more tokens than text-only requests
- Large images may be resized by the provider for processing
- Some providers have limits on image size and format
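
The last point above, that some providers limit image size and format, can be checked locally before any bytes are sent. The sketch below uses only the Go standard library: `http.DetectContentType` sniffs the MIME type from the leading bytes. The 4 MiB limit, the allowed type list, and the helper `checkImage` are illustrative assumptions, not values or functions taken from the SDK or any provider.

```go
package main

import (
	"fmt"
	"net/http"
)

// maxImageBytes is a hypothetical size limit; real limits vary by provider.
const maxImageBytes = 4 << 20 // 4 MiB

// allowedTypes is an assumed set of accepted MIME types.
var allowedTypes = map[string]bool{
	"image/png":  true,
	"image/jpeg": true,
	"image/gif":  true,
	"image/webp": true,
}

// checkImage returns the sniffed MIME type of data, or an error if the data
// is empty, larger than maxImageBytes, or not one of allowedTypes.
func checkImage(data []byte) (string, error) {
	if len(data) == 0 {
		return "", fmt.Errorf("image is empty")
	}
	if len(data) > maxImageBytes {
		return "", fmt.Errorf("image is %d bytes, limit is %d", len(data), maxImageBytes)
	}
	mime := http.DetectContentType(data) // inspects at most the first 512 bytes
	if !allowedTypes[mime] {
		return "", fmt.Errorf("unsupported content type %q", mime)
	}
	return mime, nil
}

func main() {
	// The 8-byte PNG signature is enough for content sniffing.
	sample := []byte("\x89PNG\r\n\x1a\n")
	mime, err := checkImage(sample)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("ok:", mime) // prints "ok: image/png"
}
```

In a real program the bytes would come from a file or upload, and the returned MIME type could be passed along with them to `sdk.WithImageData`.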