This example demonstrates multimodal (vision) capabilities using the PromptKit SDK with streaming responses.
Features
- Image Analysis: Send images with text prompts for visual analysis
- Streaming Responses: Get real-time streaming output as the model analyzes images
- Conversation Context: Follow-up questions maintain context about previously analyzed images
- Multiple Input Methods: Support for image URLs, file paths, and raw image data
Prerequisites
- A Google Gemini API key (for vision capabilities)
- Go 1.21 or later
Setup
export GEMINI_API_KEY=your-gemini-api-key
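A missing or placeholder key is an easy setup mistake to catch before any request is made. The sketch below uses only the Go standard library; `validateKey` is a hypothetical helper written for this illustration and is not part of the PromptKit SDK.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// validateKey is a hypothetical helper (not an SDK function). It rejects an
// empty key and the placeholder value shown in the Setup command.
func validateKey(key string) error {
	key = strings.TrimSpace(key)
	if key == "" {
		return errors.New("GEMINI_API_KEY is not set")
	}
	if key == "your-gemini-api-key" {
		return errors.New("GEMINI_API_KEY still holds the placeholder value")
	}
	return nil
}

func main() {
	// Read the variable exported in the Setup step and report its state.
	if err := validateKey(os.Getenv("GEMINI_API_KEY")); err != nil {
		fmt.Println("setup problem:", err)
		return
	}
	fmt.Println("GEMINI_API_KEY is set")
}
```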
Running the Example
cd sdk/examples/multimodal
go run .
How It Works
Opening a Multimodal Conversation
conv, err := sdk.Open("./multimodal.pack.json", "vision-analyst")
if err != nil {
	log.Fatalf("Failed to open pack: %v", err)
}
defer conv.Close()
Streaming Image Analysis
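for chunk := range conv.Stream(ctx, "What do you see in this image?",
	sdk.WithImageURL("https://example.com/image.jpg"),
) {
	// Stop reading as soon as the stream reports an error.
	if chunk.Error != nil {
		log.Printf("Error: %v", chunk.Error)
		break
	}
	// ChunkDone marks the end of the response.
	if chunk.Type == sdk.ChunkDone {
		break
	}
	// Print each piece of text as it arrives.
	fmt.Print(chunk.Text)
}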
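When the full answer is needed as a single string, for example to log or store it, the same loop can accumulate text instead of printing it. The sketch below is runnable without the SDK: `Chunk`, `ChunkType`, `collect`, `fakeStream`, and `errFake` are local stand-ins that only mirror the fields used in the loop above (`Error`, `Type`, `Text`), and they should not be read as the SDK's actual definitions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ChunkType and Chunk are local stand-ins for illustration only. They mirror
// the fields used in the streaming loop, not the SDK's real types.
type ChunkType int

const (
	ChunkText ChunkType = iota
	ChunkDone
)

type Chunk struct {
	Type  ChunkType
	Text  string
	Error error
}

// errFake is a sample error used to simulate a failed stream.
var errFake = errors.New("stream failed")

// collect drains a chunk stream into one string. On error it returns the
// text received so far together with the error.
func collect(stream <-chan Chunk) (string, error) {
	var sb strings.Builder
	for chunk := range stream {
		if chunk.Error != nil {
			return sb.String(), chunk.Error
		}
		if chunk.Type == ChunkDone {
			break
		}
		sb.WriteString(chunk.Text)
	}
	return sb.String(), nil
}

// fakeStream emits the given chunks on a buffered channel, then closes it.
func fakeStream(chunks ...Chunk) <-chan Chunk {
	ch := make(chan Chunk, len(chunks))
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	return ch
}

func main() {
	text, err := collect(fakeStream(
		Chunk{Text: "A red "},
		Chunk{Text: "bicycle."},
		Chunk{Type: ChunkDone},
	))
	fmt.Printf("%q %v\n", text, err)

	partial, err := collect(fakeStream(Chunk{Text: "partial"}, Chunk{Error: errFake}))
	fmt.Printf("%q %v\n", partial, err)
}
```

Returning the partial text alongside the error lets the caller decide whether an incomplete answer is still worth showing.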
Non-Streaming Image Analysis
resp, err := conv.Send(ctx, "Describe this image",
	sdk.WithImageURL("https://example.com/image.jpg"),
)
if err != nil {
	log.Fatalf("Error: %v", err)
}
fmt.Println(resp.Text())
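Both `Stream` and `Send` take a `ctx` argument that the snippets above do not define. One common way to build it is a standard-library context with a deadline, so a slow image request does not hang indefinitely. In this sketch `newRequestContext` and the 60-second value are illustrative choices, and it is an assumption that the SDK stops work when the context expires.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// newRequestContext is a hypothetical helper that returns a context which
// expires after the given timeout. The caller must call the cancel function.
func newRequestContext(timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), timeout)
}

func main() {
	// 60 seconds is an arbitrary example value, not an SDK recommendation.
	ctx, cancel := newRequestContext(60 * time.Second)
	defer cancel()

	_, hasDeadline := ctx.Deadline()
	fmt.Println("deadline set:", hasDeadline)
}
```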
Image Input Options
The SDK supports multiple ways to provide images:
From URL
sdk.WithImageURL("https://example.com/image.jpg")
From File
sdk.WithImageFile("/path/to/local/image.png")
From Raw Data
sdk.WithImageData(imageBytes, "image/png")
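`WithImageData` takes a MIME type next to the bytes. When the type is not known in advance, the standard library's `http.DetectContentType` can sniff it from the first bytes of the data. The sketch below is a minimal illustration; `imageMIME` is a hypothetical helper, and which MIME types a given provider accepts is not covered here.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// imageMIME is a hypothetical helper (not an SDK function). It sniffs the
// content type and rejects data that is not recognized as an image.
func imageMIME(data []byte) (string, error) {
	mime := http.DetectContentType(data)
	if !strings.HasPrefix(mime, "image/") {
		return "", fmt.Errorf("data does not look like an image (detected %s)", mime)
	}
	return mime, nil
}

func main() {
	// The eight-byte PNG signature is enough for detection.
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	mime, err := imageMIME(pngHeader)
	fmt.Println(mime, err)

	// Non-image data is rejected before it reaches the provider.
	_, err = imageMIME([]byte("plain text"))
	fmt.Println(err)
}
```

The detected string could then be passed as the second argument, as in `sdk.WithImageData(imageBytes, mime)`.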
Supported Providers
Multimodal capabilities require a provider that supports vision:
- Gemini (recommended): Full multimodal support with streaming
- OpenAI: Vision support with GPT-4 Vision (GPT-4V) models
- Claude: Vision support with Claude 3 models
Pack Configuration
The pack file (multimodal.pack.json) defines the vision-analyst prompt that sdk.Open loads:
{
  "prompts": {
    "vision-analyst": {
      "id": "vision-analyst",
      "name": "Vision Analyst",
      "system_template": "You are an expert visual analyst...",
      "parameters": {
        "temperature": 0.7,
        "max_tokens": 1024
      }
    }
  }
}
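To check a pack file before running the example, the JSON above can be decoded with `encoding/json`. The struct types below are inferred from the fields shown in this excerpt only; they are assumptions for illustration and not the SDK's own pack schema, which may contain more fields. `lookupPrompt` and `samplePack` are likewise hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types cover only the fields visible in the excerpt above.
type Parameters struct {
	Temperature float64 `json:"temperature"`
	MaxTokens   int     `json:"max_tokens"`
}

type Prompt struct {
	ID             string     `json:"id"`
	Name           string     `json:"name"`
	SystemTemplate string     `json:"system_template"`
	Parameters     Parameters `json:"parameters"`
}

type Pack struct {
	Prompts map[string]Prompt `json:"prompts"`
}

// lookupPrompt decodes pack JSON and returns the named prompt, or an error
// when the JSON is invalid or the prompt is missing.
func lookupPrompt(data []byte, name string) (Prompt, error) {
	var pack Pack
	if err := json.Unmarshal(data, &pack); err != nil {
		return Prompt{}, fmt.Errorf("parse pack: %w", err)
	}
	p, ok := pack.Prompts[name]
	if !ok {
		return Prompt{}, fmt.Errorf("prompt %q not found in pack", name)
	}
	return p, nil
}

// samplePack repeats the configuration shown above.
const samplePack = `{
  "prompts": {
    "vision-analyst": {
      "id": "vision-analyst",
      "name": "Vision Analyst",
      "system_template": "You are an expert visual analyst...",
      "parameters": {"temperature": 0.7, "max_tokens": 1024}
    }
  }
}`

func main() {
	p, err := lookupPrompt([]byte(samplePack), "vision-analyst")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p.Name, p.Parameters.Temperature, p.Parameters.MaxTokens)
}
```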
Notes
- Image analysis typically requires more tokens than text-only requests
- Large images may be resized by the provider for processing
- Some providers have limits on image size and format
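Because providers may resize or reject large images, it can help to read an image's dimensions locally before sending it. The sketch below uses `image.DecodeConfig` from the standard library, which reads only the header. The `maxDimension` value of 2048 is a made-up limit for illustration, not a documented provider limit, and `checkDimensions` and `blankPNG` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/jpeg" // registers the JPEG decoder
	"image/png"    // PNG encoder; importing it also registers the PNG decoder
)

// maxDimension is a hypothetical limit chosen for this example only.
const maxDimension = 2048

// checkDimensions reads the image header and returns width and height. It
// returns an error when the data is not a decodable image or when either
// side exceeds maxDimension.
func checkDimensions(data []byte) (int, int, error) {
	cfg, format, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return 0, 0, fmt.Errorf("read image header: %w", err)
	}
	if cfg.Width > maxDimension || cfg.Height > maxDimension {
		return cfg.Width, cfg.Height, fmt.Errorf(
			"%s image is %dx%d, above the assumed %d px limit",
			format, cfg.Width, cfg.Height, maxDimension)
	}
	return cfg.Width, cfg.Height, nil
}

// blankPNG builds an in-memory PNG of the given size for demonstration.
func blankPNG(w, h int) []byte {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, w, h))); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	w, h, err := checkDimensions(blankPNG(640, 480))
	fmt.Println(w, h, err)

	_, _, err = checkDimensions(blankPNG(3000, 1))
	fmt.Println(err)
}
```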