Multimodal SDK Example

This example demonstrates multimodal (vision) capabilities using the PromptKit SDK with streaming responses.

  • Image Analysis: Send images with text prompts for visual analysis
  • Streaming Responses: Get real-time streaming output as the model analyzes images
  • Conversation Context: Follow-up questions maintain context about previously analyzed images
  • Multiple Input Methods: Support for image URLs, file paths, and raw image data

To run the example, you need:

  1. A Google Gemini API key (for vision capabilities)
  2. Go 1.21 or later

Set the API key in your environment:

export GEMINI_API_KEY=your-gemini-api-key

Then run the example from the repository root:

cd sdk/examples/multimodal
go run .

Open a conversation from the pack file, selecting the prompt by name:

conv, err := sdk.Open("./multimodal.pack.json", "vision-analyst")
if err != nil {
	log.Fatalf("Failed to open pack: %v", err)
}
defer conv.Close()
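
Stream a response while attaching an image by URL:

for chunk := range conv.Stream(ctx, "What do you see in this image?",
	sdk.WithImageURL("https://example.com/image.jpg"),
) {
	// Stop on the first error reported by the stream.
	if chunk.Error != nil {
		log.Printf("Error: %v", chunk.Error)
		break
	}
	// ChunkDone marks the end of the response.
	if chunk.Type == sdk.ChunkDone {
		break
	}
	fmt.Print(chunk.Text)
}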

For a single, non-streaming response, use Send instead:

resp, err := conv.Send(ctx, "Describe this image",
	sdk.WithImageURL("https://example.com/image.jpg"),
)
if err != nil {
	log.Fatalf("Error: %v", err)
}
fmt.Println(resp.Text())

The SDK supports multiple ways to provide images:

sdk.WithImageURL("https://example.com/image.jpg")  // image referenced by URL
sdk.WithImageFile("/path/to/local/image.png")      // image at a local file path
sdk.WithImageData(imageBytes, "image/png")         // raw image bytes plus MIME type
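
When passing raw bytes, the second argument is a MIME type. If the type is not known in advance, one option is to sniff it from the data with the standard library's http.DetectContentType. This is a sketch under that assumption; the helper name imageMIME is hypothetical, and the SDK may or may not accept every type the sniffer can return.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// imageMIME is a hypothetical helper (not part of the SDK). It sniffs the
// content type from the leading bytes of data and reports whether the
// result is an image type.
func imageMIME(data []byte) (string, bool) {
	mime := http.DetectContentType(data)
	return mime, strings.HasPrefix(mime, "image/")
}

func main() {
	// The 8-byte PNG signature is enough for the sniffer to recognize PNG.
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	mime, ok := imageMIME(pngHeader)
	fmt.Println(mime, ok)

	mime, ok = imageMIME([]byte("plain text, not an image"))
	fmt.Println(mime, ok)
}
```

The detected string could then be passed as the second argument, for example sdk.WithImageData(imageBytes, mime), after checking that ok is true.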

Multimodal capabilities require a provider that supports vision:

  • Gemini (recommended): Full multimodal support with streaming
  • OpenAI GPT-4V: Vision capabilities with GPT-4 Vision models
  • Claude: Vision support with Claude 3 models

The pack file configures the vision analyst prompt:

{
  "prompts": {
    "vision-analyst": {
      "id": "vision-analyst",
      "name": "Vision Analyst",
      "system_template": "You are an expert visual analyst...",
      "parameters": {
        "temperature": 0.7,
        "max_tokens": 1024
      }
    }
  }
}
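
The prompt name passed to sdk.Open ("vision-analyst") appears to match the key under "prompts" in the pack file. To inspect a pack outside the SDK, for instance to confirm that a prompt exists before opening it, the fragment can be decoded with encoding/json. The struct types and lookupPrompt below are assumptions that cover only the fields shown above; they do not reflect the SDK's own loader or the full pack schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types cover only the fields in the fragment above. They are
// illustrative and are not the SDK's pack schema.
type packParams struct {
	Temperature float64 `json:"temperature"`
	MaxTokens   int     `json:"max_tokens"`
}

type packPrompt struct {
	ID             string     `json:"id"`
	Name           string     `json:"name"`
	SystemTemplate string     `json:"system_template"`
	Parameters     packParams `json:"parameters"`
}

type pack struct {
	Prompts map[string]packPrompt `json:"prompts"`
}

// lookupPrompt decodes raw pack JSON and returns the prompt stored under
// the given key, or an error if the JSON is invalid or the key is absent.
func lookupPrompt(raw []byte, name string) (packPrompt, error) {
	var p pack
	if err := json.Unmarshal(raw, &p); err != nil {
		return packPrompt{}, fmt.Errorf("parse pack: %w", err)
	}
	prompt, ok := p.Prompts[name]
	if !ok {
		return packPrompt{}, fmt.Errorf("prompt %q not found in pack", name)
	}
	return prompt, nil
}

const samplePack = `{
  "prompts": {
    "vision-analyst": {
      "id": "vision-analyst",
      "name": "Vision Analyst",
      "system_template": "You are an expert visual analyst...",
      "parameters": {"temperature": 0.7, "max_tokens": 1024}
    }
  }
}`

func main() {
	prompt, err := lookupPrompt([]byte(samplePack), "vision-analyst")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(prompt.Name, prompt.Parameters.Temperature, prompt.Parameters.MaxTokens)
}
```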

Keep these points in mind when sending images:

  • Image analysis typically requires more tokens than text-only requests
  • Large images may be resized by the provider for processing
  • Some providers have limits on image size and format
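
Because size and format limits differ between providers, a local check before sending can catch obvious problems early. The sketch below uses image.DecodeConfig from the standard library, which reads only the image header. The limits maxBytes and maxSide are placeholder values chosen for illustration, not any provider's documented limits, and checkImage is a hypothetical helper.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/jpeg" // registers the JPEG decoder with the image package
	"image/png"    // registers the PNG decoder and provides png.Encode
)

// Placeholder limits for illustration only. Check your provider's
// documentation for real values.
const (
	maxBytes = 4 << 20 // 4 MiB
	maxSide  = 4096    // pixels
)

// checkImage rejects data that exceeds the placeholder limits or that is
// not a PNG or JPEG image.
func checkImage(data []byte) error {
	if len(data) > maxBytes {
		return fmt.Errorf("image is %d bytes, limit is %d", len(data), maxBytes)
	}
	cfg, format, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return fmt.Errorf("unrecognized image format: %w", err)
	}
	if cfg.Width > maxSide || cfg.Height > maxSide {
		return fmt.Errorf("%s image is %dx%d, side limit is %d",
			format, cfg.Width, cfg.Height, maxSide)
	}
	return nil
}

// samplePNG encodes a blank w x h image so the example needs no file on disk.
func samplePNG(w, h int) []byte {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, w, h))); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	fmt.Println(checkImage(samplePNG(64, 64)))
	fmt.Println(checkImage([]byte("not an image")))
}
```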