
Multimodal Basics Example

This example demonstrates how to configure and use multimodal media support in PromptKit, with runnable PromptArena tests using both real models and mock providers.

# Copy environment template
cp .env.example .env
# Add your Gemini API key to .env (for real model testing)
# Use >> to append; > would overwrite the template you just copied
echo "GEMINI_API_KEY=your_actual_key" >> .env
# Run with mock provider (no API keys needed - perfect for CI/CD)
promptarena run arena.yaml --provider mock-vision
# Run with real Gemini vision model (requires API key)
promptarena run arena.yaml --provider gemini-vision
# Run all providers and scenarios
promptarena run arena.yaml
# View HTML report
open out/multimodal-report.html
# Or check JSON results
jq . out/results.json
multimodal-basics/
├── arena.yaml # Arena configuration
├── prompts/
│ ├── image-analyzer.yaml # Multimodal prompt with MediaConfig
│ ├── audio-transcriber.yaml # Audio analysis example
│ └── mixed-media-assistant.yaml # Combined media types
├── providers/
│ ├── gemini-vision.yaml # Real Gemini model configuration
│ └── mock-vision.yaml # Mock provider for testing
├── scenarios/
│ ├── image-analysis.yaml # Single image test
│ └── image-comparison.yaml # Multiple images test
├── mock-responses.yaml # Deterministic mock responses
└── .env.example # Environment template
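
The arena.yaml at the root ties these pieces together. A minimal sketch of what it might contain (field names here are illustrative, not the authoritative PromptArena schema):

```yaml
# arena.yaml -- illustrative sketch, not the exact schema
name: multimodal-basics
prompts:
  - prompts/image-analyzer.yaml
providers:
  - providers/gemini-vision.yaml
  - providers/mock-vision.yaml
scenarios:
  - scenarios/image-analysis.yaml
  - scenarios/image-comparison.yaml
output:
  dir: out
```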

Image Analysis (scenarios/image-analysis.yaml)

  • Tests single image description capabilities
  • Validates response includes image-related content
  • Checks length constraints (50-2000 characters)
  • Tests color and composition analysis

Image Comparison (scenarios/image-comparison.yaml)

  • Tests comparing multiple images
  • Validates comparative analysis
  • Checks professional assessment capabilities
  • Tests detailed technical descriptions
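
As a rough sketch, a scenario file pairing user turns with these checks might look like the following. The key names mirror the assertion names that appear in this example's run output (content_includes, min_length, max_length) but are not an authoritative schema:

```yaml
# scenarios/image-analysis.yaml -- illustrative sketch
name: image-analysis
turns:
  - user: "What's in this image?"
    checks:
      - content_includes: ["image", "see"]
      - min_length: 50
      - max_length: 2000
  - user: "Describe the color palette."
    checks:
      - content_includes: ["color"]
      - min_length: 30
```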

Mock Provider (providers/mock-vision.yaml)

  • Deterministic testing without API calls
  • Scenario-specific responses in mock-responses.yaml
  • Turn-by-turn scripted responses
  • Perfect for CI/CD testing and development

Gemini Vision (providers/gemini-vision.yaml)

  • Real multimodal model testing
  • Requires GEMINI_API_KEY in .env
  • Tests actual vision capabilities
  • Validates MediaConfig constraints work with real models
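
A provider file is a small pointer from an arena-level name to a concrete backend. A hedged sketch of what providers/gemini-vision.yaml might contain (every field name and the model identifier are assumptions; consult the PromptArena provider docs for the real schema):

```yaml
# providers/gemini-vision.yaml -- illustrative sketch
name: gemini-vision
type: gemini
model: gemini-2.0-flash
api_key: ${GEMINI_API_KEY}   # resolved from .env at runtime
```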

Each prompt includes a media: section that defines capabilities:

media:
  enabled: true
  supported_types:
    - image
  image:
    max_size_mb: 20
    allowed_formats: [jpeg, png, webp]
    default_detail: high
    max_images_per_msg: 5
  examples:
    - name: "single-image-analysis"
      role: user
      parts:
        - type: text
          text: "What's in this image?"
        - type: image
          media:
            file_path: "./test-images/sample.jpg"
            mime_type: "image/jpeg"
# Test only image analysis scenario
promptarena run arena.yaml --scenario image-analysis
# Test only mock provider
promptarena run arena.yaml --provider mock-vision
# Test with specific prompt config
promptarena run arena.yaml --prompt image-analyzer
# Increase concurrency for faster testing
promptarena run arena.yaml --concurrency 5
# Generate only HTML output
promptarena run arena.yaml --output-format html

The mock-responses.yaml file provides deterministic responses for testing:

scenarios:
  image-analysis:
    turns:
      1: "I can see a landscape with mountains..."
      2: "The color palette is predominantly warm..."
  image-comparison:
    turns:
      1: "Comparing these two images, I can identify..."
      2: "Image 2 appears more professional because..."

Benefits of mock responses:

  • ✅ Predictable test results
  • ✅ No API costs during development
  • ✅ Fast iteration on prompt logic
  • ✅ CI/CD testing without secrets
  • ✅ Scenario-specific responses

MediaConfig defines capabilities only (types, sizes, formats supported).

Actual media files are:

  • ✅ Provided by users as input at runtime
  • ✅ Returned by models as output
  • ❌ NOT bundled into prompt packs

Benefits:

  • Lightweight packs (no embedded media)
  • Clear separation of prompt logic and user data
  • Dynamic media handling at runtime
  • No storage/versioning concerns for media

Multimodal support requires provider-level implementation:

  • Gemini 2.0: Images, audio, video ✅
  • GPT-4V: Images ✅
  • Claude 3: Images ✅

MediaConfig advertises what the prompt supports; actual handling depends on provider capabilities.

MediaConfig is validated at compile time by PackC:

# Compile and validate
packc compile prompts/image-analyzer.yaml -o image-analyzer.pack.json
# Check media configuration in compiled pack
jq '.prompts[].media' image-analyzer.pack.json

Validation checks:

  • ✅ Supported types are valid (image, audio, video)
  • ✅ Size limits are positive numbers
  • ✅ Formats are from allowed lists
  • ✅ Examples have proper structure (role, parts, media)
  • ⚠️ Warns about missing example files (non-blocking)
# Test media validation logic
cd ../../runtime/prompt
go test -v -run TestValidateMediaConfig
# Test pack loading with MediaConfig
go test -v -run TestLoadPackWithMediaConfig

Test coverage:

  • 53 comprehensive test cases
  • 77.8-100% coverage across media validation functions
  • Tests for all media types and configurations
  • Edge cases and error conditions
# Run all scenarios with both providers
promptarena run arena.yaml
# Run with verbose output
promptarena run arena.yaml -v
# Generate detailed HTML report
promptarena run arena.yaml --output-dir ./test-results
Expected output:

Running Arena: multimodal-basics
Loaded 1 prompt configs
Loaded 2 providers
Loaded 2 scenarios
Running scenario: image-analysis (mock-vision)
Turn 1: ✓ Passed (content_includes: "image", "see")
Turn 1: ✓ Passed (min_length: 50 chars)
Turn 1: ✓ Passed (max_length: 2000 chars)
Turn 2: ✓ Passed (content_includes: "color")
Turn 2: ✓ Passed (min_length: 30 chars)
Running scenario: image-comparison (mock-vision)
Turn 1: ✓ Passed (content_includes: "image", "difference")
Turn 1: ✓ Passed (min_length: 100 chars)
Turn 2: ✓ Passed (content_includes: "professional")
Summary:
Total Tests: 8
Passed: 8
Failed: 0
Success Rate: 100%
Report saved to: out/multimodal-report.html
Error: GEMINI_API_KEY not set
Solution: Run cp .env.example .env, then add your key to .env

Error: Provider 'gemini-vision' not found
Solution: Check that providers/gemini-vision.yaml exists and that arena.yaml references it correctly

Check the HTML report for detailed failure information:

open out/multimodal-report.html
# Look for red X marks and click for details

Error: Mock provider returning default response
Solution: Check that the scenario name in arena.yaml matches the corresponding key in mock-responses.yaml

Create test images for actual multimodal testing:

mkdir test-images
# Add sample images: sample-photo.jpg, before.jpg, after.jpg

Add more complex test cases:

  • OCR text extraction from images
  • Object detection validation
  • Style transfer comparison
  • Multi-step image reasoning

Test other media types:

# Copy and modify audio-transcriber.yaml
# Add audio scenarios
# Configure mock responses for audio tests
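
Following the same media: schema shown for images above, an audio configuration might look like this (the format list and size limit are illustrative assumptions, not documented defaults):

```yaml
# prompts/audio-transcriber.yaml (media section) -- illustrative sketch
media:
  enabled: true
  supported_types:
    - audio
  audio:
    max_size_mb: 25                      # assumed limit, adjust per provider
    allowed_formats: [mp3, wav, flac]    # illustrative format list
```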

Use mock provider in automated tests:

# .github/workflows/test.yml
- name: Test Multimodal
  run: |
    cd examples/multimodal-basics
    promptarena run arena.yaml --provider mock-vision