
Providers

LLM provider implementations with unified API.

PromptKit supports multiple LLM providers through a common interface:

  • OpenAI: GPT-4, GPT-4o, GPT-3.5
  • Anthropic Claude: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
  • Google Gemini: Gemini 1.5 Pro, Gemini 1.5 Flash
  • Ollama: Local LLMs (Llama, Mistral, LLaVA, DeepSeek)
  • vLLM: High-performance inference (GPU-accelerated, high-throughput)
  • Mock: Testing and development

All providers implement the Provider interface for text completion and the ToolSupport interface for function calling.

type Provider interface {
    ID() string
    Predict(ctx context.Context, req PredictionRequest) (PredictionResponse, error)

    // Streaming
    PredictStream(ctx context.Context, req PredictionRequest) (<-chan StreamChunk, error)
    SupportsStreaming() bool

    ShouldIncludeRawOutput() bool
    Close() error

    // Cost calculation
    CalculateCost(inputTokens, outputTokens, cachedTokens int) types.CostInfo
}

type ToolSupport interface {
    Provider

    // Convert tools to provider format
    BuildTooling(descriptors []*ToolDescriptor) (interface{}, error)

    // Execute with tools
    PredictWithTools(
        ctx context.Context,
        req PredictionRequest,
        tools interface{},
        toolChoice string,
    ) (PredictionResponse, []types.MessageToolCall, error)
}
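
Because ToolSupport embeds Provider, calling code can accept a plain Provider and upgrade to tool calling when it is available. A minimal sketch, assuming the interfaces above (the runWithOptionalTools helper is illustrative, not part of the library):

// runWithOptionalTools uses tool calling when the provider supports it
// and falls back to plain prediction otherwise.
func runWithOptionalTools(
    ctx context.Context,
    p providers.Provider,
    req providers.PredictionRequest,
    descs []*providers.ToolDescriptor,
) (providers.PredictionResponse, error) {
    if tp, ok := p.(providers.ToolSupport); ok && len(descs) > 0 {
        tools, err := tp.BuildTooling(descs)
        if err != nil {
            return providers.PredictionResponse{}, err
        }
        resp, _, err := tp.PredictWithTools(ctx, req, tools, "auto")
        return resp, err
    }
    return p.Predict(ctx, req)
}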

Providers that support images, audio, or video inputs implement MultimodalSupport:

type MultimodalSupport interface {
    Provider

    // Get supported multimodal capabilities
    GetMultimodalCapabilities() MultimodalCapabilities

    // Execute with multimodal content
    PredictMultimodal(ctx context.Context, req PredictionRequest) (PredictionResponse, error)

    // Stream with multimodal content
    PredictMultimodalStream(ctx context.Context, req PredictionRequest) (<-chan StreamChunk, error)
}

MultimodalCapabilities:

type MultimodalCapabilities struct {
    SupportsImages bool     // Can process image inputs
    SupportsAudio  bool     // Can process audio inputs
    SupportsVideo  bool     // Can process video inputs
    ImageFormats   []string // Supported image MIME types
    AudioFormats   []string // Supported audio MIME types
    VideoFormats   []string // Supported video MIME types
    MaxImageSizeMB int      // Max image size (0 = unlimited/unknown)
    MaxAudioSizeMB int      // Max audio size (0 = unlimited/unknown)
    MaxVideoSizeMB int      // Max video size (0 = unlimited/unknown)
}
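
A short sketch of checking an incoming image against these fields before sending it to a provider (the checkImage helper and the size argument are illustrative, not library API):

// checkImage is a hypothetical pre-flight check against reported capabilities.
func checkImage(caps providers.MultimodalCapabilities, mimeType string, sizeMB int) error {
    if !caps.SupportsImages {
        return fmt.Errorf("provider does not accept image input")
    }
    supported := false
    for _, f := range caps.ImageFormats {
        if f == mimeType {
            supported = true
            break
        }
    }
    if !supported {
        return fmt.Errorf("unsupported image MIME type: %s", mimeType)
    }
    if caps.MaxImageSizeMB > 0 && sizeMB > caps.MaxImageSizeMB {
        return fmt.Errorf("image exceeds %d MB limit", caps.MaxImageSizeMB)
    }
    return nil
}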

Provider Multimodal Support:

| Provider | Images | Audio | Video | Notes |
|----------|--------|-------|-------|-------|
| OpenAI GPT-4o/4o-mini | ✅ | ❌ | ❌ | JPEG, PNG, GIF, WebP |
| Anthropic Claude 3.5 | ✅ | ❌ | ❌ | JPEG, PNG, GIF, WebP |
| Google Gemini 1.5 | ✅ | ✅ | ✅ | Full multimodal support |
| Ollama (LLaVA, Llama 3.2 Vision) | ✅ | ❌ | ❌ | JPEG, PNG, GIF, WebP |
| vLLM (LLaVA, vision models) | ✅ | ❌ | ❌ | JPEG, PNG, GIF, WebP, 20MB limit |

Helper Functions:

// Check if provider supports multimodal
func SupportsMultimodal(p Provider) bool
// Get multimodal provider (returns nil if not supported)
func GetMultimodalProvider(p Provider) MultimodalSupport
// Check specific media type support
func HasImageSupport(p Provider) bool
func HasAudioSupport(p Provider) bool
func HasVideoSupport(p Provider) bool
// Check format compatibility
func IsFormatSupported(p Provider, contentType string, mimeType string) bool
// Validate message compatibility
func ValidateMultimodalMessage(p Provider, msg types.Message) error

Usage Example:

// Check capabilities
if providers.HasImageSupport(provider) {
    caps := providers.GetMultimodalProvider(provider).GetMultimodalCapabilities()
    fmt.Printf("Max image size: %d MB\n", caps.MaxImageSizeMB)
}

// Send multimodal request
req := providers.PredictionRequest{
    System: "You are a helpful assistant.",
    Messages: []types.Message{
        {
            Role: "user",
            Parts: []types.ContentPart{
                {Type: "text", Text: "What's in this image?"},
                {
                    Type: "image",
                    Media: &types.MediaContent{
                        Type:     "image",
                        MIMEType: "image/jpeg",
                        Data:     imageBase64,
                    },
                },
            },
        },
    },
}

if mp := providers.GetMultimodalProvider(provider); mp != nil {
    resp, err := mp.PredictMultimodal(ctx, req)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Content)
}

Providers that support both multimodal content and function calling implement MultimodalToolSupport:

type MultimodalToolSupport interface {
    MultimodalSupport
    ToolSupport

    // Execute with both multimodal content and tools
    PredictMultimodalWithTools(
        ctx context.Context,
        req PredictionRequest,
        tools interface{},
        toolChoice string,
    ) (PredictionResponse, []types.MessageToolCall, error)
}

Usage Example:

// Use images with tool calls
tools, _ := provider.BuildTooling(toolDescriptors)
resp, toolCalls, err := provider.PredictMultimodalWithTools(
    ctx,
    multimodalRequest,
    tools,
    "auto",
)
if err != nil {
    log.Fatal(err)
}

// Response contains both text and any tool calls
fmt.Println(resp.Content)
for _, call := range toolCalls {
    fmt.Printf("Tool called: %s\n", call.Name)
}

type PredictionRequest struct {
    System      string
    Messages    []types.Message
    Temperature float32
    TopP        float32
    MaxTokens   int
    Seed        *int
    Metadata    map[string]interface{}
}

type PredictionResponse struct {
    Content    string
    Parts      []types.ContentPart // Multimodal content
    CostInfo   *types.CostInfo
    Latency    time.Duration
    Raw        []byte      // Raw API response
    RawRequest interface{} // Raw API request
    ToolCalls  []types.MessageToolCall
}

type ProviderDefaults struct {
    Temperature float32
    TopP        float32
    MaxTokens   int
    Pricing     Pricing
}

type Pricing struct {
    InputCostPer1K  float64 // Per 1K input tokens
    OutputCostPer1K float64 // Per 1K output tokens
}
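
The per-1K rates in Pricing are what CalculateCost uses to produce a CostInfo. A rough sketch of the arithmetic, assuming a simple linear model with no separate cached-token rate (the token counts are made up):

// Hypothetical illustration of how per-1K pricing maps to a request cost.
pricing := providers.Pricing{
    InputCostPer1K:  0.00015, // $0.15 per 1M input tokens
    OutputCostPer1K: 0.0006,  // $0.60 per 1M output tokens
}
inputTokens, outputTokens := 1200, 350
inputCost := float64(inputTokens) / 1000 * pricing.InputCostPer1K
outputCost := float64(outputTokens) / 1000 * pricing.OutputCostPer1K
fmt.Printf("estimated cost: $%.6f\n", inputCost+outputCost)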

func NewOpenAIProvider(
    id string,
    model string,
    baseURL string,
    defaults ProviderDefaults,
    includeRawOutput bool,
) *OpenAIProvider

Parameters:

  • id: Provider identifier (e.g., “openai-gpt4”)
  • model: Model name (e.g., “gpt-4o-mini”, “gpt-4-turbo”)
  • baseURL: Custom API URL (empty for default https://api.openai.com/v1)
  • defaults: Default parameters and pricing
  • includeRawOutput: Include raw API response in output

Environment:

  • OPENAI_API_KEY: Required API key

Example:

provider := openai.NewOpenAIProvider(
    "openai",
    "gpt-4o-mini",
    "", // Use default URL
    openai.DefaultProviderDefaults(),
    false,
)
defer provider.Close()

| Model | Context | Cost (Input/Output per 1M tokens) |
|-------|---------|-----------------------------------|
| gpt-4o | 128K | $2.50 / $10.00 |
| gpt-4o-mini | 128K | $0.15 / $0.60 |
| gpt-4-turbo | 128K | $10.00 / $30.00 |
| gpt-4 | 8K | $30.00 / $60.00 |
| gpt-3.5-turbo | 16K | $0.50 / $1.50 |

  • ✅ Streaming support
  • ✅ Function calling
  • ✅ Multimodal (vision)
  • ✅ JSON mode
  • ✅ Seed for reproducibility (see the example after this list)
  • ✅ Token counting
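
The seed support listed above is exposed through the Seed field on PredictionRequest. A brief sketch (whether outputs are fully deterministic still depends on the model backend):

// Request reproducible sampling by pinning a seed.
seed := 42
req := providers.PredictionRequest{
    System:      "You are a helpful assistant.",
    Messages:    []types.Message{{Role: "user", Content: "Give me a random number."}},
    Temperature: 0.7,
    Seed:        &seed, // same seed + same request should yield similar output
}
resp, err := provider.Predict(ctx, req)
if err != nil {
    log.Fatal(err)
}
fmt.Println(resp.Content)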

// Create tool provider
toolProvider := openai.NewOpenAIToolProvider(
    "openai",
    "gpt-4o-mini",
    "",
    openai.DefaultProviderDefaults(),
    false,
    nil, // Additional config
)

// Build tools in OpenAI format
tools, err := toolProvider.BuildTooling(toolDescriptors)
if err != nil {
    log.Fatal(err)
}

// Execute with tools
response, toolCalls, err := toolProvider.PredictWithTools(
    ctx,
    req,
    tools,
    "auto", // Tool choice: "auto", "required", "none", or specific tool
)

func NewClaudeProvider(
    id string,
    model string,
    baseURL string,
    defaults ProviderDefaults,
    includeRawOutput bool,
) *ClaudeProvider

Environment:

  • ANTHROPIC_API_KEY: Required API key

Example:

provider := claude.NewClaudeProvider(
    "claude",
    "claude-3-5-sonnet-20241022",
    "", // Use default URL
    claude.DefaultProviderDefaults(),
    false,
)
defer provider.Close()

| Model | Context | Cost (Input/Output per 1M tokens) |
|-------|---------|-----------------------------------|
| claude-3-5-sonnet-20241022 | 200K | $3.00 / $15.00 |
| claude-3-opus-20240229 | 200K | $15.00 / $75.00 |
| claude-3-haiku-20240307 | 200K | $0.25 / $1.25 |

  • ✅ Streaming support
  • ✅ Tool calling
  • ✅ Multimodal (vision)
  • ✅ Extended context (200K tokens)
  • ✅ Prompt caching (see the cost sketch after this list)
  • ✅ System prompts
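
Prompt caching feeds into cost tracking through the cachedTokens argument of CalculateCost. A hedged sketch with made-up token counts (how cached tokens are discounted is up to the provider implementation):

// Hypothetical usage numbers: a large cached system prompt plus a short new turn.
costInfo := provider.CalculateCost(
    2000,  // fresh input tokens for this turn
    600,   // output tokens
    10000, // tokens served from the prompt cache
)
fmt.Printf("Cost with prompt caching: $%.6f\n", costInfo.TotalCost)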

toolProvider := claude.NewClaudeToolProvider(
    "claude",
    "claude-3-5-sonnet-20241022",
    "",
    claude.DefaultProviderDefaults(),
    false,
)
response, toolCalls, err := toolProvider.PredictWithTools(ctx, req, tools, "auto")

func NewGeminiProvider(
    id string,
    model string,
    baseURL string,
    defaults ProviderDefaults,
    includeRawOutput bool,
) *GeminiProvider

Environment:

  • GEMINI_API_KEY: Required API key

Example:

provider := gemini.NewGeminiProvider(
    "gemini",
    "gemini-1.5-flash",
    "",
    gemini.DefaultProviderDefaults(),
    false,
)
defer provider.Close()

| Model | Context | Cost (Input/Output per 1M tokens) |
|-------|---------|-----------------------------------|
| gemini-1.5-pro | 2M | $1.25 / $5.00 |
| gemini-1.5-flash | 1M | $0.075 / $0.30 |

  • ✅ Streaming support
  • ✅ Function calling
  • ✅ Multimodal (vision, audio, video)
  • ✅ Extended context (up to 2M tokens)
  • ✅ Grounding with Google Search

toolProvider := gemini.NewGeminiToolProvider(
    "gemini",
    "gemini-1.5-flash",
    "",
    gemini.DefaultProviderDefaults(),
    false,
)
response, toolCalls, err := toolProvider.PredictWithTools(ctx, req, tools, "auto")
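
Since Gemini is the only provider above with audio and video support, an audio request looks like the earlier image example with a different part type. A sketch, assuming the same MediaContent fields shown in the multimodal usage example (audioBase64 is a placeholder):

req := providers.PredictionRequest{
    Messages: []types.Message{{
        Role: "user",
        Parts: []types.ContentPart{
            {Type: "text", Text: "Transcribe this clip."},
            {
                Type: "audio",
                Media: &types.MediaContent{
                    Type:     "audio",
                    MIMEType: "audio/mpeg",
                    Data:     audioBase64,
                },
            },
        },
    }},
}
if mp := providers.GetMultimodalProvider(provider); mp != nil {
    resp, err := mp.PredictMultimodal(ctx, req)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Content)
}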

The mock provider is for testing and development.

func NewMockProvider(
    id string,
    model string,
    includeRawOutput bool,
) *MockProvider

Example:

provider := mock.NewMockProvider("mock", "test-model", false)

// Configure responses
provider.AddResponse("Hello", "Hi there!")
provider.AddResponse("What is 2+2?", "4")

// Custom response repository
repo := &CustomMockRepository{
    responses: map[string]string{
        "hello": "Hello! How can I help?",
        "bye":   "Goodbye!",
    },
}
provider := mock.NewMockProviderWithRepository("mock", "test-model", false, repo)

toolProvider := mock.NewMockToolProvider("mock", "test-model", false, nil)

// Configure tool call responses
toolProvider.ConfigureToolResponse("get_weather", `{"temp": 72, "conditions": "sunny"}`)
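
A minimal test sketch using the mock provider (the test name and assertions are illustrative; it assumes AddResponse matches on the user message as shown above):

func TestGreeting(t *testing.T) {
    provider := mock.NewMockProvider("mock", "test-model", false)
    defer provider.Close()
    provider.AddResponse("Hello", "Hi there!")

    resp, err := provider.Predict(context.Background(), providers.PredictionRequest{
        Messages: []types.Message{{Role: "user", Content: "Hello"}},
    })
    if err != nil {
        t.Fatalf("predict failed: %v", err)
    }
    if resp.Content != "Hi there!" {
        t.Errorf("unexpected response: %q", resp.Content)
    }
}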

Run local LLMs with zero API costs using Ollama. Uses the OpenAI-compatible /v1/chat/completions endpoint.

func NewOllamaProvider(
    id string,
    model string,
    baseURL string,
    defaults ProviderDefaults,
    includeRawOutput bool,
    additionalConfig map[string]interface{},
) *OllamaProvider

Parameters:

  • id: Provider identifier (e.g., “ollama-llama”)
  • model: Model name (e.g., “llama3.2:1b”, “mistral”, “llava”)
  • baseURL: Ollama server URL (default: http://localhost:11434)
  • defaults: Default parameters and pricing (typically zero cost)
  • includeRawOutput: Include raw API response in output
  • additionalConfig: Extra options including keep_alive for model persistence

Environment:

  • No API key required (local inference)
  • OLLAMA_HOST: Optional, alternative to baseURL parameter

Example:

provider := ollama.NewOllamaProvider(
    "ollama",
    "llama3.2:1b",
    "http://localhost:11434",
    ollama.DefaultProviderDefaults(),
    false,
    map[string]interface{}{
        "keep_alive": "5m", // Keep model loaded for 5 minutes
    },
)
defer provider.Close()

Any model available via ollama pull. Common models include:

| Model | Context | Cost |
|-------|---------|------|
| llama3.2:1b | 128K | Free (local) |
| llama3.2:3b | 128K | Free (local) |
| llama3.1:8b | 128K | Free (local) |
| mistral | 32K | Free (local) |
| deepseek-r1:8b | 64K | Free (local) |
| phi3:mini | 128K | Free (local) |
| llava | 4K | Free (local) |
| llama3.2-vision | 128K | Free (local) |

Run ollama list to see installed models, or ollama pull <model> to download new ones.

  • ✅ Streaming support
  • ✅ Function calling (tool use)
  • ✅ Multimodal (vision) - LLaVA, Llama 3.2 Vision
  • ✅ Zero cost (local inference)
  • ✅ Model persistence (keep_alive parameter)
  • ✅ OpenAI-compatible API
  • ❌ No API key required
toolProvider := ollama.NewOllamaToolProvider(
    "ollama",
    "llama3.2:1b",
    "http://localhost:11434",
    ollama.DefaultProviderDefaults(),
    false,
    map[string]interface{}{"keep_alive": "5m"},
)

// Build tools in OpenAI format
tools, err := toolProvider.BuildTooling(toolDescriptors)
if err != nil {
    log.Fatal(err)
}

// Execute with tools
response, toolCalls, err := toolProvider.PredictWithTools(
    ctx,
    req,
    tools,
    "auto", // Tool choice: "auto", "required", "none"
)

spec:
  id: "ollama-llama"
  type: ollama
  model: llama3.2:1b
  base_url: "http://localhost:11434"
  additional_config:
    keep_alive: "5m"

Run Ollama with Docker Compose:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 10s
      timeout: 30s
      retries: 5
      start_period: 30s

volumes:
  ollama_data:

Then pull a model:

docker exec ollama ollama pull llama3.2:1b

High-throughput inference engine optimized for GPU-accelerated LLM serving. Supports guided decoding, advanced sampling strategies, and efficient batching for maximum performance.

func NewVLLMProvider(
    id string,
    model string,
    baseURL string,
    defaults ProviderDefaults,
    includeRawOutput bool,
    additionalConfig map[string]interface{},
) *VLLMProvider

Parameters:

  • id: Provider identifier (e.g., “vllm-llama”)
  • model: Model name served by vLLM (e.g., “meta-llama/Llama-3.2-1B-Instruct”)
  • baseURL: vLLM server URL (default: http://localhost:8000)
  • defaults: Default parameters and pricing (typically zero cost for self-hosted)
  • includeRawOutput: Include raw API response in output
  • additionalConfig: vLLM-specific options (guided decoding, beam search, etc.)

Environment:

  • No API key required (self-hosted inference)
  • VLLM_API_KEY: Optional API key if vLLM server configured with authentication

Example:

provider := vllm.NewVLLMProvider(
    "vllm",
    "meta-llama/Llama-3.2-1B-Instruct",
    "http://localhost:8000",
    vllm.DefaultProviderDefaults(),
    false,
    map[string]interface{}{
        "use_beam_search": false,
        "best_of":         1,
    },
)
defer provider.Close()

Any model supported by vLLM from HuggingFace. Common models include:

| Model | Context | Features | Cost |
|-------|---------|----------|------|
| meta-llama/Llama-3.2-1B-Instruct | 128K | Fast, compact | Free (self-hosted) |
| meta-llama/Llama-3.2-3B-Instruct | 128K | Balanced | Free (self-hosted) |
| meta-llama/Llama-3.1-8B-Instruct | 128K | High quality | Free (self-hosted) |
| mistralai/Mistral-7B-Instruct-v0.3 | 32K | Good performance | Free (self-hosted) |
| Qwen/Qwen2.5-7B-Instruct | 128K | Multilingual | Free (self-hosted) |
| microsoft/Phi-3-mini-128k-instruct | 128K | Efficient | Free (self-hosted) |
| llava-hf/llava-v1.6-mistral-7b-hf | 4K | Vision support | Free (self-hosted) |

Check vLLM documentation for full model compatibility list.

  • ✅ Streaming support (SSE)
  • ✅ Function calling (tool use)
  • ✅ Multimodal (vision) - LLaVA and vision-capable models
  • ✅ GPU acceleration (CUDA, ROCm)
  • ✅ Guided decoding (JSON schema, regex, grammar)
  • ✅ Beam search for higher quality
  • ✅ Tensor parallelism for large models
  • ✅ PagedAttention for memory efficiency
  • ✅ Continuous batching for throughput
  • ✅ OpenAI-compatible API
  • ✅ Zero API costs (self-hosted)
  • ❌ No API key required

Additional configuration options beyond standard OpenAI parameters:

additionalConfig := map[string]interface{}{
    // Sampling & Quality
    "use_beam_search": false, // Enable beam search (slower, higher quality)
    "best_of":         1,     // Generate N candidates, return best
    "ignore_eos":      false, // Continue past EOS token

    // Guided Decoding
    "guided_json":               jsonSchema,            // Force JSON output matching schema
    "guided_regex":              "^[0-9]+$",            // Force output matching regex
    "guided_choice":             []string{"yes", "no"}, // Force choice from options
    "guided_grammar":            bnfGrammar,            // Force output matching BNF grammar
    "guided_whitespace_pattern": nil,                   // Custom whitespace handling
}

Example with Guided JSON:

// Force JSON output matching schema
jsonSchema := `{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"}
  },
  "required": ["name", "age"]
}`

provider := vllm.NewVLLMProvider(
    "vllm",
    "meta-llama/Llama-3.2-3B-Instruct",
    "http://localhost:8000",
    vllm.DefaultProviderDefaults(),
    false,
    map[string]interface{}{
        "guided_json": jsonSchema,
    },
)

toolProvider := vllm.NewVLLMToolProvider(
    "vllm",
    "meta-llama/Llama-3.2-1B-Instruct",
    "http://localhost:8000",
    vllm.DefaultProviderDefaults(),
    false,
    nil,
)

// Build tools in OpenAI format
tools, err := toolProvider.BuildTooling(toolDescriptors)
if err != nil {
    log.Fatal(err)
}

// Execute with tools
response, toolCalls, err := toolProvider.PredictWithTools(
    ctx,
    req,
    tools,
    "auto", // Tool choice: "auto", "required", "none", or specific tool name
)

spec:
  id: "vllm-llama"
  type: vllm
  model: meta-llama/Llama-3.2-3B-Instruct
  base_url: "http://localhost:8000"
  additional_config:
    use_beam_search: false
    best_of: 1

Run vLLM with Docker:

# CPU-only (for testing)
docker run --rm -it \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --max-model-len 2048

# GPU-accelerated (recommended for production)
docker run --rm -it \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --dtype half \
  --max-model-len 4096

Docker Compose:

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - vllm_cache:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Llama-3.2-3B-Instruct
      - --dtype
      - half
      - --max-model-len
      - "4096"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:

Tensor Parallelism (multi-GPU):

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2  # Split across 2 GPUs

Memory Optimization:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192  # Use 90% of GPU memory, cap context at 8192 tokens

Quantization (reduce memory):

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq  # or 'gptq', 'squeezellm'

| Feature | vLLM | Ollama |
|---------|------|--------|
| Target Use Case | High-performance GPU inference | Local development, ease of use |
| GPU Acceleration | Required (CUDA/ROCm) | Optional |
| Throughput | Very high (continuous batching) | Moderate |
| Model Loading | HuggingFace models directly | ollama pull model management |
| Guided Decoding | ✅ JSON schema, regex, grammar | ❌ |
| Beam Search | ✅ | ❌ |
| Tensor Parallelism | ✅ Multi-GPU support | ❌ |
| Quantization | ✅ AWQ, GPTQ, SqueezeLLM | ✅ GGUF format |
| API Compatibility | OpenAI-compatible | OpenAI-compatible |
| Setup Complexity | Moderate (GPU drivers, Docker) | Low (single binary) |
| Memory Efficiency | PagedAttention | Standard |
| Cost | Free (self-hosted) | Free (self-hosted) |

When to use vLLM:

  • GPU-accelerated inference for performance
  • Multi-GPU setups for large models
  • Need structured output (guided decoding)
  • Batch processing workloads requiring high throughput
  • Advanced sampling strategies (beam search)

When to use Ollama:

  • Local development and testing
  • CPU-only environments
  • Quick model experimentation
  • Simpler setup requirements
import (
    "context"
    "fmt"
    "log"

    "github.com/AltairaLabs/PromptKit/runtime/providers"
    "github.com/AltairaLabs/PromptKit/runtime/providers/openai"
    // plus the PromptKit types package for types.Message et al.
)

provider := openai.NewOpenAIProvider(
    "openai",
    "gpt-4o-mini",
    "",
    openai.DefaultProviderDefaults(),
    false,
)
defer provider.Close()

req := providers.PredictionRequest{
    System: "You are a helpful assistant.",
    Messages: []types.Message{
        {Role: "user", Content: "What is 2+2?"},
    },
    Temperature: 0.7,
    MaxTokens:   100,
}

ctx := context.Background()
response, err := provider.Predict(ctx, req)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Response: %s\n", response.Content)
fmt.Printf("Cost: $%.6f\n", response.CostInfo.TotalCost)
fmt.Printf("Latency: %v\n", response.Latency)

streamChan, err := provider.PredictStream(ctx, req)
if err != nil {
    log.Fatal(err)
}

var fullContent string
for chunk := range streamChan {
    if chunk.Error != nil {
        log.Printf("Stream error: %v\n", chunk.Error)
        break
    }
    if chunk.Delta != "" {
        fullContent += chunk.Delta
        fmt.Print(chunk.Delta)
    }
    if chunk.Done {
        fmt.Printf("\n\nComplete! Tokens: %d\n", chunk.TokenCount)
    }
}

toolProvider := openai.NewOpenAIToolProvider(
    "openai",
    "gpt-4o-mini",
    "",
    openai.DefaultProviderDefaults(),
    false,
    nil,
)

// Define tools
toolDescs := []*providers.ToolDescriptor{
    {
        Name:        "get_weather",
        Description: "Get current weather for a location",
        InputSchema: json.RawMessage(`{
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }`),
    },
}

// Build tools in provider format
tools, err := toolProvider.BuildTooling(toolDescs)
if err != nil {
    log.Fatal(err)
}

// Execute with tools
req.Messages = []types.Message{
    {Role: "user", Content: "What's the weather in San Francisco?"},
}
response, toolCalls, err := toolProvider.PredictWithTools(ctx, req, tools, "auto")
if err != nil {
    log.Fatal(err)
}

// Process tool calls
for _, call := range toolCalls {
    fmt.Printf("Tool: %s\n", call.Name)
    fmt.Printf("Args: %s\n", call.Arguments)
}

// Create message with image
msg := types.Message{
    Role:    "user",
    Content: "What's in this image?",
    Parts: []types.ContentPart{
        {
            Type: "image",
            ImageURL: &types.ImageURL{
                URL: "data:image/jpeg;base64,/9j/4AAQSkZJRg...",
            },
        },
    },
}

req.Messages = []types.Message{msg}
response, err := provider.Predict(ctx, req)

// Manual cost calculation
costInfo := provider.CalculateCost(
    1000, // Input tokens
    500,  // Output tokens
    0,    // Cached tokens
)
fmt.Printf("Input cost: $%.6f\n", costInfo.InputCost)
fmt.Printf("Output cost: $%.6f\n", costInfo.OutputCost)
fmt.Printf("Total cost: $%.6f\n", costInfo.TotalCost)

// Custom pricing
customDefaults := providers.ProviderDefaults{
    Temperature: 0.8,
    TopP:        0.95,
    MaxTokens:   2000,
    Pricing: providers.Pricing{
        InputCostPer1K:  0.0001,
        OutputCostPer1K: 0.0002,
    },
}
provider := openai.NewOpenAIProvider(
    "custom-openai",
    "gpt-4o-mini",
    "",
    customDefaults,
    true, // Include raw output
)

OpenAI:

func DefaultProviderDefaults() ProviderDefaults {
    return ProviderDefaults{
        Temperature: 0.7,
        TopP:        1.0,
        MaxTokens:   2000,
        Pricing: Pricing{
            InputCostPer1K:  0.00015, // gpt-4o-mini
            OutputCostPer1K: 0.0006,
        },
    }
}

Claude:

func DefaultProviderDefaults() ProviderDefaults {
    return ProviderDefaults{
        Temperature: 0.7,
        TopP:        1.0,
        MaxTokens:   4096,
        Pricing: Pricing{
            InputCostPer1K:  0.003, // claude-3-5-sonnet
            OutputCostPer1K: 0.015,
        },
    }
}

Gemini:

func DefaultProviderDefaults() ProviderDefaults {
    return ProviderDefaults{
        Temperature: 0.7,
        TopP:        0.95,
        MaxTokens:   8192,
        Pricing: Pricing{
            InputCostPer1K:  0.000075, // gemini-1.5-flash
            OutputCostPer1K: 0.0003,
        },
    }
}

Ollama:

func DefaultProviderDefaults() ProviderDefaults {
    return ProviderDefaults{
        Temperature: 0.7,
        TopP:        0.9,
        MaxTokens:   2048,
        Pricing: Pricing{
            InputCostPer1K:  0.0, // Local inference - free
            OutputCostPer1K: 0.0,
        },
    }
}

vLLM:

func DefaultProviderDefaults() ProviderDefaults {
    return ProviderDefaults{
        Temperature: 0.7,
        TopP:        0.95,
        MaxTokens:   2048,
        Pricing: Pricing{
            InputCostPer1K:  0.0, // Self-hosted inference - free
            OutputCostPer1K: 0.0,
        },
    }
}

All providers support environment variable configuration:

  • OPENAI_API_KEY: OpenAI authentication
  • ANTHROPIC_API_KEY: Anthropic authentication
  • GEMINI_API_KEY: Google Gemini authentication
  • OPENAI_BASE_URL: Custom OpenAI-compatible endpoint
  • ANTHROPIC_BASE_URL: Custom Claude endpoint
  • GEMINI_BASE_URL: Custom Gemini endpoint
  • OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
  • VLLM_API_KEY: Optional vLLM authentication (if server configured with auth)
  • VLLM_BASE_URL: vLLM server URL (default: http://localhost:8000)
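
A small sketch of how the variables above might be consumed when wiring up a local Ollama provider, with an explicit OLLAMA_HOST fallback (whether the provider also reads this variable itself is implementation-dependent):

baseURL := os.Getenv("OLLAMA_HOST")
if baseURL == "" {
    baseURL = "http://localhost:11434" // documented default
}
provider := ollama.NewOllamaProvider(
    "ollama",
    "llama3.2:1b",
    baseURL,
    ollama.DefaultProviderDefaults(),
    false,
    nil,
)
defer provider.Close()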

PromptKit provides a flexible credential system for authenticating with LLM providers.

type Credential interface {
    // Apply adds authentication to an HTTP request
    Apply(ctx context.Context, req *http.Request) error

    // Type returns the credential type identifier
    Type() string
}
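
Because the interface is only two methods, custom schemes are easy to plug in. A hedged sketch of a static-header credential (not a built-in type; the struct and header names are illustrative):

// headerCredential injects a fixed header, e.g. for a proxy or gateway token.
type headerCredential struct {
    header string
    value  string
}

func (c *headerCredential) Apply(ctx context.Context, req *http.Request) error {
    req.Header.Set(c.header, c.value)
    return nil
}

func (c *headerCredential) Type() string { return "static-header" }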

Standard API key authentication (most providers):

cred := credentials.NewAPIKeyCredential("sk-your-key",
    credentials.WithHeaderName("Authorization"), // Default
    credentials.WithPrefix("Bearer "),           // Default
)

AWS SigV4 signing for Bedrock:

cred, err := credentials.NewAWSCredential(ctx, "us-west-2")
// Uses AWS SDK default credential chain:
// - IRSA (EKS)
// - IAM instance roles
// - AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY

GCP OAuth tokens for Vertex AI:

cred, err := credentials.NewGCPCredential(ctx, "us-central1", "my-project")
// Uses Application Default Credentials:
// - Workload Identity (GKE)
// - Service account keys
// - GOOGLE_APPLICATION_CREDENTIALS

Azure AD tokens for Azure AI:

cred, err := credentials.NewAzureCredential(ctx, "https://my-resource.openai.azure.com")
// Uses Azure SDK default credential chain:
// - Managed Identity
// - Azure CLI credentials
// - AZURE_CLIENT_ID / AZURE_TENANT_ID / AZURE_CLIENT_SECRET

The credentials.Resolve() function resolves credentials based on configuration:

import "github.com/AltairaLabs/PromptKit/runtime/credentials"
cfg := credentials.ResolverConfig{
ProviderType: "openai",
CredentialConfig: &config.CredentialConfig{
CredentialEnv: "MY_OPENAI_KEY", // Custom env var
},
}
cred, err := credentials.Resolve(ctx, cfg)
if err != nil {
log.Fatal(err)
}

Resolution Order:

  1. api_key - Direct API key value
  2. credential_file - Read from file path
  3. credential_env - Read from specified env var
  4. Default env vars - Provider-specific defaults

Default Environment Variables:

  • OpenAI: OPENAI_API_KEY, OPENAI_TOKEN
  • Claude: ANTHROPIC_API_KEY, CLAUDE_API_KEY
  • Gemini: GEMINI_API_KEY, GOOGLE_API_KEY

Platforms are hosting layers that wrap provider APIs with different authentication:

type PlatformConfig struct {
    Type             string                 // bedrock, vertex, azure
    Region           string                 // AWS/GCP region
    Project          string                 // GCP project ID
    Endpoint         string                 // Custom endpoint URL
    AdditionalConfig map[string]interface{} // Platform-specific options
}

platform:
  type: bedrock
  region: us-west-2

Model names are automatically mapped:

  • claude-3-5-sonnet-20241022 → anthropic.claude-3-5-sonnet-20241022-v2:0
platform:
  type: vertex
  region: us-central1
  project: my-gcp-project

platform:
  type: azure
  endpoint: https://my-resource.openai.azure.com

// Always close providers
provider := openai.NewOpenAIProvider(...)
defer provider.Close()

// Check for specific error types and back off on rate limits
response, err := provider.Predict(ctx, req)
if err != nil {
    if strings.Contains(err.Error(), "rate_limit_exceeded") {
        // Implement backoff before retrying
        time.Sleep(5 * time.Second)
        response, err = provider.Predict(ctx, req)
    }
    if err != nil {
        return err
    }
}

// Track costs across requests
var totalCost float64
for _, result := range results {
    totalCost += result.CostInfo.TotalCost
}
fmt.Printf("Total spend: $%.6f\n", totalCost)

// Use context timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
response, err := provider.Predict(ctx, req)

// Always drain channel
streamChan, err := provider.PredictStream(ctx, req)
if err != nil {
    return err
}
for chunk := range streamChan {
    if chunk.Error != nil {
        // Handle error but continue draining
        log.Printf("Error: %v", chunk.Error)
        continue
    }
    processChunk(chunk)
}

  • OpenAI: 200-500ms TTFT, 1-3s total for short responses
  • Claude: 300-600ms TTFT, similar total latency
  • Gemini: 150-400ms TTFT, faster for simple queries
  • Rate limits: Vary by provider and tier
    • OpenAI: 3,500-10,000 RPM
    • Claude: 4,000-50,000 RPM
    • Gemini: 2,000-15,000 RPM
  • Use mini/flash models for simple tasks
  • Implement caching for repeated queries (see the sketch after this list)
  • Use streaming for better UX (doesn’t reduce cost)
  • Monitor token usage and set appropriate MaxTokens
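
A minimal in-memory cache sketch for the caching tip above (the cacheKey helper and map-based store are illustrative; production code would want TTLs and bounds):

// cacheKey derives a stable key from the request; hashing the rendered
// prompt with crypto/sha256 is one reasonable choice.
func cacheKey(req providers.PredictionRequest) string {
    h := sha256.New()
    h.Write([]byte(req.System))
    for _, m := range req.Messages {
        h.Write([]byte(m.Role))
        h.Write([]byte(m.Content))
    }
    return hex.EncodeToString(h.Sum(nil))
}

var responseCache = map[string]providers.PredictionResponse{}

func predictCached(ctx context.Context, p providers.Provider, req providers.PredictionRequest) (providers.PredictionResponse, error) {
    key := cacheKey(req)
    if resp, ok := responseCache[key]; ok {
        return resp, nil // serve repeated queries from memory
    }
    resp, err := p.Predict(ctx, req)
    if err == nil {
        responseCache[key] = resp
    }
    return resp, err
}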

Unified interface for loading media content from various sources (inline data, storage references, file paths, URLs).

MediaLoader abstracts media access, allowing providers to load media transparently regardless of where it’s stored. This is essential for media externalization, where large media is stored on disk instead of being kept in memory.

import "github.com/AltairaLabs/PromptKit/runtime/providers"
loader := providers.NewMediaLoader(providers.MediaLoaderConfig{
StorageService: fileStore,
HTTPTimeout: 30 * time.Second,
MaxURLSizeBytes: 50 * 1024 * 1024, // 50 MB
})
type MediaLoader struct {
// Unexported fields
}

Creates a new MediaLoader with the specified configuration.

func NewMediaLoader(config MediaLoaderConfig) *MediaLoader

Parameters:

  • config - MediaLoaderConfig with storage service and options

Returns:

  • *MediaLoader - Ready-to-use media loader

Example:

import (
    "github.com/AltairaLabs/PromptKit/runtime/providers"
    "github.com/AltairaLabs/PromptKit/runtime/storage/local"
)

fileStore := local.NewFileStore(local.FileStoreConfig{
    BaseDir: "./media",
})

loader := providers.NewMediaLoader(providers.MediaLoaderConfig{
    StorageService:  fileStore,
    HTTPTimeout:     30 * time.Second,
    MaxURLSizeBytes: 50 * 1024 * 1024,
})

Configuration for MediaLoader instances.

type MediaLoaderConfig struct {
    StorageService  storage.MediaStorageService // Required for storage references
    HTTPTimeout     time.Duration               // Timeout for URL fetches
    MaxURLSizeBytes int64                       // Max size for URL content
}

Fields:

StorageService - Media storage backend

  • Required if loading from storage references
  • Typically a FileStore or cloud storage backend
  • Set to nil if not using media externalization

HTTPTimeout - HTTP request timeout for URLs

  • Default: 30 seconds
  • Applies to URL fetches only
  • Set to 0 for no timeout

MaxURLSizeBytes - Maximum size for URL content

  • Default: 50 MB
  • Prevents downloading huge files
  • Returns error if content larger

Loads media content from any source and returns base64-encoded data.

func (l *MediaLoader) GetBase64Data(
    ctx context.Context,
    media *types.MediaContent,
) (string, error)

Parameters:

  • ctx - Context for cancellation and timeout
  • media - MediaContent with one or more sources

Returns:

  • string - Base64-encoded media data
  • error - Load errors (not found, timeout, size limit, etc.)

Source Priority:

Media is loaded from the first available source in this order:

  1. Data - Inline base64 data (if present)
  2. StorageReference - External storage (requires StorageService)
  3. FilePath - Local file system path
  4. URL - HTTP/HTTPS URL (with timeout and size limits)

Example:

// Load from any source
data, err := loader.GetBase64Data(ctx, media)
if err != nil {
    log.Printf("Failed to load media: %v", err)
    return err
}

// Use the data
fmt.Printf("Loaded %d bytes\n", len(data))

// Media with inline data
media := &types.MediaContent{
    Type:     "image",
    MimeType: "image/png",
    Data:     "iVBORw0KGgoAAAANSUhEUg...", // Base64
}
data, err := loader.GetBase64Data(ctx, media)
// Returns media.Data immediately (already inline)

// Media externalized to storage
media := &types.MediaContent{
    Type:     "image",
    MimeType: "image/png",
    StorageReference: &storage.StorageReference{
        ID:      "abc123-def456-ghi789",
        Backend: "file",
    },
}
data, err := loader.GetBase64Data(ctx, media)
// Loads from disk via StorageService

// Media from local file
media := &types.MediaContent{
    Type:     "image",
    MimeType: "image/jpeg",
    FilePath: "/path/to/image.jpg",
}
data, err := loader.GetBase64Data(ctx, media)
// Reads file and converts to base64

// Media from HTTP URL
media := &types.MediaContent{
    Type:     "image",
    MimeType: "image/png",
    URL:      "https://example.com/image.png",
}
data, err := loader.GetBase64Data(ctx, media)
// Fetches URL with timeout and size checks

// Provider using MediaLoader
type MyProvider struct {
    mediaLoader *providers.MediaLoader
}

func (p *MyProvider) Predict(ctx context.Context, req providers.PredictionRequest) (providers.PredictionResponse, error) {
    // Load media from message parts
    for _, msg := range req.Messages {
        for _, part := range msg.Parts {
            if part.Media != nil {
                // Load media transparently
                data, err := p.mediaLoader.GetBase64Data(ctx, part.Media)
                if err != nil {
                    return providers.PredictionResponse{}, err
                }
                // Use data in API call
                // ...
            }
        }
    }
    // Call LLM API
    // ...
}

MediaLoader returns specific errors:

data, err := loader.GetBase64Data(ctx, media)
if err != nil {
    switch {
    case errors.Is(err, providers.ErrNoMediaSource):
        // No source available (no Data, StorageReference, FilePath, or URL)
    case errors.Is(err, providers.ErrMediaNotFound):
        // Storage reference or file path not found
    case errors.Is(err, providers.ErrMediaTooLarge):
        // URL content exceeds MaxURLSizeBytes
    case errors.Is(err, context.DeadlineExceeded):
        // HTTP timeout or context cancelled
    default:
        // Other errors (permission, network, etc.)
    }
}

MediaLoader does not cache loaded media. For repeated access:

// Cache loaded media yourself
mediaCache := make(map[string]string)

data, ok := mediaCache[media.StorageReference.ID]
if !ok {
    var err error
    data, err = loader.GetBase64Data(ctx, media)
    if err != nil {
        return err
    }
    mediaCache[media.StorageReference.ID] = data
}

For loading multiple media items in parallel:

type loadResult struct {
    data string
    err  error
}

// Load media concurrently
results := make([]loadResult, len(mediaItems))
var wg sync.WaitGroup
for i, media := range mediaItems {
    wg.Add(1)
    go func(idx int, m *types.MediaContent) {
        defer wg.Done()
        data, err := loader.GetBase64Data(ctx, m)
        results[idx] = loadResult{data, err}
    }(i, media)
}
wg.Wait()

// Check results
for i, result := range results {
    if result.err != nil {
        log.Printf("Failed to load media %d: %v", i, result.err)
    }
}

✅ Do:

  • Create one MediaLoader per application (reuse)
  • Set reasonable HTTP timeout (30s is good default)
  • Set MaxURLSizeBytes to prevent abuse
  • Handle errors gracefully (media may be unavailable)
  • Use context for cancellation support

❌ Don’t:

  • Don’t create MediaLoader per request (expensive)
  • Don’t ignore errors (media may be corrupted/missing)
  • Don’t set timeout too low (large images take time)
  • Don’t allow unlimited URL sizes (DoS risk)
  • Don’t cache without bounds (memory leak; see the bounded-cache sketch below)
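
For the bounded-cache point above, a very small cap can be enforced with a plain map and an eviction check; a rough sketch (the 100-entry cap and cachePut helper are arbitrary and illustrative):

const maxCachedItems = 100

var mediaCache = map[string]string{}

func cachePut(key, data string) {
    if len(mediaCache) >= maxCachedItems {
        // Evict an arbitrary entry; a real implementation would use LRU eviction.
        for k := range mediaCache {
            delete(mediaCache, k)
            break
        }
    }
    mediaCache[key] = data
}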

MediaLoader and media storage work together:

// 1. Set up storage
fileStore := local.NewFileStore(local.FileStoreConfig{
    BaseDir: "./media",
})

// 2. Create loader with storage
loader := providers.NewMediaLoader(providers.MediaLoaderConfig{
    StorageService: fileStore,
})

// 3. Use in SDK
manager, _ := sdk.NewConversationManager(
    sdk.WithProvider(provider),
    sdk.WithMediaStorage(fileStore), // Externalizes media
)

// 4. Provider automatically uses loader
// Media externalized by MediaExternalizer middleware
// Provider loads via MediaLoader when needed
// Application code remains unchanged

Flow:

1. User sends image → inline Data
2. LLM returns generated image → inline Data
3. MediaExternalizer → externalizes to storage, clears Data, adds StorageReference
4. State saved → only reference in Redis/Postgres
5. Next turn: Provider needs image → MediaLoader loads from StorageReference
6. Transparent to application code