# Ollama Local LLM Example
This example demonstrates how to use PromptArena with Ollama for local LLM inference. No API keys required!
## Prerequisites

- Docker and Docker Compose installed (for local use), OR Ollama running on Kubernetes
- PromptArena CLI (`arena`) installed
## Quick Start

### Option A: Local Docker Setup

#### 1. Start Ollama with Docker Compose

```bash
cd examples/ollama-local
docker compose up -d
```

This will:
- Start the Ollama server on port 11434
- Automatically pull the `llama3.2:1b` model (small, ~1.3GB)
Wait for the model to download (check with `docker compose logs -f ollama-pull`).
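For reference, a minimal `docker-compose.yaml` consistent with the commands above might look like the sketch below. This is an assumption about the file shipped in `examples/ollama-local` (the service names, volume name, and pull job are inferred from the commands in this guide), so treat the example's own file as authoritative.

```yaml
# Hypothetical sketch -- the file in examples/ollama-local may differ in detail.
services:
  ollama:
    image: ollama/ollama              # Official Ollama server image
    ports:
      - "11434:11434"                 # API endpoint used by the provider configs
    volumes:
      - ollama-models:/root/.ollama   # Downloaded models; removed by `docker compose down -v`
    # GPU support: uncomment the deploy block shown under "GPU Acceleration" below

  ollama-pull:
    image: ollama/ollama
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434             # Point the CLI at the server container
    entrypoint: ["ollama", "pull", "llama3.2:1b"]   # One-shot job that fetches the model
    restart: "no"

volumes:
  ollama-models:
```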
#### 2. Verify Ollama is Running

```bash
curl http://localhost:11434/api/tags
```

You should see the `llama3.2:1b` model listed.
#### 3. Run the Arena Tests

```bash
arena run config.arena.yaml
```

### Option B: Kubernetes Setup

If you have Ollama running on Kubernetes:
#### 1. Set the Ollama endpoint

```bash
export OLLAMA_BASE_URL=http://ollama.default.svc.cluster.local:11434

# Or use port-forward for local testing:
# kubectl port-forward svc/ollama 11434:11434
# export OLLAMA_BASE_URL=http://localhost:11434
```

#### 2. Run the Kubernetes config

```bash
arena run config.arena.k8s.yaml
```

## Configuration
### Provider Configuration

The Ollama provider is configured in `providers/ollama-llama.provider.yaml`:

```yaml
spec:
  type: ollama
  model: llama3.2:1b
  base_url: "http://localhost:11434"
  additional_config:
    keep_alive: "5m"  # Keep model loaded for 5 minutes
```

For Kubernetes, use `providers/ollama-k8s.provider.yaml`, which supports environment variable substitution:

```yaml
spec:
  type: ollama
  model: llama3.2:1b
  base_url: "${OLLAMA_BASE_URL:-http://ollama.default.svc.cluster.local:11434}"
```

### Using Different Models
To use a different model:
1. Pull the model:

   ```bash
   # Local Docker
   docker compose exec ollama ollama pull <model-name>

   # Kubernetes
   kubectl exec -it deployment/ollama -- ollama pull <model-name>
   ```

2. Update the provider config:

   ```yaml
   model: <model-name>
   ```
Popular models:

- `llama3.2:1b` - Smallest, fastest (~1.3GB)
- `llama3.2:3b` - Good balance (~2GB)
- `llama3.1:8b` - Better quality (~4.7GB)
- `mistral:7b` - Strong performance (~4.1GB)
- `codellama:7b` - Optimized for code (~3.8GB)
- `llava:7b` - Vision + language (~4.5GB)
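For example, to point the local provider at `mistral:7b`, edit the provider file using the same schema shown under "Provider Configuration" (only the `model` field changes; the model must already be pulled):

```yaml
# providers/ollama-llama.provider.yaml (or a copy of it), switched to mistral:7b
spec:
  type: ollama
  model: mistral:7b
  base_url: "http://localhost:11434"
  additional_config:
    keep_alive: "5m"
```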
## GPU Acceleration

For NVIDIA GPU support, uncomment the GPU section in `docker-compose.yaml`:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```

## Scenarios
Section titled “Scenarios”basic-chat
Section titled “basic-chat”Simple conversation testing basic Q&A capabilities.
### code-generation

Tests code generation in Python and Go.
### streaming-verification

Validates that streaming responses work correctly:
- Tests long-form content generation with streaming
- Verifies chunk boundaries with multi-paragraph responses
- Tests code block streaming
- Validates short responses after long ones
### multimodal-vision (requires vision model)

Tests image analysis capabilities with vision models like LLaVA:
- Basic image description
- Context retention across turns
- Multi-image comparison
- Detailed structured analysis
To enable multimodal testing:
1. Pull a vision model:

   ```bash
   # Local
   docker compose exec ollama ollama pull llava:7b

   # Kubernetes
   kubectl exec -it deployment/ollama -- ollama pull llava:7b
   ```

2. Uncomment the vision provider and scenario in `config.arena.yaml` (a sketch of the vision provider file follows these steps):

   ```yaml
   providers:
     - file: providers/ollama-llama.provider.yaml
     - file: providers/ollama-vision-local.provider.yaml # Uncomment

   scenarios:
     - file: scenarios/basic-chat.scenario.yaml
     - file: scenarios/code-generation.scenario.yaml
     - file: scenarios/streaming-verification.scenario.yaml
     - file: scenarios/multimodal-vision.scenario.yaml # Uncomment
   ```
3. Run with the vision provider:

   ```bash
   arena run config.arena.yaml --provider ollama-vision-local
   ```
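The contents of `providers/ollama-vision-local.provider.yaml` are not reproduced in this guide; assuming it follows the same provider schema as the text provider above, it would look roughly like this:

```yaml
# Assumed sketch of providers/ollama-vision-local.provider.yaml --
# only the model differs from the text provider; check the shipped file.
spec:
  type: ollama
  model: llava:7b
  base_url: "http://localhost:11434"
  additional_config:
    keep_alive: "5m"
```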
## Running Specific Scenarios

Run individual scenarios:

```bash
# Just streaming tests
arena run config.arena.yaml --scenario streaming-verification

# Just multimodal tests (requires vision model)
arena run config.arena.yaml --scenario multimodal-vision --provider ollama-vision-local
```

## Cleanup

```bash
docker compose down -v  # -v removes the volume with downloaded models
```

## Troubleshooting
### Model not found

```bash
docker compose exec ollama ollama pull llama3.2:1b
```

### Slow responses
The first request loads the model into memory; subsequent requests are faster. Use `keep_alive` to keep the model loaded between requests.
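For example, raising the value in the provider file keeps the model resident for longer between requests:

```yaml
# providers/ollama-llama.provider.yaml
additional_config:
  keep_alive: "30m"  # Keep the model loaded in memory for 30 minutes after each request
```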
### Out of memory

Try a smaller model like `llama3.2:1b` or `phi3:mini`.
### Streaming not working

Ensure the `streaming: true` setting is in your config defaults or scenario.
Check Ollama logs for any SSE-related errors:
```bash
# Local
docker compose logs ollama

# Kubernetes
kubectl logs deployment/ollama
```

### Multimodal/Vision errors
- Ensure you’re using a vision model (`llava:7b`, `bakllava:7b`, `llama3.2-vision:11b`)
- Standard text models like `llama3.2:1b` do not support image inputs
- Check that test images exist at the referenced paths
### Pipeline timeouts (30s limit)

The arena pipeline has a 30-second timeout per turn. For cold starts (the first request after a model swap):
- Run with `-j 1` (single concurrency) to keep models warm
- Warm up the model before tests:

  ```bash
  curl http://localhost:11434/api/generate -d '{"model": "llava:7b", "prompt": "Hi", "stream": false}'
  ```

- Keep prompts concise to reduce response time
## Known Limitations

### Token/Cost Metrics Not Available

Ollama’s API does not return token usage information in streaming mode. This means:
- Token counts will show as 0 in the HTML report
- Cost estimates will show as $0.0000
- Per-turn latency is tracked correctly (measured at the pipeline level)
This is a limitation of the Ollama API, not PromptArena. For token-based metrics, consider using providers that return usage information (OpenAI, Anthropic, etc.) or use Ollama’s non-streaming API (not recommended for arena testing).