vLLM Local Inference Example
This example demonstrates how to use PromptArena with vLLM for high-performance local LLM inference. vLLM is a fast and memory-efficient inference engine that provides OpenAI-compatible API endpoints.
Prerequisites
- Docker and Docker Compose installed (for local setup) OR vLLM running on Kubernetes
- PromptArena CLI (arena) installed
- NVIDIA GPU with CUDA support (recommended for performance; CPU mode is available but slower)
Quick Start
Option A: Local Docker Setup
1. Start vLLM with Docker Compose
cd examples/vllm-local
docker compose up -d

This will:
- Start the vLLM server on port 8000
- Automatically download and serve the facebook/opt-125m model (small, ~250MB for testing)
- Expose an OpenAI-compatible API endpoint
Note: The first startup will take some time to download the model. You can monitor progress with:
docker compose logs -f vllm

2. Verify vLLM is Running
curl http://localhost:8000/v1/models

You should see the model listed in the response.
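For the default facebook/opt-125m model, the response is an OpenAI-style model list, roughly like the following (vLLM includes some additional fields, and values such as created will differ):

{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1717000000,
      "owned_by": "vllm"
    }
  ]
}

If you only need a readiness probe, the vLLM OpenAI-compatible server also exposes a plain health endpoint:

curl http://localhost:8000/health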
3. Run the Arena Tests
arena run config.arena.yaml

This will:
- Execute test scenarios against the local vLLM instance
- Compare responses from the model
- Generate test results in the out/ directory
Option B: Kubernetes Setup
If you have vLLM running on Kubernetes:
1. Set the vLLM endpoint
export VLLM_BASE_URL=http://vllm.default.svc.cluster.local:8000
# Or use port-forward for local testing:
# kubectl port-forward svc/vllm 8000:8000
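If you use the port-forward, you can verify connectivity the same way as in the local setup before running the tests:

curl http://localhost:8000/v1/models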
2. Run with Kubernetes config

arena run config.arena.k8s.yaml

Configuration Files
Providers
Section titled “Providers”providers/vllm-local.provider.yaml- Basic vLLM provider for local Docker setupproviders/vllm-k8s.provider.yaml- vLLM provider for Kubernetes deploymentproviders/vllm-guided-decoding.provider.yaml- Example with guided decoding for structured output
Prompts
Section titled “Prompts”prompts/assistant.prompt.yaml- General assistant promptprompts/code-helper.prompt.yaml- Code generation assistant
Scenarios
Section titled “Scenarios”scenarios/basic-chat.scenario.yaml- Simple conversation testscenarios/code-generation.scenario.yaml- Code generation scenariosscenarios/guided-output.scenario.yaml- Structured output with guided decoding
Using Different Models
Small Models (Good for Testing)
# In providers/vllm-local.provider.yaml
spec:
  model: facebook/opt-125m   # ~250MB
  # or:
  # model: facebook/opt-1.3b # ~2.5GB

Production Models
Section titled “Production Models”spec: model: meta-llama/Llama-2-7b-chat-hf # ~13GB # or model: mistralai/Mistral-7B-Instruct-v0.1 # ~14GB # or model: meta-llama/Llama-2-13b-chat-hf # ~26GB (requires more GPU memory)Note: Larger models require more GPU memory. Ensure your GPU has sufficient VRAM:
- 7B models: ~14GB VRAM
- 13B models: ~26GB VRAM
- 70B models: Multiple GPUs with tensor parallelism

These figures assume fp16/bf16 weights (roughly 2 bytes per parameter), plus extra headroom for the KV cache and activations.
Update the docker-compose.yaml
services:
  vllm:
    command:
      - --model
      - meta-llama/Llama-2-7b-chat-hf  # Change this
      - --dtype
      - auto
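After editing docker-compose.yaml, recreate the container so the new model is downloaded and loaded:

docker compose up -d
docker compose logs -f vllm   # watch the new model download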
Advanced Features

Guided Decoding for Structured Output
vLLM supports guided decoding to ensure outputs match specific formats (JSON Schema, regex, grammar, or choices):
# providers/vllm-guided-decoding.provider.yaml
spec:
  additional_config:
    guided_json:
      type: object
      properties:
        sentiment:
          type: string
          enum: ["positive", "negative", "neutral"]
        confidence:
          type: number
          minimum: 0
          maximum: 1
      required: ["sentiment", "confidence"]

See scenarios/guided-output.scenario.yaml for a complete example.
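With this configuration, responses are constrained to valid JSON matching the schema above, for example (values are illustrative):

{"sentiment": "positive", "confidence": 0.92}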
Beam Search
Enable beam search for potentially better quality outputs:
spec:
  additional_config:
    use_beam_search: true
    best_of: 3

Authentication (Optional)
If your vLLM instance requires authentication:
spec:
  additional_config:
    api_key: "${VLLM_API_KEY}"

Then set the environment variable:
export VLLM_API_KEY="your-api-key"
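The server only enforces a key if it was started with one. If you run vLLM yourself via Docker Compose, one way to wire this up (assuming vLLM's --api-key server flag) is to pass the same variable through:

services:
  vllm:
    command:
      - --model
      - facebook/opt-125m
      - --api-key
      - ${VLLM_API_KEY}   # substituted by Docker Compose from your shell environment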
Performance Optimization

GPU Configuration
For better performance with larger models, you can configure vLLM’s GPU settings in docker-compose.yaml:
services:
  vllm:
    command:
      - --model
      - meta-llama/Llama-2-7b-chat-hf
      - --tensor-parallel-size
      - "2"    # Use 2 GPUs
      - --gpu-memory-utilization
      - "0.9"  # Use 90% of GPU memory
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2  # Number of GPUs
              capabilities: [gpu]

Connection Pooling
For high-throughput scenarios, PromptKit automatically uses connection pooling. You can run parallel tests:
arena run config.arena.yaml --parallel 10

Cost Tracking
vLLM is self-hosted, so costs are $0 by default. However, you can configure custom costs for internal accounting:
spec:
  pricing:
    input_cost_per_1k: 0.001  # $0.001 per 1K input tokens
    output_cost_per_1k: 0.002 # $0.002 per 1K output tokens
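With these rates, a run that consumes 10,000 input tokens and 5,000 output tokens would be accounted as:

10,000 input tokens:  10 × $0.001 = $0.010
 5,000 output tokens:  5 × $0.002 = $0.010
Total:                               $0.020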
Troubleshooting

vLLM Not Starting
Check GPU availability:
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Model Download Issues
vLLM downloads models from HuggingFace. If you have authentication issues:
export HUGGING_FACE_HUB_TOKEN="your-token"

And update docker-compose.yaml:
services:
  vllm:
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}

Out of Memory
Reduce model size or GPU memory utilization:
command:
  - --gpu-memory-utilization
  - "0.7"  # Reduce from default 0.9
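If that is still not enough, shrinking the maximum context length also reduces the memory vLLM reserves for the KV cache (this uses vLLM's --max-model-len flag; the value below is only an example):

command:
  - --max-model-len
  - "2048"  # a smaller context window means a smaller KV cache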
Connection Refused

Ensure vLLM is fully started:
docker compose logs vllm
# Wait for: "Application startup complete"
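To block a script until the server is actually ready, you can poll it in a loop (this assumes the /health endpoint of vLLM's OpenAI-compatible server):

until curl -sf http://localhost:8000/health; do
  echo "waiting for vLLM..."
  sleep 5
done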
Resources

Next Steps
Section titled “Next Steps”- Try different models from HuggingFace
- Implement custom scenarios for your use cases
- Compare vLLM performance with cloud providers using Arena
- Set up vLLM on Kubernetes for production workloads