# Context Management Example

This example demonstrates Arena's context management capabilities for handling long conversations with token budget constraints.
## Overview

Context management prevents conversations from exceeding provider token limits by intelligently truncating or managing message history. This is critical for:
- Long conversations: sessions that run to 20+ turns
- Cost optimization: Reduce the number of tokens sent to the provider
- Provider limits: Respect context window sizes (GPT-4: 128k, Claude: 200k)
- Realistic testing: Test behavior under production constraints
## Configuration

Context management is configured in the scenario YAML:

```yaml
context_policy:
  token_budget: 50000        # Max tokens for entire context
  reserve_for_output: 4000   # Reserve tokens for response
  strategy: "oldest"         # Truncation strategy
  cache_breakpoints: true    # Enable prompt caching (Anthropic)
```

### Token Budget
- `token_budget`: Maximum tokens for the full context (system prompt + messages)
- `reserve_for_output`: Tokens reserved for the model response
- Available for messages: `token_budget - reserve_for_output - system_prompt_tokens`
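As a quick illustration of that arithmetic, the Go sketch below plugs in the example configuration from above; the system-prompt size is a made-up figure.

```go
package main

import "fmt"

func main() {
	// Values from the example configuration above.
	tokenBudget := 50000
	reserveForOutput := 4000
	systemPromptTokens := 1200 // hypothetical system prompt size

	// Tokens left for conversation messages.
	available := tokenBudget - reserveForOutput - systemPromptTokens
	fmt.Println(available) // 44800
}
```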
## Truncation Strategies

- `oldest` (default): Drop oldest messages first (see the sketch after this list)
  - Simple and predictable
  - Keeps recent context
  - Best for most use cases
- `fail`: Error if budget exceeded
  - Strict mode for testing
  - Ensures no data loss
  - Good for validation
- `summarize` (future): Compress old messages
  - Uses an LLM to create summaries
  - Preserves more information
  - Higher latency
- `relevance` (future): Drop least relevant messages
  - Uses embeddings for relevance scoring
  - Keeps important context
  - Requires an embedding model
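The sketch below shows what an oldest-first pass can look like; the `Message` type and `countTokens` parameter are hypothetical stand-ins, not Arena's actual types.

```go
// Message is a hypothetical stand-in for a conversation message.
type Message struct {
	Role    string
	Content string
}

// truncateOldest drops messages from the front of the history until the
// estimated total fits within the available token budget.
func truncateOldest(messages []Message, budget int, countTokens func(Message) int) []Message {
	total := 0
	for _, m := range messages {
		total += countTokens(m)
	}
	for len(messages) > 0 && total > budget {
		total -= countTokens(messages[0]) // oldest message is dropped first
		messages = messages[1:]
	}
	return messages
}
```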
## Cache Breakpoints (Anthropic Only)

When `cache_breakpoints: true` is set, Arena inserts cache markers for Anthropic's prompt caching:
- System prompt is marked for caching
- Subsequent turns reuse cached prompt
- 90% cost reduction on cached tokens
- Only works with Anthropic Claude models
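For orientation, Anthropic's prompt caching works by attaching a `cache_control` marker to a content block in the request. The Go sketch below hand-builds that request shape (the model name is a placeholder); it illustrates the wire format only, not Arena's internal code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Minimal illustration of a cache breakpoint on the system prompt:
	// the block carries a cache_control marker so later turns can reuse
	// the cached prefix. The model name is a placeholder.
	req := map[string]any{
		"model":      "claude-model-id",
		"max_tokens": 1024,
		"system": []map[string]any{
			{
				"type":          "text",
				"text":          "You are a helpful astronomy tutor.",
				"cache_control": map[string]string{"type": "ephemeral"},
			},
		},
		"messages": []map[string]any{
			{"role": "user", "content": "What are the planets?"},
		},
	}
	body, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(body))
}
```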
## Scenarios

### 1. Unlimited Context (Baseline)

File: `scenarios/context-unlimited.yaml`
No context policy = unlimited context (backward compatible).
Purpose: Baseline to compare against limited scenarios.
```yaml
# No context_policy specified
turns:
  - role: user
    content: "First message..."
  # ... many turns
  - role: user
    content: "What did I first ask?"  # Full context available
```

### 2. Limited with Oldest Strategy

File: `scenarios/context-limited-oldest.yaml`
Very low budget (500 tokens) to force truncation.
Purpose: Verify oldest messages are dropped when over budget.
Expected behavior:
- Early turns (1-4) are dropped
- Recent turns (5-7) are kept
- Last turn asking “What’s my name?” should fail (name was in turn 1)
### 3. Fail on Budget Exceeded

File: `scenarios/context-limited-fail.yaml`

Strict mode with `strategy: "fail"` (sketched below).
Purpose: Verify that execution fails with an error when the budget is exceeded.
Expected behavior:
- First few turns succeed
- Later turn triggers error: “token budget exceeded”
- No truncation occurs
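The check this strategy implies might look roughly like the sketch below; the function name is hypothetical, but the error mirrors the message shown in the expected output.

```go
import "fmt"

// enforceBudget returns an error instead of truncating when the estimated
// context size exceeds the configured budget.
func enforceBudget(have, budget int) error {
	if have > budget {
		return fmt.Errorf("token budget exceeded: have %d, budget %d", have, budget)
	}
	return nil
}
```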
### 4. With Caching (Anthropic)

File: `scenarios/context-with-caching.yaml`

Enable Anthropic prompt caching with `cache_breakpoints: true`.
Purpose: Verify cache breakpoints reduce costs.
Expected behavior:
- Turn 1: Full cost (cache miss)
- Turn 2+: Reduced cost (cache hit on system prompt)
- Cost breakdown shows cached tokens
## Running the Example

```sh
# Run all scenarios
cd examples/context-management
promptarena run arena.yaml

# Run specific scenario
promptarena run arena.yaml --scenario context-limited-oldest

# Run with specific provider
promptarena run arena.yaml --provider anthropic-claude-sonnet
```

## Expected Output

### Unlimited Context
```
Scenario: context-unlimited
✓ Turn 1: "Tell me about the solar system"
✓ Turn 2: "What are the inner planets?"
...
✓ Turn 5: "What did I first ask you about?"
  Response: "You first asked about the solar system."
  Context: 5/5 messages kept
  Cost: $0.0234
```

### Limited with Oldest Strategy
```
Scenario: context-limited-oldest
⚠ Turn 1-4: Dropped (over budget)
✓ Turn 5: "What about Saturn's rings?"
✓ Turn 6: "Which planet has the most moons?"
✓ Turn 7: "What's my name again?"
  Response: "I don't have that information in our conversation."
  Context: 3/7 messages kept (4 dropped)
  Cost: $0.0089 (62% reduction)
```

### Fail on Budget Exceeded
```
Scenario: context-limited-fail
✓ Turn 1: "Tell me about the solar system"
✓ Turn 2: "What are all the planets?"
✗ Turn 3: Error - token budget exceeded: have 387, budget 300
  Context: Failed before execution
  Cost: $0.0056
```

### With Caching
```
Scenario: context-with-caching
✓ Turn 1: "What are the planets?"
  Cost: $0.0124 (0 cached)
✓ Turn 2: "Tell me about Mercury"
  Cost: $0.0018 (1,234 cached) - 85% reduction
✓ Turn 3: "What about Venus?"
  Cost: $0.0019 (1,234 cached) - 85% reduction
...
Total Cost: $0.0234 (avg 73% reduction from caching)
```

## Implementation Details
### How It Works

- Configuration: Scenario specifies a `context_policy`
- Pipeline Integration: Context middleware is inserted before the Provider middleware (see the sketch below)
- Token Counting: Simple word-based estimator (words * 1.3)
- Truncation: Applied before each turn execution
- Metadata: Truncation info stored in the execution context
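A rough sketch of that wiring is shown below. It reuses the `Message` and `truncateOldest` sketches from earlier and records the metadata keys listed under Observability; the `ExecContext` and `Handler` shapes are hypothetical stand-ins, not Arena's real middleware API.

```go
// Hypothetical execution-context and handler shapes for illustration.
type ExecContext struct {
	Messages []Message
	Metadata map[string]any
}

type Handler func(ctx *ExecContext) error

// contextMiddleware enforces the token budget before the provider call
// and records what it did in the execution metadata.
func contextMiddleware(budget int, countTokens func(Message) int, next Handler) Handler {
	return func(ctx *ExecContext) error {
		original := len(ctx.Messages)
		ctx.Messages = truncateOldest(ctx.Messages, budget, countTokens)
		if kept := len(ctx.Messages); kept < original {
			ctx.Metadata["context_truncated"] = true
			ctx.Metadata["context_original_count"] = original
			ctx.Metadata["context_truncated_count"] = kept
			ctx.Metadata["context_dropped_count"] = original - kept
		}
		return next(ctx)
	}
}
```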
### Pipeline Order

```
Template Middleware
    ↓
Context Middleware (NEW)   ← Token budget enforcement
    ↓
Provider Middleware
    ↓
Validator Middleware
```

### Token Counting
The current implementation uses simple word-based estimation:
- Split text into words
- Multiply by 1.3 (accounts for subword tokens)
- Not accurate, but good enough for testing (see the sketch below)
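A sketch of that estimator (the function name is illustrative):

```go
import (
	"math"
	"strings"
)

// estimateTokens approximates a token count by splitting on whitespace
// and multiplying by 1.3 to account for subword tokenization.
func estimateTokens(text string) int {
	words := len(strings.Fields(text))
	return int(math.Ceil(float64(words) * 1.3))
}
```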
For production, use:
- `tiktoken` for OpenAI models
- Anthropic tokenizer for Claude
- Provider-specific tokenizers
## Observability

Context truncation is tracked in execution metadata:

```go
execCtx.Metadata["context_truncated"] = true
execCtx.Metadata["context_original_count"] = 7
execCtx.Metadata["context_truncated_count"] = 3
execCtx.Metadata["context_dropped_count"] = 4
```

This appears in Arena output:
```
Context Management:
  Original: 7 messages
  Kept: 3 messages
  Dropped: 4 messages
  Strategy: oldest
  Budget: 500 tokens
```

## Cost Comparison
Expected cost differences:
| Scenario | Tokens | Cost | vs Unlimited |
|---|---|---|---|
| Unlimited | ~2,500 | $0.0234 | baseline |
| Limited (oldest) | ~950 | $0.0089 | -62% |
| With Caching (turn 1) | 1,234 | $0.0124 | -47% |
| With Caching (turn 2+) | 145 + 1,234 cached | $0.0018 | -92% |
## Future Enhancements

- Summarization Strategy: Use an LLM to compress old messages
- Relevance Strategy: Use embeddings to keep relevant messages
- Accurate Token Counting: Use tiktoken for OpenAI
- Per-Turn Budget: Override budget for specific turns
- Dynamic Budget: Adjust based on response needs
## Testing Context Management

Use this example to test:
- Truncation Logic: Does oldest strategy work correctly?
- Budget Enforcement: Does fail strategy error appropriately?
- Cost Reduction: Do limited scenarios save money?
- Caching: Does Anthropic caching reduce costs?
- Context Loss: Do models handle missing context gracefully?