Test RAG agents with the standard primitive suite
This how-to walks through examples/rag-agent/: the full named RAG primitive suite (faithfulness, answer_relevancy, contextual_precision, contextual_recall, contextual_relevancy, hallucination) exercised as scenario assertions. This is the vocabulary that buyers coming from DeepEval or Ragas expect, wired into PromptArena.
What it proves
RAG eval frameworks compete on the primitive catalog: faithfulness, answer relevancy, contextual recall, hallucination. PromptArena ships those primitives in runtime/evals/handlers/ (added in #1145) as thin wrappers over llm_judge with hardened default prompts adapted from public DeepEval / Ragas references (Apache 2.0).
Each primitive is invokable as a scenario assertion with min_score. The demo runs all six against a single question + answer + retrieved context, with a mock LLM judge for keyless CI.
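The "thin wrapper over llm_judge" idea can be sketched in a few lines. This is a minimal illustration, not the code in runtime/evals/handlers/: the names `llm_judge`, `make_primitive`, and `fixed_judge`, the prompt text, and the fixed score are all hypothetical stand-ins chosen for the example.

```python
import json
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns a JSON verdict as text.
Judge = Callable[[str], str]


def fixed_judge(prompt: str) -> str:
    """Keyless stand-in for a judge provider; returns the same verdict every time."""
    return json.dumps({"score": 0.9, "reasoning": "stubbed verdict"})


def llm_judge(judge: Judge, prompt: str) -> dict:
    """Generic judge step: send the prompt and parse the JSON verdict."""
    return json.loads(judge(prompt))


def make_primitive(default_prompt: str):
    """Build a named primitive that only contributes its default prompt."""
    def primitive(judge: Judge, question: str, answer: str, contexts: list) -> dict:
        prompt = default_prompt.format(
            question=question, answer=answer, contexts="\n".join(contexts)
        )
        return llm_judge(judge, prompt)
    return primitive


# Placeholder prompt text (an assumption), not the shipped default prompt.
faithfulness = make_primitive(
    "Is every claim in the answer supported by the context?\n"
    "Context:\n{contexts}\nQuestion: {question}\nAnswer: {answer}"
)

verdict = faithfulness(
    fixed_judge,
    "What is the capital of France?",
    "Paris.",
    ["Paris is the capital and most populous city of France."],
)
print(verdict["score"])  # 0.9
```

In this sketch each primitive differs only in the prompt it supplies, which is why swapping the judge changes nothing about the assertion itself.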
The assertion shape
    turns:
      - role: user
        content: "What is the capital of France?"
        assertions:
          - type: faithfulness
            params:
              contexts:
                - "Paris is the capital and most populous city of France."
                - "Located on the Seine River in north-central France."
              judge: rag-judge
              min_score: 0.8

          - type: answer_relevancy
            params:
              judge: rag-judge
              min_score: 0.8

          - type: contextual_precision
            params:
              contexts: [...]
              judge: rag-judge
              min_score: 0.5

          # ... contextual_recall, contextual_relevancy, hallucination

All six assertions share the same shape: the eval handler does the work; the assertion config supplies thresholds and (for the chunk-based primitives) the context.
Three context sources
Every RAG handler accepts retrieved context in three forms (preference order):
- contexts: [...]: the canonical inline list (used in the demo).
- context: "...": single-chunk shorthand.
- context_field: <metadata-key>: look up the chunks from evalCtx.Metadata. Use this when a retrieval tool writes results to metadata at runtime; the assertion reads them back.
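The preference order above can be pictured as a small resolver. The following is a sketch under stated assumptions, not PromptArena's handler code: `resolve_contexts` is a hypothetical helper, a plain dict stands in for evalCtx.Metadata, and accepting either a string or a list from metadata is a guess about reasonable behavior.

```python
from typing import Any


def resolve_contexts(params: dict, metadata: dict) -> list:
    """Return retrieved chunks, trying the three forms in preference order."""
    # 1. Canonical inline list wins when present.
    if params.get("contexts"):
        return list(params["contexts"])
    # 2. Single-chunk shorthand becomes a one-element list.
    if params.get("context"):
        return [params["context"]]
    # 3. Otherwise look the chunks up in metadata by key.
    key = params.get("context_field")
    if key:
        value: Any = metadata.get(key, [])
        # Assumption: metadata may hold one string or a list of strings.
        return [value] if isinstance(value, str) else list(value)
    return []


metadata = {"retrieved_chunks": ["Paris is the capital and most populous city of France."]}

# Inline list takes precedence even when context_field is also set.
print(resolve_contexts({"contexts": ["inline chunk"], "context_field": "retrieved_chunks"}, metadata))
print(resolve_contexts({"context": "single chunk"}, metadata))
print(resolve_contexts({"context_field": "retrieved_chunks"}, metadata))
```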
For a live RAG agent, the dynamic context_field form is the right shape:
    - type: faithfulness
      params:
        context_field: retrieved_chunks
        judge: rag-judge
        min_score: 0.8

Wire your retrieval tool to set metadata["retrieved_chunks"] on each turn and the assertion picks up the chunks automatically.
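One way to picture the wiring is a wrapper that records whatever the retriever returns. This is an illustrative sketch only: `with_chunk_recording`, `toy_retrieve`, and `INDEX` are hypothetical names, and a plain dict stands in for the per-turn metadata; how a real tool reaches that metadata depends on your integration.

```python
from typing import Callable

# Toy corpus keyed by topic; a real retriever would query a vector store.
INDEX = {
    "france": ["Paris is the capital and most populous city of France."],
    "germany": ["Berlin is the capital of Germany."],
}


def toy_retrieve(query: str) -> list:
    """Return chunks for every topic mentioned in the query."""
    q = query.lower()
    return [chunk for topic, chunks in INDEX.items() if topic in q for chunk in chunks]


def with_chunk_recording(retrieve: Callable[[str], list], metadata: dict,
                         key: str = "retrieved_chunks") -> Callable[[str], list]:
    """Wrap a retriever so each call stores its chunks under `key`."""
    def wrapped(query: str) -> list:
        chunks = retrieve(query)
        # Overwrite rather than append, so an assertion sees this turn's chunks only.
        metadata[key] = chunks
        return chunks
    return wrapped


metadata = {}
retrieve = with_chunk_recording(toy_retrieve, metadata)
retrieve("What is the capital of France?")
print(metadata["retrieved_chunks"])
```

Overwriting on each call is a design choice in this sketch: it keeps a later turn's assertion from grading against chunks retrieved for an earlier question.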
Run it
    cd examples/rag-agent
    promptarena serve

Headless / CI:

    promptarena run --ci --formats html,json
    open out/report.html

Keyless: both the RAG assistant and the LLM judge are mock providers. The mock judge returns {"passed": true, "score": 0.92, "reasoning": "..."} for every call; all six assertions pass with the default min_score thresholds.
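To see why a fixed mock verdict clears every gate, it helps to compare the score against the thresholds directly. The sketch below assumes an assertion passes when score >= min_score (the exact comparison is an assumption), uses only the three min_score values shown in the assertion snippet in this how-to, and `gate` is a hypothetical helper.

```python
import json

# The verdict the mock judge returns for every call, per this how-to.
MOCK_RESPONSE = '{"passed": true, "score": 0.92, "reasoning": "..."}'

# min_score values shown in the assertion snippet in this how-to.
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
    "contextual_precision": 0.5,
}


def gate(raw_verdict: str, min_score: float) -> bool:
    """Assumed pass rule: the judge's score must reach min_score."""
    return json.loads(raw_verdict)["score"] >= min_score


for name, min_score in THRESHOLDS.items():
    print(name, gate(MOCK_RESPONSE, min_score))

# Raising a threshold above the fixed score flips the result.
print(gate(MOCK_RESPONSE, 0.95))  # False
```

Because the mock score never changes, a keyless run checks that the scenario, judge, and thresholds are wired together; it says nothing about answer quality.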
Swapping in a real judge
Replace the mock judge with a real LLM provider:

    # providers/openai-judge.yaml
    apiVersion: promptkit.altairalabs.ai/v1alpha1
    kind: Provider
    metadata:
      name: openai-judge
    spec:
      id: openai-judge
      type: openai
      model: gpt-4o-mini

Update judges: in config.arena.yaml:

    judges:
      - name: rag-judge
        provider: openai-judge

Run with OPENAI_API_KEY set. The mock assistant can stay or be swapped too; the assertions don't care which provider produced the answer.
CI gate
    # .github/workflows/rag-agent.yml
    name: RAG agent

    on:
      pull_request:
        paths:
          - 'examples/rag-agent/**'

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-go@v5
            with:
              go-version: '1.26'
          - run: make build-arena
          - name: Run RAG scenarios
            working-directory: examples/rag-agent
            run: ../../bin/promptarena run --ci --formats json

The default config is keyless. Swap the mock judge for a real one when you want to grade real outputs.
Naming and credit
The handler default prompts are adapted from public DeepEval and Ragas reference implementations (Apache 2.0). Attribution lives in each handler’s docstring. The name choices (faithfulness, answer_relevancy, contextual_*, hallucination) match the buyer-facing vocabulary in the comparison-sheet bake-offs.
Related how-tos
- The Checks Reference has the full parameter list and surface notes for each RAG primitive.
- The Validate Outputs how-to covers the broader assertion mechanism.