# Judges
A judge is the VLM backend that powers VLM-based checks. evalmedia is backend-agnostic — choose based on your cost/quality/latency tradeoffs.
## Supported Judges
### Claude (Anthropic)
Recommended default for quality.
```python
import evalmedia

evalmedia.set_judge("claude", api_key="sk-ant-...")

# Or via environment variable:
# EVALMEDIA_ANTHROPIC_API_KEY=sk-ant-...
```
Default model: `claude-sonnet-4-20250514`, which can be overridden.
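A minimal sketch of overriding the model; the `model` keyword argument is an assumption for illustration, not confirmed from the source:

```python
import evalmedia

# Hypothetical: the `model` parameter name is assumed here.
evalmedia.set_judge(
    "claude",
    api_key="sk-ant-...",
    model="claude-sonnet-4-20250514",
)
```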
### OpenAI (GPT-4.1)
Strong vision capabilities at reasonable cost.
```python
import evalmedia

evalmedia.set_judge("openai", api_key="sk-...")

# Or via environment variable:
# EVALMEDIA_OPENAI_API_KEY=sk-...
```
Default model: `gpt-4.1`, which can be overridden.
## Configuration Hierarchy
API keys and judge settings are resolved in this order (highest priority first):
1. Function arguments — `set_judge("claude", api_key="...")`
2. Environment variables — `EVALMEDIA_ANTHROPIC_API_KEY`
3. Defaults — Claude with no API key (will error on first call)
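The resolution order above can be sketched as follows; `resolve_api_key` is a hypothetical helper for illustration, not part of evalmedia's API:

```python
import os

def resolve_api_key(explicit_key=None, env_var="EVALMEDIA_ANTHROPIC_API_KEY"):
    """Resolve an API key: function argument, then environment, then none."""
    # 1. Function arguments take highest priority.
    if explicit_key is not None:
        return explicit_key
    # 2. Fall back to the environment variable.
    env_key = os.environ.get(env_var)
    if env_key:
        return env_key
    # 3. Default: no key configured; the first judge call would error.
    return None

os.environ["EVALMEDIA_ANTHROPIC_API_KEY"] = "sk-ant-env"
print(resolve_api_key("sk-ant-arg"))  # the explicit argument wins
print(resolve_api_key())              # falls back to the environment
```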
## Per-Check Judge Override
You can use different judges for different checks:
```python
from evalmedia.checks.image import FaceArtifacts, PromptAdherence

# Use Claude for face detection (higher quality)
face_check = FaceArtifacts(judge="claude")

# Use OpenAI for prompt adherence (lower cost)
prompt_check = PromptAdherence(judge="openai")
```
## How Judges Work
Each VLM check sends:
- The image (as base64)
- A check-specific evaluation prompt with scoring criteria
- A system prompt requesting structured JSON output
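The request above can be sketched as a plain payload builder. This is an illustration only; field names such as `image_b64` are assumptions, not evalmedia's actual wire format:

```python
import base64
import json

def build_judge_request(image_bytes, eval_prompt, system_prompt):
    """Assemble a judge request: base64 image + check prompt + system prompt."""
    return {
        "system": system_prompt,     # asks for structured JSON output
        "prompt": eval_prompt,       # check-specific scoring criteria
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
    }

req = build_judge_request(
    b"\x89PNG...",  # raw image bytes
    "Rate facial artifacts from 0-10 and explain.",
    'Respond only with a JSON object: {"score": int, "reason": str}',
)
print(json.dumps(req)[:60])
```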
The judge returns its verdict as JSON, which evalmedia parses into a JudgeResponse using multiple fallback strategies (raw JSON, fenced code blocks, regex extraction).
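A minimal sketch of that fallback chain; `parse_judge_json` is a hypothetical helper, and evalmedia's actual parser and JudgeResponse fields may differ:

```python
import json
import re

def parse_judge_json(text):
    """Try raw JSON, then a fenced ```json block, then any {...} span."""
    # Strategy 1: the whole response is valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy 2: JSON inside a fenced code block.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # Strategy 3: first brace-delimited span anywhere in the text.
    loose = re.search(r"\{.*\}", text, re.DOTALL)
    if loose:
        try:
            return json.loads(loose.group(0))
        except json.JSONDecodeError:
            pass
    return None  # nothing parseable: caller decides how to fail

print(parse_judge_json('Sure! ```json\n{"score": 7}\n``` Hope that helps.'))
```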
## Retry Logic
Both judges automatically retry transient errors (timeouts, connection errors) with exponential backoff, and the retry behavior is configurable.
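Exponential backoff can be sketched as follows; this is an illustration, and parameter names like `max_retries` and `base_delay` are assumptions, not evalmedia's actual settings:

```python
import time

def with_retries(call, max_retries=3, base_delay=1.0,
                 retryable=(TimeoutError, ConnectionError)):
    """Run call(), retrying transient errors with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # succeeds on the third attempt
```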