LLM-as-Judge: Automating Evaluation at Scale
LLM-as-judge is a pattern in which a capable language model (the judge) scores the outputs of another model (the target model being evaluated). Instead of manually rating thousands of examples or waiting for human feedback, an LLM judge can evaluate outputs in seconds. In 2026, this is standard practice: Anthropic, OpenAI, and Google all use multi-judge ensembles to evaluate their models. As a rough guide, a single LLM judge agrees with human judgment on 50–70% of open-ended examples, while multiple judges in consensus reach 85–95%; treat these as ballpark figures and measure agreement on your own data.
The key insight: LLM judges are fast and scalable but biased and fallible. A judge might favor verbose outputs or exhibit preference drift over time. This article teaches you to design judge instructions that minimize bias, implement multi-judge ensembles for reliability, calibrate thresholds on your golden dataset, and integrate judges into your evaluation pipeline at scale.
Designing Judge Instructions
An LLM judge needs clear, unambiguous instructions to evaluate outputs consistently. The instruction is everything: a vague prompt produces inconsistent scoring; a well-designed prompt with examples produces reliable scoring.
def create_judge_prompt(
    question: str,
    reference_answer: str,
    target_output: str,
    task_description: str
) -> str:
    """
    Construct a judge prompt for a single example.
    Format: task context → criteria → question → reference → target output → request for score.
    """
    # Doubled braces ({{ }}) render as literal braces in the f-string's JSON template
    prompt = f"""You are an expert evaluator assessing the quality of a model's response.
TASK DESCRIPTION:
{task_description}
EVALUATION CRITERIA:
1. Correctness: Does the response answer the question accurately?
2. Completeness: Does it address all parts of the question?
3. Clarity: Is the explanation clear and well-organized?
4. Relevance: Does it focus on the question without irrelevant tangents?
QUESTION:
{question}
REFERENCE ANSWER (if available):
{reference_answer}
MODEL'S RESPONSE:
{target_output}
Evaluate the model's response on each criterion above (1–10 scale, where 10 is excellent).
Then provide an overall quality score (1–10).
Format your response as JSON:
{{
"correctness": <1-10>,
"completeness": <1-10>,
"clarity": <1-10>,
"relevance": <1-10>,
"overall_score": <1-10>,
"explanation": "<brief summary of strengths and weaknesses>"
}}
"""
    return prompt
Structure matters. Lead with the task context (what problem are we solving?), then criteria (what makes a good answer?), then the example being evaluated. This order matches how humans approach evaluation. Include explicit scales and definitions: what does a "7" look like vs. a "9"?
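One way to make the scale explicit is to attach a short description to each score band and render those bands into the prompt. The sketch below is illustrative only: the band boundaries, the wording, and the names `SCORE_ANCHORS`, `render_score_anchors`, and `anchors_cover_scale` are assumptions, not part of the prompt shown above.

```python
# Hypothetical score bands; replace the wording with definitions
# your human annotators have agreed on.
SCORE_ANCHORS = {
    (1, 3): "Incorrect or off-topic; a reader would be misled.",
    (4, 6): "Partially correct; notable omissions or errors remain.",
    (7, 8): "Correct and mostly complete; minor gaps or unclear wording.",
    (9, 10): "Correct, complete, and clearly written; nothing important to fix.",
}


def render_score_anchors(anchors: dict[tuple[int, int], str]) -> str:
    """Render score bands as prompt text, one line per band, lowest band first."""
    lines = ["SCORE DEFINITIONS:"]
    for (low, high), description in sorted(anchors.items()):
        lines.append(f"{low}-{high}: {description}")
    return "\n".join(lines)


def anchors_cover_scale(
    anchors: dict[tuple[int, int], str],
    scale_min: int = 1,
    scale_max: int = 10,
) -> bool:
    """Return True if the bands cover every score exactly once (no gaps, no overlaps)."""
    covered = []
    for low, high in sorted(anchors):
        covered.extend(range(low, high + 1))
    return covered == list(range(scale_min, scale_max + 1))


if __name__ == "__main__":
    assert anchors_cover_scale(SCORE_ANCHORS)
    print(render_score_anchors(SCORE_ANCHORS))
```

The rendered text can be placed after the criteria section of the judge prompt, so the judge reads the definitions before it sees the response.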
Add exemplars: examples at different quality levels, each with its expected score. This gives the judge concrete reference points and can substantially improve consistency.
def create_judge_with_exemplars(example: dict) -> str:
    """
    Judge prompt with few-shot examples of expected scores.
    example: dict with keys 'article' and 'summary_to_evaluate'.
    """
    prompt = f"""You are evaluating the quality of a summarization system.
TASK: Summarize a news article in 1–2 sentences, capturing the key facts.
EVALUATION CRITERIA:
- Accuracy: No factual errors or hallucinations.
- Completeness: All critical facts included.
- Conciseness: Minimal fluff, maximum information density.
EXEMPLAR 1 (Score: 9)
Article: "Tesla announced a 4.2% price increase effective March 15..."
Summary: "Tesla increased prices by 4.2% starting March 15 due to supply chain pressures."
Judge explanation: Accurate, all facts included, concise.
EXEMPLAR 2 (Score: 6)
Article: "Tesla announced a 4.2% price increase..."
Summary: "Tesla did something with prices recently."
Judge explanation: Vague, lacks specific facts (percentage, date), incomplete.
NOW EVALUATE THIS OUTPUT:
Article: {example['article']}
Summary: {example['summary_to_evaluate']}
Provide JSON: {{"score": <1-10>, "explanation": "..."}}
"""
    return prompt
Test your judge prompt on your golden dataset: compare the judge's scores to human annotations, and iterate on the prompt until the Spearman rank correlation between the two exceeds 0.70.
Multi-Judge Ensembles
A single judge can be biased or inconsistent. The solution: run multiple judges and aggregate their scores.
import asyncio
import json
import statistics
from typing import List, Optional

from anthropic import AsyncAnthropic

async def multi_judge_evaluation(
    question: str,
    target_output: str,
    reference: str,
    judges: Optional[List[str]] = None,
    temperature: float = 0.3  # Lower temp for consistency
) -> dict:
    """
    Run the same example through multiple judges.
    judges: list of model names to use as judges.
    Returns: individual scores, aggregate, consensus.
    """
    if judges is None:
        # Placeholder names, not real model IDs (None default avoids a shared mutable list)
        judges = ['claude-opus', 'gpt-4', 'llama-70b']
    # Async client, so the judge calls below actually run concurrently
    client = AsyncAnthropic()

    async def run_single_judge(judge_name: str):
        prompt = create_judge_prompt(
            question=question,
            reference_answer=reference,
            target_output=target_output,
            task_description="QA Evaluation"
        )
        # Simplified mock: in production, route each judge to its own provider's
        # API client. An Anthropic client cannot serve the non-Anthropic names above.
        response = await client.messages.create(
            model=judge_name,
            max_tokens=500,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        )
        return judge_name, response.content[0].text

    # Run all judges in parallel
    tasks = [run_single_judge(judge) for judge in judges]
    results = await asyncio.gather(*tasks)

    # Parse and aggregate scores
    scores = []
    explanations = []
    for judge_name, response_text in results:
        try:
            parsed = json.loads(response_text)
            score = parsed['overall_score']
        except (json.JSONDecodeError, KeyError, TypeError):
            # Judge didn't return the expected JSON; skip it (log this in production)
            continue
        scores.append(score)
        explanations.append({
            'judge': judge_name,
            'score': score,
            'explanation': parsed.get('explanation', '')
        })

    # Aggregate: mean, median, std dev
    mean_score = statistics.mean(scores) if scores else 0
    return {
        'individual_scores': scores,
        'mean_score': mean_score,
        'median_score': statistics.median(scores) if scores else 0,
        'std_dev': statistics.stdev(scores) if len(scores) > 1 else 0,
        # Consensus needs at least one score, each within 1.5 points of the mean
        'consensus': bool(scores) and all(abs(s - mean_score) < 1.5 for s in scores),
        'judge_explanations': explanations
    }
Three judges are a good starting point, balancing diversity of perspective against computational cost. If judges disagree (std dev > 2), flag the example for manual review. If two judges agree and one disagrees, investigate why.
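The two rules above can be expressed as a small triage function. This is a sketch under stated assumptions: the function name, the returned field names, and the 2-point `outlier_gap` used to detect a lone dissenting judge are illustrative choices; only the std dev > 2 threshold comes from the text.

```python
import statistics


def triage_judge_scores(
    scores: list[float],
    std_dev_limit: float = 2.0,  # threshold quoted in the text
    outlier_gap: float = 2.0,    # assumed cutoff for "one judge disagrees"
) -> dict:
    """Decide what to do with one example based on how much its judges disagree."""
    if len(scores) < 2:
        return {"action": "manual_review",
                "reason": "fewer than two usable scores",
                "outlier_index": None}

    std_dev = statistics.stdev(scores)
    if std_dev > std_dev_limit:
        return {"action": "manual_review",
                "reason": f"std dev {std_dev:.2f} exceeds {std_dev_limit}",
                "outlier_index": None}

    # Lone dissenter: exactly one score sits far from the median of the group.
    median = statistics.median(scores)
    far = [i for i, s in enumerate(scores) if abs(s - median) >= outlier_gap]
    if len(scores) >= 3 and len(far) == 1:
        return {"action": "investigate_outlier",
                "reason": f"judge at index {far[0]} differs from the others",
                "outlier_index": far[0]}

    return {"action": "accept", "reason": "judges agree", "outlier_index": None}


if __name__ == "__main__":
    for example_scores in ([8, 8, 7], [8, 8, 5], [9, 5, 2]):
        print(example_scores, triage_judge_scores(example_scores)["action"])
```

Checking the standard deviation first keeps the "investigate" path for cases where the ensemble is mostly coherent and a single judge is the odd one out.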
Judge Bias and Calibration
LLM judges exhibit systematic biases: they might favor certain writing styles, be reluctant to give low scores, or prefer their own model's outputs. Calibration mitigates this.
from typing import List

import numpy as np
from scipy.stats import spearmanr, pearsonr

def calibrate_judge_on_golden_dataset(
    judge_model: str,
    golden_examples: List[dict],
    human_scores: List[int]
) -> dict:
    """
    Run judge on golden dataset, compare to human judgment.
    Returns: correlation with human scores, mean bias, and a recommendation.
    """
    judge_scores = []
    for example in golden_examples:
        prompt = create_judge_prompt(
            question=example['input'],
            reference_answer=example['reference'],
            target_output=example['model_output'],
            task_description=example['task']
        )
        # Evaluate using judge (mocked here): send `prompt` to `judge_model`
        response = "placeholder_judge_response"  # Call actual LLM
        # extract_score is a helper not shown here; it is assumed to parse the
        # numeric overall_score out of the judge's JSON response
        judge_scores.append(extract_score(response))

    # Compare to human judgment
    human_scores_arr = np.array(human_scores)
    judge_scores_arr = np.array(judge_scores)
    spearman_corr, _ = spearmanr(human_scores_arr, judge_scores_arr)
    pearson_corr, _ = pearsonr(human_scores_arr, judge_scores_arr)

    # Detect bias: judge systematically over/under-scores
    bias = np.mean(judge_scores_arr) - np.mean(human_scores_arr)
    return {
        'spearman_correlation': spearman_corr,
        'pearson_correlation': pearson_corr,
        'bias': bias,  # Positive = overly generous judge
        'std_dev_judge': np.std(judge_scores_arr),
        'std_dev_human': np.std(human_scores_arr),
        # Below threshold: iterate on the judge prompt or switch judge model
        'recommendation': 'acceptable' if spearman_corr > 0.70 else 'revise_prompt'
    }
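The calibration code calls `extract_score`, which the article does not define. A minimal sketch of one possible implementation follows, assuming the judge was asked for the JSON format shown earlier with an `overall_score` field on a 1–10 scale. Returning `None` for unusable responses is a design choice made here, not something the article specifies.

```python
import json
import re
from typing import Optional


def extract_score(
    response_text: str,
    key: str = "overall_score",
    scale_min: float = 1,
    scale_max: float = 10,
) -> Optional[float]:
    """Pull a numeric score out of a judge response; return None if none is usable."""
    # Judges sometimes wrap the JSON in prose, so take the span from the
    # first "{" to the last "}" and try to parse that.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

    value = parsed.get(key)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    if not scale_min <= value <= scale_max:
        return None
    return float(value)


if __name__ == "__main__":
    print(extract_score('{"overall_score": 8, "explanation": "ok"}'))
    print(extract_score("I would rate this highly."))
```

If this version is used with the calibration function, drop any example whose score is `None` (together with its human score) before computing correlations, since the arrays must stay aligned and numeric.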
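Bias is systematic: if a judge's mean score is 7.2 but the human mean is 6.1, subtract 1.1 from all future judge scores. A constant shift leaves Spearman and Pearson correlation unchanged, since both are invariant to adding an offset, but it reduces absolute error and puts judge scores on the human scale. That can lift threshold-based agreement (the share of examples where judge and human fall on the same side of a pass/fail cutoff) from, say, 0.65 to 0.75+.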
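A minimal sketch of this correction, assuming a 1–10 scale and a pass/fail cutoff of 7. The function names and the toy scores are made up for illustration. Because a constant shift cannot change rank correlation, the sketch measures agreement as the share of examples where judge and human land on the same side of the cutoff.

```python
import statistics


def estimate_bias(judge_scores: list[float], human_scores: list[float]) -> float:
    """Mean judge score minus mean human score; positive means a generous judge."""
    return statistics.mean(judge_scores) - statistics.mean(human_scores)


def correct_score(
    raw_score: float, bias: float, scale_min: float = 1.0, scale_max: float = 10.0
) -> float:
    """Subtract the estimated bias and clamp the result to the scoring scale."""
    return min(scale_max, max(scale_min, raw_score - bias))


def pass_fail_agreement(
    judge_scores: list[float], human_scores: list[float], threshold: float = 7.0
) -> float:
    """Share of examples where judge and human agree on pass (>= threshold) vs fail."""
    matches = sum(
        (j >= threshold) == (h >= threshold)
        for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(human_scores)


# Toy calibration data, invented for this example.
human = [5, 6, 7, 8, 4, 6]
judge = [6.5, 7, 8, 9, 5, 7.5]

bias = estimate_bias(judge, human)
corrected = [correct_score(s, bias) for s in judge]

if __name__ == "__main__":
    print(f"estimated bias: {bias:.2f}")
    print(f"agreement before: {pass_fail_agreement(judge, human):.2f}")
    print(f"agreement after:  {pass_fail_agreement(corrected, human):.2f}")
```

Estimate the bias on one part of the golden dataset and check the corrected agreement on a held-out part; measuring both on the same examples overstates the benefit.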
Integration with Your Evaluation Pipeline
LLM judges work best as the last stage of a pipeline: first run deterministic checks and fast metrics, and send only the examples those stages cannot decide to the judge.
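The first stage can be as simple as a handful of rule-based checks. The sketch below is an assumption about what such checks might look like. The function name and the specific checks are hypothetical, and the returned keys `passed_all` and `checks` are chosen to match the `deterministic_checks` dict that the pipeline function in this section reads.

```python
import json


def run_deterministic_checks(
    model_output: str,
    max_chars: int = 4000,       # assumed length budget
    require_json: bool = False,  # enable when the task demands JSON output
) -> dict:
    """Run cheap rule-based checks and report each result plus an overall flag."""
    checks = {
        "non_empty": bool(model_output.strip()),
        "within_length": len(model_output) <= max_chars,
    }
    if require_json:
        try:
            json.loads(model_output)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False

    return {"passed_all": all(checks.values()), "checks": checks}


if __name__ == "__main__":
    print(run_deterministic_checks("Paris is the capital of France."))
    print(run_deterministic_checks("   "))
```

Returning the per-check results alongside the overall flag lets the pipeline report which rule failed without re-running anything.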
async def evaluate_with_llm_judge(
    model_output: str,
    question: str,
    reference: str,
    deterministic_checks: dict,
    judge_threshold: float = 7.0
) -> dict:
    """
    Evaluation pipeline: deterministic → metrics → judge.
    Short-circuit if deterministic checks fail.
    Async because multi_judge_evaluation is a coroutine and must be awaited.
    """
    # Stage 1: Deterministic validation
    if not deterministic_checks['passed_all']:
        return {
            'stage': 'deterministic',
            'passed': False,
            'reason': deterministic_checks['checks'],
            'needs_judge': False
        }

    # Stage 2: Fast metrics (ROUGE, semantic similarity)
    # compute_rouge_l and compute_semantic_similarity are helpers not shown here
    fast_metrics = {
        'rouge_l': compute_rouge_l(model_output, reference),
        'semantic_sim': compute_semantic_similarity(model_output, reference)
    }

    # Stage 3: LLM judge only if fast metrics are borderline
    # (0.4 <= semantic_sim <= 0.7 = uncertain)
    if 0.4 <= fast_metrics['semantic_sim'] <= 0.7:
        judge_scores = await multi_judge_evaluation(
            question=question,
            target_output=model_output,
            reference=reference,
            judges=['claude-opus', 'gpt-4']
        )
        return {
            'stage': 'judge',
            'fast_metrics': fast_metrics,
            'judge_score': judge_scores['mean_score'],
            'passed': judge_scores['mean_score'] >= judge_threshold,
            'consensus': judge_scores['consensus']
        }

    # Fast metrics decisive: skip judge (above 0.7 passes, below 0.4 fails)
    return {
        'stage': 'fast_metrics',
        'fast_metrics': fast_metrics,
        'passed': fast_metrics['semantic_sim'] > 0.7,
        'needs_judge': False
    }
This three-stage approach is efficient. As an illustration, suppose 70% of examples are decided by deterministic checks (milliseconds), 20% by fast metrics (seconds), and only 10% reach the judge (around 30 seconds). The judge still dominates total evaluation time, but it runs on a tenth of the examples rather than the full set.
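The arithmetic behind that claim can be worked through directly. The per-stage latencies below (5 ms, 2 s, 30 s) are assumed values picked to match "milliseconds", "seconds", and "30 seconds"; the reach fractions follow from the 70/20/10 split, since every example runs stage one, 30% go on to stage two, and 10% reach the judge.

```python
def expected_seconds_per_example(
    reach_fractions: dict[str, float],
    stage_seconds: dict[str, float],
) -> dict:
    """Expected latency per example when each stage runs only on examples that reach it."""
    contributions = {
        stage: reach_fractions[stage] * stage_seconds[stage]
        for stage in stage_seconds
    }
    total = sum(contributions.values())
    return {
        "total_seconds": total,
        "share_by_stage": {s: c / total for s, c in contributions.items()},
    }


# Assumed latencies per stage, in seconds.
STAGE_SECONDS = {"deterministic": 0.005, "fast_metrics": 2.0, "judge": 30.0}
# Fraction of all examples that reach each stage.
REACH = {"deterministic": 1.0, "fast_metrics": 0.3, "judge": 0.1}

result = expected_seconds_per_example(REACH, STAGE_SECONDS)

if __name__ == "__main__":
    print(f"expected seconds per example: {result['total_seconds']:.3f}")
    print(f"judge share of total time: {result['share_by_stage']['judge']:.0%}")
    print(f"judge-everything baseline: {STAGE_SECONDS['judge']:.0f} s per example")
```

Under these assumptions the pipeline averages about 3.6 seconds per example against 30 seconds if every example went to the judge, and the judge accounts for most of what remains.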
Key Takeaways
- Judge instructions are everything: Clear criteria, exemplars, and explicit scales dramatically improve consistency.
- Multi-judge ensembles beat single judges: Three judges (different models or temperatures) can reach 85%+ human agreement.
- Calibrate judges on your golden dataset: Detect bias and correlation; iterate until Spearman > 0.70.
- Use judges as the last stage: Deterministic checks first, then metrics, then judges only for uncertain examples.
- Monitor judge agreement over time: If judges start disagreeing, re-calibrate; you may have task drift.
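For the last takeaway, one simple way to monitor agreement is to track the inter-judge standard deviation over a rolling window and compare it with the value measured at calibration time. The class below is a sketch: its name, the window size, and the 1.5x tolerance are assumptions to tune for your own pipeline.

```python
import statistics
from collections import deque
from typing import Optional


class JudgeAgreementMonitor:
    """Track inter-judge spread over a rolling window and flag sustained increases."""

    def __init__(self, baseline_std_dev: float, window: int = 200, tolerance: float = 1.5):
        self.baseline = baseline_std_dev  # mean spread measured during calibration
        self.tolerance = tolerance        # allowed multiple of the baseline
        self.spreads = deque(maxlen=window)

    def record(self, judge_scores: list[float]) -> None:
        """Store the spread for one example; examples with one score carry no signal."""
        if len(judge_scores) >= 2:
            self.spreads.append(statistics.stdev(judge_scores))

    def rolling_spread(self) -> Optional[float]:
        """Mean spread over the current window, or None if nothing is recorded yet."""
        return statistics.mean(self.spreads) if self.spreads else None

    def needs_recalibration(self) -> bool:
        """True once the window is full and its mean spread exceeds the tolerance."""
        if len(self.spreads) < self.spreads.maxlen:
            return False
        return statistics.mean(self.spreads) > self.baseline * self.tolerance


if __name__ == "__main__":
    monitor = JudgeAgreementMonitor(baseline_std_dev=0.6, window=3)
    for scores in ([8, 8, 7], [8, 8, 7], [8, 8, 7], [9, 5, 2], [9, 5, 2], [9, 5, 2]):
        monitor.record(scores)
        print(scores, monitor.needs_recalibration())
```

Waiting for a full window before alerting avoids reacting to a handful of genuinely ambiguous examples.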
Frequently Asked Questions
Is LLM-as-judge cheaper than hiring human annotators?
Usually yes. Expert human annotation can cost $15–50 per example on complex tasks, while an LLM judge typically costs $0.001–0.01 per example. At those rates, judging 10,000 examples with an LLM saves $100,000+. But judges are less reliable than experts on nuanced tasks; use both for critical systems.
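The savings figure follows from simple arithmetic on the per-example costs quoted in this answer. The helper below and its `n_judges` parameter are illustrative; the parameter is included because an ensemble multiplies judge cost by the number of judges.

```python
def cost_range(
    n_examples: int, cost_low: float, cost_high: float, n_judges: int = 1
) -> tuple[float, float]:
    """Low and high total cost for scoring n_examples, each scored n_judges times."""
    return (n_examples * cost_low * n_judges, n_examples * cost_high * n_judges)


N_EXAMPLES = 10_000
human_low, human_high = cost_range(N_EXAMPLES, 15.0, 50.0)
judge_low, judge_high = cost_range(N_EXAMPLES, 0.001, 0.01, n_judges=3)

# Most conservative comparison: cheapest human rate vs. most expensive judge rate.
min_savings = human_low - judge_high

if __name__ == "__main__":
    print(f"human annotation: ${human_low:,.0f} to ${human_high:,.0f}")
    print(f"three-judge ensemble: ${judge_low:,.0f} to ${judge_high:,.0f}")
    print(f"minimum savings: ${min_savings:,.0f}")
```

Even with three judges per example and the least favorable rates, the judge bill stays in the hundreds of dollars at these prices, so the comparison is driven almost entirely by the human cost estimate.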
What temperature should judges use?
Lower temperature (0.2–0.4) for consistency; higher temperature (0.7–1.0) for diversity. Most production systems use 0.3: low enough for reproducibility, high enough to avoid mode collapse. If judges agree too much (std dev < 0.5), raise temperature slightly.
How do I handle strong disagreement between judges?
Disagreement often signals an ambiguous example. Log it, manually review, and clarify the task definition or reference answer. If judges agree within 1–2 points out of 10, that's acceptable variance; at 3 points or more, investigate.
Can I fine-tune a judge?
Yes, but it's work. Collect a dataset of judge outputs and human corrections, then fine-tune on it. In 2026, it's usually faster to switch to a more capable off-the-shelf judge (for example, upgrading from GPT-3.5 to GPT-4) than to fine-tune.
What if my task is domain-specific and judges don't understand it?
Add task-specific context to the judge prompt: domain terminology, examples, constraints. If that doesn't work, use a smaller, specialized judge model fine-tuned on your domain. Failing that, use a domain expert as judge (human or hybrid).