Building and Iterating on Evaluation Rubrics

An evaluation rubric is a structured set of criteria and scoring scales used to assess output quality. A good rubric translates subjective notions of quality ("this answer is good") into explicit, measurable dimensions ("factuality: 8/10, clarity: 7/10, relevance: 9/10"). Rubrics are the backbone of both human and LLM-as-judge evaluation. As of 2026, rubrics are standard practice on production teams, yet many are poorly designed, overly complex, or misaligned with actual quality. This article teaches you to design rubrics that are clear, measurable, and actionable; build exemplars that make scoring consistent; and iterate on rubrics as your task and model evolve.

Defining Rubric Dimensions

A rubric's strength lies in its dimensions: the specific aspects of quality you're measuring. A QA system might measure factuality, completeness, clarity, and relevance. A code generator might measure correctness, efficiency, and readability. The art is choosing dimensions that are (a) independent (each captures something different), (b) observable (raters can agree on them), and (c) actionable (improvements in one dimension reflect real quality improvements).

from typing import List, Dict

class RubricDimension:
    """A single dimension in an evaluation rubric."""

    def __init__(
        self,
        name: str,
        definition: str,
        scale: int = 10,
        weight: float = 1.0
    ):
        self.name = name              # E.g., "factuality", "clarity"
        self.definition = definition  # What does this dimension measure?
        self.scale = scale            # Top of the scale, e.g., 10 for 1–10
        self.weight = weight          # Importance: 1.0 = baseline

        # Anchor definitions: what does each score point mean?
        self.anchors = {}

    def add_anchor(self, score: int, description: str):
        """
        Define what a specific score means.
        E.g., 8: "mostly accurate with minor details missing"
        """
        self.anchors[score] = description

class Rubric:
    """A complete evaluation rubric with multiple dimensions."""

    def __init__(self, name: str, task_description: str):
        self.name = name
        self.task_description = task_description
        self.dimensions: List[RubricDimension] = []

    def add_dimension(self, dimension: RubricDimension):
        self.dimensions.append(dimension)

    def compute_overall_score(self, dimension_scores: Dict[str, int]) -> float:
        """
        Compute weighted overall score from individual dimensions.
        """
        weighted_sum = sum(
            dimension_scores.get(d.name, 0) * d.weight
            for d in self.dimensions
        )
        total_weight = sum(d.weight for d in self.dimensions)
        return weighted_sum / total_weight if total_weight > 0 else 0

# Example: QA evaluation rubric
qa_rubric = Rubric("QA System Evaluation", "Assess quality of answers to factual questions")

factuality = RubricDimension("factuality", "Does the answer contain accurate information?")
factuality.add_anchor(9, "All facts are correct and well-supported")
factuality.add_anchor(7, "Mostly accurate; minor details or context missing")
factuality.add_anchor(5, "Mix of correct and incorrect information")
factuality.add_anchor(3, "Mostly inaccurate; some correct facts")
factuality.add_anchor(1, "Completely false or hallucinated")
qa_rubric.add_dimension(factuality)

clarity = RubricDimension("clarity", "Is the answer clear and well-organized?", weight=0.8)
clarity.add_anchor(9, "Exceptionally clear; easy to follow and understand")
clarity.add_anchor(7, "Clear with minor organizational issues")
clarity.add_anchor(5, "Somewhat unclear; needs re-reading")
clarity.add_anchor(3, "Confusing; poor organization")
clarity.add_anchor(1, "Incomprehensible")
qa_rubric.add_dimension(clarity)

relevance = RubricDimension("relevance", "Does it address the question asked?", weight=1.0)
relevance.add_anchor(9, "Directly answers all parts of the question")
relevance.add_anchor(7, "Answers main question; minor parts missed")
relevance.add_anchor(5, "Partially addresses the question")
relevance.add_anchor(3, "Mostly off-topic")
relevance.add_anchor(1, "Completely irrelevant")
qa_rubric.add_dimension(relevance)

Notice: each dimension has anchors (descriptions of what each score level means). This is critical—without anchors, raters guess what a "7" means. With anchors, they measure against a common reference.
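One way to put anchors in front of raters (or into an LLM judge prompt) is to render each dimension as a short text guide. The sketch below is illustrative only: `render_dimension_guide` is a hypothetical helper, not part of the classes above, and the output layout is an assumption.

```python
from typing import Dict

def render_dimension_guide(name: str, definition: str, anchors: Dict[int, str]) -> str:
    """Format one dimension and its anchors as plain-text rater instructions."""
    lines = [f"Dimension: {name}", f"Definition: {definition}", "Anchors:"]
    # List anchors from the highest score down so the best case reads first.
    for score in sorted(anchors, reverse=True):
        lines.append(f"  {score}: {anchors[score]}")
    return "\n".join(lines)

# Anchors copied from the factuality dimension defined earlier.
factuality_anchors = {
    9: "All facts are correct and well-supported",
    7: "Mostly accurate; minor details or context missing",
    5: "Mix of correct and incorrect information",
    3: "Mostly inaccurate; some correct facts",
    1: "Completely false or hallucinated",
}

guide = render_dimension_guide(
    "factuality",
    "Does the answer contain accurate information?",
    factuality_anchors,
)
print(guide)
```

Keeping the guide generated from the same anchor data that the rubric stores means the instructions raters see cannot drift from the rubric definition.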

Start with 3–5 dimensions. More than 5 becomes cognitively overwhelming for raters. If you find yourself wanting more dimensions, combine related ones or remove low-variance ones.
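To decide which dimensions to drop or merge, a quick audit of pilot scores can help. The following is a minimal sketch under stated assumptions: the cutoffs (`min_variance`, `max_correlation`) are illustrative defaults rather than recommendations from this article, the pilot scores are made up, and `audit_dimensions` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean, pvariance
from typing import Dict, List, Tuple

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def audit_dimensions(
    scores: Dict[str, List[int]],
    min_variance: float = 0.25,    # assumed cutoff for "low variance"
    max_correlation: float = 0.9,  # assumed cutoff for "redundant"
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (low-variance dimensions, highly correlated dimension pairs)."""
    low_variance = [name for name, s in scores.items() if pvariance(s) < min_variance]
    redundant = [
        (a, b)
        for a, b in combinations(scores, 2)
        if pearson(scores[a], scores[b]) > max_correlation
    ]
    return low_variance, redundant

# Made-up pilot scores: six outputs scored 1-5 on four candidate dimensions.
pilot_scores = {
    "factuality":   [5, 3, 4, 2, 5, 1],
    "completeness": [5, 3, 4, 2, 4, 1],
    "clarity":      [4, 4, 2, 5, 3, 3],
    "politeness":   [5, 5, 5, 5, 5, 4],
}

low_variance, redundant = audit_dimensions(pilot_scores)
print("Low variance (candidates to remove):", low_variance)
print("Highly correlated (candidates to merge):", redundant)
```

In this toy data, "politeness" barely varies and "factuality" tracks "completeness" closely, so the audit flags one candidate for removal and one pair for merging. A high correlation is only a prompt to look closer; two dimensions can correlate on a small sample and still measure different things.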

Scale Design and Granularity

Should your scale be 1–5, 1–10, or 1–100? This affects scoring precision and rater fatigue.

def compare_scales():
    """Trade-offs between different scales."""

    scales = {
        '1-3 (binary+1)': {
            'granularity': 'low',
            'rater_fatigue': 'low',
            'interpretability': 'high',
            'typical_use': 'binary pass/fail + nuance'
        },
        '1-5 (Likert)': {
            'granularity': 'medium',
            'rater_fatigue': 'low',
            'interpretability': 'high',
            'typical_use': 'general purpose, most common'
        },
        '1-10': {
            'granularity': 'high',
            'rater_fatigue': 'medium',
            'interpretability': 'medium',
            'typical_use': 'nuanced scoring, expert raters'
        },
        '1-100': {
            'granularity': 'very high',
            'rater_fatigue': 'high',
            'interpretability': 'low',
            'typical_use': 'rare; only for continuous scores'
        }
    }

    return scales

# Recommendation: start with 1–5
# - 1: Fail (major issues)
# - 2: Poor (significant issues)
# - 3: Acceptable (some issues)
# - 4: Good (minor issues)
# - 5: Excellent (no meaningful issues)

A 1–5 scale is a common default: it is granular enough to differentiate quality, but not so granular that raters waste time agonizing over a 7 versus an 8. If you need more precision, use 1–10, but expect lower inter-rater agreement.
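The agreement cost of a finer scale can be illustrated by scoring on 1–10 and then collapsing to 1–5. This is a sketch with invented scores chosen to make the effect visible; real data will rarely be this clean, and the pairing rule in `collapse_to_five` is one assumed mapping among several possible ones.

```python
from typing import List

def collapse_to_five(score_10: int) -> int:
    """Map a 1-10 score onto 1-5 by pairing adjacent points (1-2 -> 1, ..., 9-10 -> 5)."""
    return (score_10 + 1) // 2

def exact_agreement(a: List[int], b: List[int]) -> float:
    """Fraction of items on which two raters gave the identical score."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equal-length score lists")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Invented 1-10 scores from two raters on the same six outputs.
rater_a = [7, 8, 5, 9, 3, 6]
rater_b = [8, 8, 6, 9, 4, 5]

agreement_10 = exact_agreement(rater_a, rater_b)
agreement_5 = exact_agreement(
    [collapse_to_five(s) for s in rater_a],
    [collapse_to_five(s) for s in rater_b],
)

print(f"Exact agreement on 1-10: {agreement_10:.2f}")
print(f"Exact agreement after collapsing to 1-5: {agreement_5:.2f}")
```

The two raters here never differ by more than one point, yet they match exactly on only a third of items at 1–10. After collapsing, they match on all of them, which is the sense in which a coarser scale trades precision for agreement.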

Exemplars: The Foundation of Consistent Scoring

The most impactful lever for improving rubric consistency: exemplars. For each dimension and score level, provide an actual example output and explain why it deserves that score.

class RubricExemplar:
    """An example output with explanation for a specific score."""

    def __init__(
        self,
        dimension: str,
        score: int,
        input_text: str,
        output_text: str,
        explanation: str
    ):
        self.dimension = dimension
        self.score = score
        self.input_text = input_text
        self.output_text = output_text
        self.explanation = explanation

def build_exemplar_library(rubric: Rubric) -> List[RubricExemplar]:
    """
    Create 2–3 exemplars per dimension/score combo.
    This is the most important part of rubric development.
    """
    exemplars = [
        RubricExemplar(
            dimension="factuality",
            score=9,
            input_text="What is the capital of France?",
            output_text="Paris, located in the Ile-de-France region, has been the capital since the 12th century.",
            explanation="All facts are correct. Provides additional context (region, historical note) that enhances answer."
        ),
        RubricExemplar(
            dimension="factuality",
            score=7,
            input_text="What is the capital of France?",
            output_text="The capital of France is Paris.",
            explanation="Correct but minimal. Factually accurate but lacks supporting detail or context."
        ),
        RubricExemplar(
            dimension="factuality",
            score=5,
            input_text="What is the capital of France?",
            output_text="France's capital is London. Paris is a major city but not the official capital since 1940.",
            explanation="Contains a major error (London is UK's capital) mixed with partially correct info (Paris is major city, but factually wrong about 1940). Misleading."
        ),
        RubricExemplar(
            dimension="factuality",
            score=1,
            input_text="What is the capital of France?",
            output_text="France doesn't have a capital. The country is divided among regional city-states.",
            explanation="Completely false. Demonstrates hallucination: confidently asserting incorrect structure."
        ),

        RubricExemplar(
            dimension="clarity",
            score=9,
            input_text="Explain photosynthesis to a 10-year-old.",
            output_text="Photosynthesis is how plants make food. Plants use sunlight, water, and air (carbon dioxide) to create sugar, which gives them energy to grow. It's like plants eating sunlight! They also release oxygen, which we breathe.",
            explanation="Uses age-appropriate language, analogies (eating sunlight), and logical flow. Accessible without oversimplifying."
        ),
        RubricExemplar(
            dimension="clarity",
            score=5,
            input_text="Explain photosynthesis to a 10-year-old.",
            output_text="Photosynthesis involves light-dependent and light-independent reactions where chlorophyll captures photons, exciting electrons in photosystem II. The electron transport chain transfers energy to ATP and NADPH.",
            explanation="Technically correct but uses graduate-level jargon for a child. Very unclear; rater needs expert knowledge to validate accuracy."
        ),
    ]

    return exemplars

Exemplars are the secret weapon of professional evaluation teams. They compress months of implicit knowledge into concrete examples. When raters disagree, the exemplar library is your tiebreaker: "Does this output align more with the 9-point or the 7-point exemplar?"
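A library is only a useful tiebreaker if it covers the score levels raters actually argue about, so it can be worth checking coverage mechanically. The sketch below assumes exemplars can be reduced to `(dimension, score)` pairs; `missing_exemplars` and its `min_per_cell` default are hypothetical, with the default taken from the lower end of the "2–3 per combo" guidance.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def missing_exemplars(
    exemplars: Iterable[Tuple[str, int]],
    dimensions: List[str],
    score_levels: List[int],
    min_per_cell: int = 2,
) -> Dict[Tuple[str, int], int]:
    """Return {(dimension, score): shortfall} for every cell below min_per_cell."""
    counts = Counter(exemplars)
    gaps: Dict[Tuple[str, int], int] = {}
    for dim in dimensions:
        for score in score_levels:
            shortfall = min_per_cell - counts[(dim, score)]
            if shortfall > 0:
                gaps[(dim, score)] = shortfall
    return gaps

# (dimension, score) pairs for the six exemplars in the library above.
library = [
    ("factuality", 9), ("factuality", 7), ("factuality", 5), ("factuality", 1),
    ("clarity", 9), ("clarity", 5),
]

gaps = missing_exemplars(
    library,
    dimensions=["factuality", "clarity", "relevance"],
    score_levels=[1, 3, 5, 7, 9],  # the anchored score points from the QA rubric
)

print(f"Cells below target: {len(gaps)}")
print(f"Exemplars still to write: {sum(gaps.values())}")
```

Run against the six exemplars shown earlier, every anchored cell is still short of two exemplars, and the relevance dimension has none at all.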

Rubric Iteration and Refinement

Rubrics are not static. As your model improves, task definitions shift, or you discover ambiguities, refine your rubric.

import numpy as np
from scipy.stats import spearmanr

def evaluate_rubric_quality(
    rubric: Rubric,
    golden_examples: List[dict],
    human_scores: Dict[str, List[int]],
    judge_scores: Dict[str, List[int]]
) -> dict:
    """
    Measure rubric quality per dimension: judge-human correlation and the
    spread of human scores, plus recommendations for weak dimensions.

    human_scores and judge_scores map dimension name -> list of scores,
    aligned by index with golden_examples (which is not otherwise used here).
    """
    results = {
        'dimensions': {},
        'overall_quality': None,  # Placeholder; not computed in this function
        'recommendations': []
    }

    for dimension in rubric.dimensions:
        human_dim_scores = human_scores.get(dimension.name, [])
        judge_dim_scores = judge_scores.get(dimension.name, [])

        if not human_dim_scores or not judge_dim_scores:
            continue

        # Rank correlation: judge scores vs. human scores on the same examples
        correlation, p_value = spearmanr(human_dim_scores, judge_dim_scores)

        # Spread of human scores across the golden examples. This is only a
        # rough signal: measuring rater agreement directly requires several
        # raters' scores for the same outputs.
        human_variance = np.var(human_dim_scores)

        results['dimensions'][dimension.name] = {
            'correlation': correlation,
            'variance': human_variance,
            'is_measurable': correlation > 0.70,
            'is_stable': human_variance < 1.5
        }

        if correlation < 0.70:
            results['recommendations'].append(
                f"Dimension '{dimension.name}': Low judge-human correlation ({correlation:.2f}). "
                f"Add exemplars or clarify definition."
            )

        # Variance between 1.5 and 2.0 is neither "stable" nor flagged
        if human_variance > 2.0:
            results['recommendations'].append(
                f"Dimension '{dimension.name}': High variance in human scores ({human_variance:.2f}). "
                f"Check whether raters disagree on this dimension; it may be ill-defined or conflating multiple concepts."
            )

    return results

def refine_rubric(rubric: Rubric, evaluation: dict):
    """
    Iteratively improve rubric based on evaluation feedback.
    """
    for rec in evaluation['recommendations']:
        print(f"Action: {rec}")
        # In practice: update dimension definitions, add exemplars, or merge dimensions

Schedule rubric reviews quarterly. Run your judge against your golden dataset, measure inter-rater agreement and judge-human correlation, and iterate. After each iteration, re-validate on the golden examples. As a rule of thumb, treat 0.75 correlation with human judgment as the minimum bar and 0.80 or higher as production-ready.
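Inter-rater agreement needs two or more raters scoring the same outputs. The sketch below computes Spearman correlation between two raters in pure Python, as a dependency-free illustration of what `scipy.stats.spearmanr` reports; the rater scores are invented, and `average_ranks` and `spearman` are hypothetical helper names.

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one rater gives a constant score
    return cov / (sx * sy)

# Invented 1-5 factuality scores from two raters on the same six outputs.
rater_a = [5, 3, 4, 2, 5, 1]
rater_b = [4, 3, 4, 2, 5, 1]

agreement = spearman(rater_a, rater_b)
print(f"Inter-rater Spearman: {agreement:.2f}")
```

Spearman only checks that raters order outputs the same way. Two raters who are consistently one point apart still score as agreeing, so it is worth looking at mean scores per rater as well.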

Key Takeaways

  • 3–5 independent dimensions capture quality: More dimensions = lower rater agreement and higher cognitive load.
  • Exemplars are the highest-leverage investment: 2–3 examples per dimension/score combo transform rubric consistency.
  • Use 1–5 scales unless you have specific reasons for finer granularity: Raters fatigue easily; Likert scales are standard.
  • Validate rubric quality on golden dataset: Aim for 0.75+ inter-rater agreement and high correlation between dimensions and overall quality.
  • Iterate quarterly: Rubrics degrade as models improve and tasks evolve; keep them aligned with reality.

Frequently Asked Questions

How many exemplars do I need per dimension?

Aim for 2–3 per score level per dimension. For a 5-point scale with 3 dimensions, that's 30–45 exemplars total—manageable. Each exemplar should be drawn from real examples or written to be realistic.

Should all dimensions have equal weight?

Not necessarily. In a code generation rubric, correctness might matter more than style. Set weights based on your business goals: correctness 1.5x, style 1.0x. Validate that the weighted score correlates with overall quality.
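As a worked example of the 1.5x / 1.0x weighting, the sketch below computes a weighted mean for a single output. The scores are invented, and `weighted_score` is a hypothetical helper; unlike `compute_overall_score` earlier, it raises on a missing dimension instead of counting it as 0, which is a deliberate assumption made for this illustration.

```python
from typing import Dict

def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean over the dimensions listed in weights."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight

weights = {"correctness": 1.5, "style": 1.0}
scores = {"correctness": 4, "style": 2}  # invented 1-5 scores for one output

overall = weighted_score(scores, weights)          # (4*1.5 + 2*1.0) / 2.5
unweighted = sum(scores.values()) / len(scores)    # (4 + 2) / 2

print(f"Weighted: {overall:.2f}, unweighted: {unweighted:.2f}")
```

Here the weighting lifts the overall score from 3.0 to 3.2 because the stronger dimension is also the more heavily weighted one.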

What if raters strongly disagree on one dimension?

That dimension is probably conflating multiple concepts or poorly defined. Either split it (create two dimensions) or remove it. Use the exemplar library to identify which score levels raters are confusing, then clarify the boundary between them.

How do I know when a rubric is "done"?

When inter-rater agreement (Spearman > 0.75) and judge-human correlation (> 0.75) both exceed their thresholds. If agreement is still below 0.70 after adding exemplars and clarifications, the dimension may not be measurable; remove it.
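The decision rule in this answer can be written down as a small per-dimension check. This is a sketch under assumptions: the thresholds are the ones quoted in the answer, the metric values are invented, and `dimension_status` and its `already_clarified` flag are hypothetical names.

```python
from typing import Dict, Tuple

def dimension_status(
    inter_rater: float,
    judge_human: float,
    already_clarified: bool,
    done_at: float = 0.75,  # threshold quoted in the answer above
    floor: float = 0.70,    # below this after clarification, consider removal
) -> str:
    """Classify a dimension as 'done', 'iterate', or 'consider removing'."""
    if inter_rater > done_at and judge_human > done_at:
        return "done"
    if already_clarified and inter_rater < floor:
        return "consider removing"
    return "iterate"

# Invented (inter-rater Spearman, judge-human correlation) pairs per dimension.
metrics: Dict[str, Tuple[float, float]] = {
    "factuality": (0.82, 0.79),
    "clarity": (0.73, 0.81),
    "relevance": (0.61, 0.66),
}

statuses = {
    name: dimension_status(agreement, correlation, already_clarified=True)
    for name, (agreement, correlation) in metrics.items()
}

for name, status in statuses.items():
    print(f"{name}: {status}")
```

A rubric is "done" in this sense only when every dimension returns "done"; a single dimension stuck at "iterate" is a reason to revisit its definition and exemplars before shipping.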

Can I reuse rubrics across tasks?

Some dimensions transfer (clarity, correctness). Others are task-specific. Start with a generic rubric, then customize dimensions and exemplars for your task. Document the customizations so you can track what changed.

Further Reading