Online Evaluation Frameworks: Real-Time LLM Quality

Online evaluation is the process of scoring LLM outputs in production, in real time, as users interact with your system. Unlike offline evaluation (which tests on fixed datasets before launch), online evals measure quality on live, dynamic data. This article covers how to design and deploy evaluation frameworks that score outputs quickly, handle heterogeneous quality criteria, and feed results into your monitoring and feedback loops.

The case for online evaluation

Offline evaluation (running a suite of benchmarks on a fixed test set before deployment) is necessary but insufficient. Test sets age quickly; users discover use cases and edge cases your benchmarks never covered. By the time you realize your model underperforms on a new domain, production users may have already experienced degradation.

Online evaluation solves this by making quality assessment part of the production loop. As each user query is answered, you immediately score the output using lightweight evaluators. These scores feed into your monitoring dashboards, trigger alerts, and provide training data for continuous improvement. This closes the feedback loop: poor outputs are flagged immediately, not days or weeks later via user complaints.

Designing evaluators for production: speed vs. accuracy

An evaluator is a function that takes a (prompt, output) pair and returns a quality score. Evaluators vary in complexity:

Rule-based / heuristic evaluators (fastest):

  • Output token count within expected range?
  • Does output contain required keywords?
  • Is output free of known bad patterns (PII, URLs, code if not allowed)?
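The checks in this list can be sketched with plain string and regex operations. The function name, score keys, and the two patterns below are illustrative assumptions, not a complete PII or URL detector:

```python
import re

# Hypothetical patterns for illustration; real PII detection needs a broader rule set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def rule_based_scores(output, required_keywords=(), min_words=10, max_words=500):
    """Return a dict of 0.0/1.0 scores from cheap string checks."""
    words = output.split()
    lowered = output.lower()
    return {
        # Word count is a rough stand-in for token count.
        "length_ok": 1.0 if min_words <= len(words) <= max_words else 0.0,
        "keywords_ok": 1.0 if all(k.lower() in lowered for k in required_keywords) else 0.0,
        "no_email": 0.0 if EMAIL_RE.search(output) else 1.0,
        "no_url": 0.0 if URL_RE.search(output) else 1.0,
    }


clean = rule_based_scores(
    "Refunds are processed within five business days after we receive the returned item.",
    required_keywords=["refund"],
)
leaky = rule_based_scores("Mail jane.doe@example.com or see https://example.com/help")
print(clean)
print(leaky)
```

Checks like these run in microseconds, so they can usually sit in the request path without a noticeable cost.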

Lightweight neural evaluators (medium speed):

  • Semantic similarity between input and output (cosine similarity of embeddings).
  • Readability score (Flesch-Kincaid).
  • Sentiment or toxicity classification.
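As one concrete case from this list, the Flesch-Kincaid grade level is a closed-form formula over word, sentence, and syllable counts. The sketch below assumes a crude vowel-group syllable counter, so treat its output as approximate; the function names are illustrative:

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels; a heuristic, not a dictionary lookup."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade: 0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = flesch_kincaid_grade("The cat sat on the mat. It was warm.")
dense = flesch_kincaid_grade(
    "Organizational heterogeneity complicates longitudinal evaluation methodology considerably."
)
print(f"simple={simple:.2f}, dense={dense:.2f}")
```

A higher grade means harder text; very short, simple passages can score below zero.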

Heavy neural evaluators (slowest, most accurate):

  • Fine-tuned reward models (encode task-specific quality criteria).
  • Entailment or consistency checking (does output contradict input?).
  • Summarization evaluators (is the summary factually accurate?).

The trade-off: heavier evaluators are more accurate but add latency (100–500 ms per output). In production, you must balance evaluation quality against your latency budget. A rule of thumb: any evaluator that runs in the request path should add less than 5% to your total inference latency.
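To make the 5% guideline concrete, the following sketch turns an inference latency into an evaluation budget and times a candidate check against it. The helper names and the 1200 ms figure are assumptions for illustration:

```python
import time


def eval_budget_ms(inference_latency_ms, max_overhead=0.05):
    """Latency an in-path evaluator may add under the 5% rule of thumb."""
    return inference_latency_ms * max_overhead


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000


# A 1200 ms inference leaves about 60 ms for in-path evaluation.
budget = eval_budget_ms(1200)
_, cost_ms = timed(lambda text: len(text.split()), "a short model output")
within_budget = cost_ms <= budget
print(f"budget={budget:.1f} ms, measured={cost_ms:.3f} ms, ok={within_budget}")
```

Under this budget, an evaluator in the 100–500 ms range would not fit in the request path of a 1200 ms inference, which is one reason to run heavier evaluators in the background.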

# Multi-tier evaluation framework
import time

import numpy as np
from sentence_transformers import SentenceTransformer

# Loaded once at import time; encoding is the main Tier 2 cost
model = SentenceTransformer("all-MiniLM-L6-v2")

class OnlineEvaluator:
    def __init__(self, latency_budget_ms=50):
        self.latency_budget_ms = latency_budget_ms

    def evaluate(self, prompt: str, output: str) -> dict:
        """
        Composite evaluation: run fast checks first, then slower checks if budget allows.
        """
        start = time.perf_counter()
        scores = {}

        # Tier 1: Rule-based (fast); word count stands in for token count
        scores["has_output"] = 1.0 if len(output.strip()) > 0 else 0.0
        scores["token_count_ok"] = 1.0 if 10 < len(output.split()) < 500 else 0.0

        # Tier 2: Lightweight (medium): cosine similarity of embeddings
        try:
            prompt_emb = model.encode(prompt)
            output_emb = model.encode(output)
            similarity = np.dot(prompt_emb, output_emb) / (
                np.linalg.norm(prompt_emb) * np.linalg.norm(output_emb) + 1e-8
            )
            scores["relevance"] = float(similarity)
        except Exception:
            scores["relevance"] = 0.5  # Neutral default on error

        # Tier 3: Refusal heuristic (skipped if the latency budget is already spent)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms <= self.latency_budget_ms:
            # Substring match, so expect some false positives
            refusal_phrases = ["cannot", "unable", "not able"]
            scores["refusal"] = 1.0 if any(p in output.lower() for p in refusal_phrases) else 0.0

        # Aggregate (cast to float so the result is JSON-serializable)
        scores["overall"] = float(
            np.mean([scores["has_output"], scores["token_count_ok"], scores["relevance"]])
        )

        return scores

# Example usage
evaluator = OnlineEvaluator(latency_budget_ms=50)
scores = evaluator.evaluate(
    prompt="What is AI?",
    output="Artificial Intelligence is the simulation of human intelligence by machines."
)
print(f"Evaluation scores: {scores}")

Multi-metric evaluation with weighted aggregation

Rather than a single score, return a multi-dimensional evaluation vector. This lets you:

  • Track individual metrics separately (identify which aspect is degrading).
  • Weight metrics according to your business priorities.
  • Adapt weights without retraining evaluators.

# Multi-metric evaluation with configurable weights
# Reuses `np` and `model` from the previous example
class WeightedEvaluator:
    def __init__(self, metrics_config):
        """
        metrics_config: dict of {metric_name: weight}
        """
        self.metrics_config = metrics_config
        # dtype=float so integer weights (e.g. {"relevance": 2}) normalize correctly
        weights = np.array(list(metrics_config.values()), dtype=float)
        self.weights = weights / weights.sum()  # Normalize to sum to 1.0

    def evaluate(self, prompt, output):
        """
        Compute weighted composite score.
        """
        scores = {}

        # Compute individual metrics
        scores["relevance"] = self._relevance(prompt, output)
        scores["coherence"] = self._coherence(output)
        scores["safety"] = self._safety(output)
        scores["conciseness"] = self._conciseness(output)

        # Weighted aggregate, in the same key order as the weights
        metric_scores = [scores[k] for k in self.metrics_config.keys()]
        weighted_score = float(np.dot(self.weights, metric_scores))

        return {
            **scores,
            "weighted_score": weighted_score,
            "weights": {k: float(w) for k, w in zip(self.metrics_config.keys(), self.weights)}
        }

    def _relevance(self, prompt, output):
        # Semantic similarity (cosine similarity of embeddings)
        prompt_emb = model.encode(prompt)
        output_emb = model.encode(output)
        return float(
            np.dot(prompt_emb, output_emb)
            / (np.linalg.norm(prompt_emb) * np.linalg.norm(output_emb) + 1e-8)
        )

    def _coherence(self, output):
        # Check for repetition, grammar (placeholder)
        return 0.8  # Placeholder

    def _safety(self, output):
        # Naive keyword check (placeholder for a real toxicity classifier)
        toxic_keywords = ["violence", "illegal", "hate"]
        return 0.0 if any(k in output.lower() for k in toxic_keywords) else 1.0

    def _conciseness(self, output):
        # Penalty for verbose outputs (word count as a proxy for tokens)
        tokens = len(output.split())
        if tokens > 300:
            return 0.5
        elif tokens > 150:
            return 0.8
        else:
            return 1.0

# Configure for your task
config = {
    "relevance": 0.4,
    "coherence": 0.3,
    "safety": 0.2,
    "conciseness": 0.1
}
evaluator = WeightedEvaluator(config)
scores = evaluator.evaluate("What is AI?", "AI is a broad field...")
print(f"Multi-metric scores: {scores}")
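Because per-metric scores are stored separately, a composite can be recomputed under new weights without re-running any evaluator. A minimal sketch using only the standard library; the logged values and the second weight set are made up for illustration:

```python
def weighted_score(metric_scores, weights):
    """Weighted mean of metric_scores, with weights normalized to sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metric_scores[name] * w / total for name, w in weights.items())


# Metric scores as they might have been logged for one response (made-up values).
logged = {"relevance": 0.9, "coherence": 0.8, "safety": 1.0, "conciseness": 0.5}

balanced = weighted_score(
    logged, {"relevance": 0.4, "coherence": 0.3, "safety": 0.2, "conciseness": 0.1}
)
brevity_first = weighted_score(
    logged, {"relevance": 1, "coherence": 1, "safety": 1, "conciseness": 3}
)
print(f"balanced={balanced:.2f}, brevity_first={brevity_first:.2f}")
```

The same logged response scores lower once conciseness dominates the weighting, which is the kind of what-if analysis that a single pre-aggregated score would not allow.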

Integration with production pipelines

Evaluators must integrate seamlessly into your inference pipeline. The flow looks like this:

  1. User submits prompt.
  2. LLM generates output.
  3. (Async) Evaluator scores output; stores score in database.
  4. User receives output immediately (don't block on evaluation).
  5. Evaluation scores feed into monitoring, dashboards, feedback collection.

# Async evaluation integration
import asyncio
import time
from datetime import datetime, timezone

# Hold references to background tasks so they are not garbage-collected mid-run
background_tasks = set()

async def generate_and_evaluate(client, prompt):
    """
    Generate LLM output and async-evaluate it.
    Return output to user immediately; evaluation happens in background.
    Assumes `client` is an async client (e.g. openai.AsyncOpenAI) so the call can be awaited.
    """

    # Step 1: Generate
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    output = response.choices[0].message.content
    inference_latency_ms = (time.perf_counter() - start) * 1000

    # Step 2: Build the response for the user
    result = {
        "output": output,
        "inference_latency_ms": inference_latency_ms,
        "eval_scores": None  # Scores are written to the store, not to this dict
    }

    # Step 3: Async evaluation (fire and forget)
    task = asyncio.create_task(
        evaluate_and_store(prompt, output, inference_latency_ms)
    )
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)

    return result

async def evaluate_and_store(prompt, output, inference_latency_ms):
    """
    Background task: evaluate output and store scores.
    """
    evaluator = OnlineEvaluator()
    # evaluate() is synchronous and CPU-bound; run it in a worker thread
    # so it does not block the event loop
    scores = await asyncio.to_thread(evaluator.evaluate, prompt, output)

    # Store in database
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
        "inference_latency_ms": inference_latency_ms,
        "eval_scores": scores
    }

    # Insert into monitoring store (Prometheus, Datadog, custom DB)
    # db.insert("eval_logs", record)
    print(f"Stored eval record: {record}")
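Once scores are being stored, the monitoring step can be as simple as a rolling mean with an alert floor. The sketch below is one possible shape for that, with a hypothetical class name and thresholds; a production setup would more likely emit the score as a metric to the monitoring system and alert there:

```python
from collections import deque


class RollingScoreMonitor:
    """Track the mean of the last `window` scores and flag drops below a floor."""

    def __init__(self, window=100, alert_below=0.7, min_samples=5):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below
        self.min_samples = min_samples

    def add(self, score):
        """Record a score; return True when the rolling mean is below the floor."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # Too few samples to trust the mean
        return self.mean() < self.alert_below

    def mean(self):
        return sum(self.scores) / len(self.scores)


monitor = RollingScoreMonitor(window=5, alert_below=0.7, min_samples=5)
alerts = [monitor.add(s) for s in [0.9, 0.9, 0.8, 0.9, 0.8, 0.4, 0.3, 0.4]]
print(alerts)
```

Alerting on a windowed mean rather than on single scores keeps one bad output from paging anyone, while a sustained drop still surfaces within a few requests.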

Threshold-based quality gates

Use evaluation scores to define quality gates: if a score drops below a threshold, take action (e.g., log a warning, flag the output for human review, or fall back to a backup model).

# Quality gates based on evaluation scores
class QualityGate:
    def __init__(self, thresholds):
        """
        thresholds: dict of {metric: min_acceptable_score}
        """
        self.thresholds = thresholds

    def check(self, scores):
        """
        Check if output passes all gates.
        Returns (passed: bool, failures: list of metric names)
        """
        failures = []
        for metric, threshold in self.thresholds.items():
            score = scores.get(metric)
            # A missing metric counts as a failure so a gate is never skipped silently
            if score is None or score < threshold:
                failures.append(metric)

        return len(failures) == 0, failures

# Configure gates on keys that WeightedEvaluator actually returns
gates = QualityGate({
    "relevance": 0.6,
    "safety": 0.9,
    "weighted_score": 0.7
})

# Use gates in pipeline
evaluator = WeightedEvaluator(config)
eval_scores = evaluator.evaluate("What is AI?", "AI is helpful...")
passed, failures = gates.check(eval_scores)

if not passed:
    print(f"Quality gate failed on: {failures}")
    # Take action: log, alert, fallback, etc.
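One way to turn a list of failed metrics into a single action is to map each metric to an action and pick the strictest one. The action names and the mapping below are hypothetical; substitute whatever your pipeline supports:

```python
# Hypothetical action names; map them onto whatever your pipeline supports.
ACTIONS_BY_METRIC = {
    "safety": "block_and_fallback",
    "relevance": "flag_for_review",
}
SEVERITY = ["log_warning", "flag_for_review", "block_and_fallback"]  # mildest to strictest


def choose_action(failures):
    """Return the strictest action triggered by the failed metrics, or None if none failed."""
    if not failures:
        return None
    # Metrics without an explicit mapping default to the mildest action.
    actions = [ACTIONS_BY_METRIC.get(metric, "log_warning") for metric in failures]
    return max(actions, key=SEVERITY.index)


print(choose_action(["relevance", "safety"]))
```

Keeping the mapping in data rather than in branching code makes it easier to review and change which failures block a response.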

Evaluating evaluators: bootstrapping quality signals

Your evaluators are not perfect. To gain confidence in them, periodically sample a few outputs, get human labels, and compare to your evaluator scores. This tells you:

  • Are evaluators consistent with human judgment?
  • Are thresholds calibrated correctly?
  • Should you retrain evaluators?

# Evaluator calibration test
from scipy.stats import spearmanr

def calibrate_evaluator(evaluator, labeled_samples):
    """
    labeled_samples: list of (prompt, output, human_quality_label) tuples
    Returns the Spearman rank correlation between evaluator scores and human labels,
    together with its p-value.
    """
    evaluator_scores = []
    human_scores = []

    for prompt, output, human_label in labeled_samples:
        scores = evaluator.evaluate(prompt, output)
        # Fall back to a neutral 0.5 if the evaluator returns no "overall" key
        evaluator_scores.append(scores.get("overall", 0.5))
        # Assume human_label is 0-1 (bad to good)
        human_scores.append(human_label)

    correlation, p_value = spearmanr(evaluator_scores, human_scores)
    print(f"Evaluator correlation with human labels: {correlation:.3f}, p={p_value:.4f}")

    return correlation, p_value
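Correlation says whether the evaluator ranks outputs the way humans do; it does not say where to put the gate. To check threshold calibration, one option is to sweep candidate thresholds and measure how often the pass/fail decision matches the human label. The samples below are made up, and simple agreement is only one possible criterion (precision or recall may matter more for safety gates):

```python
def agreement_at_threshold(samples, threshold):
    """Fraction of samples where (score >= threshold) matches the human pass/fail label."""
    hits = sum((score >= threshold) == bool(label) for score, label in samples)
    return hits / len(samples)


def best_threshold(samples, candidates):
    """Pick the candidate threshold with the highest agreement (first one wins ties)."""
    return max(candidates, key=lambda t: agreement_at_threshold(samples, t))


# (evaluator_score, human_label) pairs; made-up values, label 1 = acceptable.
samples = [
    (0.92, 1), (0.81, 1), (0.77, 0), (0.74, 1), (0.72, 1),
    (0.66, 0), (0.58, 0), (0.45, 0), (0.30, 0),
]
candidates = [0.5, 0.6, 0.7, 0.8]
chosen = best_threshold(samples, candidates)
print(chosen, agreement_at_threshold(samples, chosen))
```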

Key Takeaways

  • Online evaluation scores outputs in real time, so it surfaces quality regressions far sooner than offline evaluation.
  • Use multi-tier evaluators: fast rule-based checks first, then lightweight neural, then heavy evaluators if budget allows.
  • Aggregate multiple metrics into a weighted composite score; allow dynamic weight adjustments.
  • Integrate evaluation asynchronously into production pipelines to avoid latency overhead.
  • Define quality gates (thresholds) and take action when outputs fail them.

Frequently Asked Questions

Should evaluator latency be included in my SLA?

No; run evaluators async in the background. Your SLA should cover LLM inference only. Evaluation happens after the user has received the output.

Can I use GPT-4 as an evaluator, or is it too slow?

GPT-4 is powerful but slow (~3–5 seconds). Use it for offline evaluation or spot-checks, not real-time production evaluation. For production, stick to lighter evaluators (embeddings, reward models).
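For spot-checks, a common pattern is to send only a small, deterministic sample of traffic to the heavy evaluator. The sketch below hashes a request ID to decide; the function name, the ID format, and the sample rate are assumptions:

```python
import hashlib


def selected_for_heavy_eval(request_id, sample_rate=0.01):
    """Deterministically select roughly `sample_rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


picked = [
    rid for rid in (f"req-{i}" for i in range(10_000))
    if selected_for_heavy_eval(rid, 0.05)
]
print(f"{len(picked)} of 10000 requests selected")
```

Hashing instead of calling a random number generator means the same request is always in or out of the sample, which makes spot-check results reproducible across retries and services.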

How often should I re-calibrate my evaluators?

Monthly or quarterly, if you have enough labeled data. More frequent calibration (weekly) helps catch evaluator drift. Less frequent calibration (yearly) is risky, because drift can go unnoticed for months.

What if different user segments have different quality preferences?

Maintain separate evaluators and thresholds per segment. Or parameterize your evaluator weights by segment and adjust at inference time.
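The second option, parameterizing by segment, might look like the sketch below: a default profile plus per-segment overrides, looked up at request time. The segment names, weights, and thresholds are all hypothetical:

```python
DEFAULT_PROFILE = {
    "weights": {"relevance": 0.5, "safety": 0.3, "conciseness": 0.2},
    "min_score": 0.7,
}

# Hypothetical segments; only the fields that differ from the default are listed.
SEGMENT_OVERRIDES = {
    "enterprise": {"min_score": 0.85},
    "mobile": {"weights": {"relevance": 0.4, "safety": 0.2, "conciseness": 0.4}},
}


def profile_for(segment):
    """Merge a segment's overrides onto the default profile."""
    return {**DEFAULT_PROFILE, **SEGMENT_OVERRIDES.get(segment, {})}


def passes(metric_scores, segment):
    """Score with the segment's weights and compare against its threshold."""
    profile = profile_for(segment)
    weights = profile["weights"]
    total = sum(weights.values())
    score = sum(metric_scores[m] * w for m, w in weights.items()) / total
    return score >= profile["min_score"]


scores = {"relevance": 0.9, "safety": 1.0, "conciseness": 0.3}
for segment in ("default", "enterprise", "mobile"):
    print(segment, passes(scores, segment))
```

The same metric scores pass for the default segment but fail for the stricter enterprise threshold and for the mobile weighting that penalizes verbosity more heavily.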

Can I combine human and automated evaluation?

Yes. Automated evaluation flags borderline outputs; humans label them. Use human labels to retrain/improve evaluators. This is the loop-closure mechanism.
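A minimal sketch of that routing, assuming a single composite score in [0, 1] and two hypothetical cut-offs that define the borderline band:

```python
def route(score, reject_below=0.4, accept_above=0.8):
    """Auto-decide clear cases; send the uncertain middle band to human review."""
    if score < reject_below:
        return "auto_reject"
    if score > accept_above:
        return "auto_accept"
    return "human_review"


def triage(scored_outputs):
    """Split (output_id, score) pairs into the three routes."""
    routed = {"auto_accept": [], "human_review": [], "auto_reject": []}
    for output_id, score in scored_outputs:
        routed[route(score)].append(output_id)
    return routed


routed = triage([("a", 0.95), ("b", 0.62), ("c", 0.41), ("d", 0.12)])
print(routed)
```

Narrowing or widening the band trades reviewer workload against how many uncertain outputs are decided automatically; the labels collected from the review queue can then serve as calibration data.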