Evaluating Fine-Tuned Models: Benchmarks and Metrics
Evaluating a fine-tuned LoRA model requires moving beyond training loss to measure downstream task performance. Different tasks demand different metrics: text classification uses accuracy and F1-score, generation tasks use BLEU, ROUGE, and METEOR, and dialogue or open-ended tasks use embedding-based or learned metrics such as BERTScore and BLEURT. This guide covers metric selection, benchmark datasets, comparison strategies (LoRA vs. full fine-tuning vs. base model), and practical evaluation workflows, so you can measure adaptation quality and detect overfitting.
Key Evaluation Metrics by Task
Task-agnostic metrics (useful across all tasks):
| Metric | Definition | Use Case | Good Range |
|---|---|---|---|
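| Perplexity | exp(mean(cross_entropy)) | Language modeling | Lower is better; <50 is a rough target for domain-adapted models, and values are only comparable between models that share a tokenizer |
| Loss (validation) | Cross-entropy on held-out data | General | Stable or decreasing; a sustained rise while training loss keeps falling signals overfitting |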
| Accuracy | (correct predictions) / (total) | Classification | Task-dependent; 80–95% typical |
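The perplexity definition in the table can be checked by hand. The following is a minimal sketch, assuming you already have per-token cross-entropy values in nats; the `losses` list is made up for illustration and does not come from any real model.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token cross-entropy values from a held-out batch
losses = [2.1, 1.7, 3.0, 2.4]
mean_loss = sum(losses) / len(losses)
print(f"mean loss = {mean_loss:.3f}, perplexity = {perplexity(losses):.2f}")
```

Note that the mean is taken over tokens. Averaging per-batch means instead gives a slightly different number when batches contain different token counts.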
Task-specific metrics:
Classification (text/sentiment/intent):
- Accuracy: Fraction correct; simple but ignores class imbalance.
- F1-score (macro/weighted): Balances precision and recall; use for imbalanced datasets.
- Confusion matrix: Detailed breakdown; identify which classes confuse the model.
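To see why accuracy can mislead on imbalanced data, here is a small standard-library sketch of macro and weighted F1. The helper names and the toy labels are illustrative assumptions; in practice `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` does the same job.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (number of true examples)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)

# Toy imbalanced set: 8 negatives, 2 positives; the model always predicts 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(y_true, y_pred):.2f}  "
      f"weighted_f1={weighted_f1(y_true, y_pred):.2f}")
```

Accuracy looks healthy here even though the minority class is never predicted; macro F1 exposes that most clearly because it gives both classes equal weight.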
Generation (summarization, translation, code):
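- BLEU (Bilingual Evaluation Understudy): Precision-oriented n-gram overlap with one or more references; standard for translation. Range 0–1 (often reported as 0–100). As a rough guide, 0.25+ is good and 0.5+ is strong, but scores are only comparable on the same test set with the same tokenization.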
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Recall-based; common for summarization. ROUGE-1/2/L variants measure unigram/bigram/longest-common-subsequence overlap.
- METEOR: Precision + recall with synonyms; better than BLEU for evaluating paraphrases.
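The n-gram overlap behind BLEU and ROUGE-N can be sketched in a few lines. This is a simplified illustration against a single reference, assuming whitespace tokenization; full BLEU additionally combines several n-gram orders with a geometric mean and a brevity penalty, which this sketch leaves out.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """BLEU-style modified precision: matched candidate n-grams / candidate n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def ngram_recall(candidate, reference, n):
    """ROUGE-N-style recall: matched reference n-grams / reference n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "a cat was sitting on the mat".split()
for n in (1, 2):
    print(f"n={n}: precision={clipped_precision(candidate, reference, n):.3f}, "
          f"recall={ngram_recall(candidate, reference, n):.3f}")
```

The same overlap count feeds both metrics; only the denominator changes, which is why BLEU rewards concise outputs and ROUGE rewards coverage of the reference.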
Dialogue and open-ended:
- BERTScore: Contextual embedding similarity to reference; correlates well with human judgment.
- BLEURT: Learned metric built on a BERT-style model that is fine-tuned on human quality ratings; requires a checkpoint download but is highly predictive of quality.
Code generation:
- Pass@K: Probability that at least one of K generated samples for a problem passes its unit tests, averaged over problems. Standard in code evaluation (e.g., HumanEval).
- Exact match: Generated code exactly matches reference (rare; too strict).
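Pass@K is usually estimated from more samples than K to reduce variance. The sketch below uses the unbiased estimator popularized alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass. The function name and the example counts are illustrative assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples generated, 3 of them pass
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(20, 3, k):.4f}")
```

The benchmark-level score is the mean of this value over all problems.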
Evaluation Workflow: Three-Way Comparison
Always evaluate three models to contextualize results:
- Base model (zero-shot): No fine-tuning; establishes the baseline.
- LoRA fine-tuned model: Your trained adapter.
- Full fine-tuning (optional): Fully trained model; measures LoRA quality gap.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
# Load test dataset (each record is assumed to have "input" and "label" fields)
test_data = load_dataset("json", data_files="test.jsonl")["train"]
# Model IDs
base_model_id = "meta-llama/Llama-2-7b-hf"
lora_adapter_dir = "./llama2-7b-customer-support-adapter"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load LoRA model. PeftModel injects the adapter layers into base_model in place,
# so after this call base_model no longer behaves like the original checkpoint.
lora_model = PeftModel.from_pretrained(
    base_model,
    lora_adapter_dir,
    is_trainable=False,
)
lora_model.eval()
def evaluate_classification(model, test_data, tokenizer):
    """Evaluate classification accuracy and F1-score."""
    predictions, labels = [], []
    for example in test_data:
        input_text = example["input"]
        true_label = example["label"]  # Assume 0 or 1
        # Generate prediction (inputs must live on the same device as the model)
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=10)
        # Decode only the newly generated tokens; output[0] also echoes the prompt
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        pred_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        # Simple heuristic: if output contains "issue" → label 1, else 0
        pred_label = 1 if "issue" in pred_text.lower() else 0
        predictions.append(pred_label)
        labels.append(true_label)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)
    return {"accuracy": accuracy, "f1_score": f1}
# Evaluate the base model by temporarily disabling the adapter
print("Base model:")
with lora_model.disable_adapter():
    base_metrics = evaluate_classification(lora_model, test_data, tokenizer)
print(f" Accuracy: {base_metrics['accuracy']:.4f}, F1: {base_metrics['f1_score']:.4f}")
print("LoRA fine-tuned model:")
lora_metrics = evaluate_classification(lora_model, test_data, tokenizer)
print(f" Accuracy: {lora_metrics['accuracy']:.4f}, F1: {lora_metrics['f1_score']:.4f}")
# Compute improvement (the sign is printed, since the change can be negative)
improvement = (lora_metrics['accuracy'] - base_metrics['accuracy']) * 100
print(f"\nImprovement: {improvement:+.2f} percentage points")
Illustrative results (actual numbers vary widely with the task, data quality, and base model):
- Base model (zero-shot): 60–70% accuracy on domain-specific tasks.
- LoRA-tuned: 90–98% accuracy.
- Full fine-tuning: typically within 0–2 percentage points of LoRA.
Benchmark Datasets for Validation
Use established benchmarks to compare your fine-tuned model against published baselines:
Instruction-following & QA:
- MMLU (Massive Multitask Language Understanding): Roughly 15K multiple-choice questions across 57 subjects. Standard for instruction-tuned models. Download from Hugging Face: datasets.load_dataset("cais/mmlu", "all").
- AlpacaEval: 805 instruction-following prompts; compare model outputs using GPT-4 as judge. Reproducible leaderboard.
Classification:
- SST-2 (Stanford Sentiment Treebank): 67K movie reviews (binary sentiment). From GLUE benchmark.
- TREC-6: 6-class question-type classification; small but standard.
Generation:
- SQuAD (Question Answering): 100K+ QA pairs; measure exact match and F1 on answer span.
- CNN/DailyMail (Summarization): 300K news articles; use ROUGE scores.
Code:
- HumanEval: 164 hand-written Python functions with unit tests. Measure Pass@1, Pass@10, Pass@100.
- CodeXGLUE: Multi-task code understanding benchmark.
Example: Evaluate on MMLU (instruction-following):
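import re
import torch
from datasets import load_dataset
# Load MMLU
mmlu = load_dataset("cais/mmlu", "all")
# Evaluate on first 100 dev examples. select() keeps row-wise iteration;
# slicing with [:100] would return a dict of columns instead of examples.
dev_data = mmlu["dev"].select(range(100))
letters = ["A", "B", "C", "D"]
correct = 0
for example in dev_data:
    question = example["question"]
    choices = example["choices"]
    correct_answer = example["answer"]  # Index 0–3
    # Prompt model; the trailing "Answer:" cues the model to reply with a letter
    prompt = f"{question}\nA) {choices[0]}\nB) {choices[1]}\nC) {choices[2]}\nD) {choices[3]}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(lora_model.device)
    with torch.no_grad():
        output = lora_model.generate(**inputs, max_new_tokens=5)
    # Decode only the continuation; the prompt itself contains "A)", "B)", ...
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    pred_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Simple heuristic: take the first standalone A/B/C/D in the continuation
    match = re.search(r"\b([ABCD])\b", pred_text)
    pred_idx = letters.index(match.group(1)) if match else None  # None counts as wrong
    if pred_idx == correct_answer:
        correct += 1
accuracy = correct / len(dev_data)
print(f"MMLU accuracy (first 100 examples): {accuracy:.2%}")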
Generation Quality: BLEU and ROUGE
For text generation tasks (summarization, translation, paraphrase):
from rouge import Rouge  # the "rouge" package from PyPI
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
# Load reference summaries and generated summaries
references = ["The cat sat on the mat"]
generated = ["A cat was sitting on the mat"]
# ROUGE Score (this package keys its results as "rouge-1", "rouge-2", "rouge-l")
rouge = Rouge()
scores = rouge.get_scores(generated, references, avg=True)
print(f"ROUGE-1 F1: {scores['rouge-1']['f']:.4f}")
print(f"ROUGE-2 F1: {scores['rouge-2']['f']:.4f}")
print(f"ROUGE-L F1: {scores['rouge-l']['f']:.4f}")
# BLEU Score (requires tokenized input; each hypothesis gets a list of references)
reference_tokens = [[ref.split()] for ref in references]
generated_tokens = [gen.split() for gen in generated]
# Smoothing avoids a zero score when a short text has no matching 4-grams
smoothing = SmoothingFunction().method1
bleu = corpus_bleu(reference_tokens, generated_tokens, smoothing_function=smoothing)
print(f"BLEU-4: {bleu:.4f}")
Interpretation:
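- ROUGE-1 F1: 0.3–0.4 is average; 0.5+ is strong.
- BLEU-4: 0.25–0.35 is average; 0.5+ is strong.
- Treat these as rough guides: absolute scores depend on the dataset, tokenization, and number of references, so compare models on the same test set.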
Adversarial Evaluation: Out-of-Distribution Tests
Beyond standard benchmarks, test robustness to distribution shift:
# Test on examples the model likely hasn't seen.
# Assumption: every example below describes a support issue, so its label is 1.
# 1. Paraphrased inputs
paraphrase_examples = [
    "My account is inaccessible. I can't sign in.",  # Rephrasing
    "Unable to authenticate. Access denied.",  # Alternative phrasing
]
# 2. Noisy inputs
typo_examples = [
    "I can't log into my accoutn.",  # Typo
    "Wher is the password reset link?",  # Typo + grammar
]
# 3. Long/complex inputs
long_examples = [
    "I've been trying to log into my account for the past three days, but every time I enter my username and password, the system keeps telling me that my credentials are invalid, even though I'm sure I'm using the correct password that I set up when I first registered.",
]
# evaluate_classification expects records with "input" and "label" fields
all_adversarial = [
    {"input": text, "label": 1}
    for text in paraphrase_examples + typo_examples + long_examples
]
# Evaluate
adversarial_accuracy = evaluate_classification(lora_model, all_adversarial, tokenizer)
print(f"Adversarial robustness: {adversarial_accuracy['accuracy']:.2%}")
# If significantly lower than the LoRA model's test accuracy, the model is brittle
if adversarial_accuracy['accuracy'] < lora_metrics['accuracy'] * 0.9:
    print("WARNING: Model is brittle to distribution shift. Consider data augmentation.")
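One simple form of the data augmentation mentioned in the warning is to inject typos into training inputs. The sketch below is an assumption about how that might look, not a prescribed method: `swap_typo` and `augment` are hypothetical helpers that swap one pair of adjacent letters per example, similar to the "accoutn" case above.

```python
import random

def swap_typo(text, rng):
    """Swap one random pair of adjacent, distinct letters in the text."""
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not positions:
        return text  # nothing to swap (e.g. empty or single-letter input)
    i = rng.choice(positions)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def augment(examples, seed=0):
    """Return noisy copies of {"input", "label"} records; labels are unchanged."""
    rng = random.Random(seed)  # seeded so the augmented set is reproducible
    return [{"input": swap_typo(ex["input"], rng), "label": ex["label"]} for ex in examples]

clean = [{"input": "I can't log into my account.", "label": 1}]
noisy = augment(clean)
print(noisy[0]["input"])
```

Augmented copies would normally be added to the training set only, so the adversarial examples stay a fair held-out test.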
Compute Evaluation Metrics Programmatically
Wrap everything into a reusable evaluation script:
import torch
from rouge import Rouge  # the "rouge" package from PyPI
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
def comprehensive_evaluation(model, test_data, tokenizer, task="classification"):
    """Comprehensive evaluation for classification or generation."""
    predictions = []
    references = []
    for example in test_data:
        input_text = example["input"]
        reference = example["label"]  # or "reference" for generation
        # Generate prediction (inputs must live on the same device as the model)
        inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=20)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        prediction = tokenizer.decode(new_tokens, skip_special_tokens=True)
        predictions.append(prediction)
        references.append(reference)
    if task == "classification":
        # Parse predictions (example heuristic: keyword match on the generated text)
        pred_labels = [1 if "issue" in p.lower() else 0 for p in predictions]
        ref_labels = references
        metrics = {
            "accuracy": accuracy_score(ref_labels, pred_labels),
            "f1_macro": f1_score(ref_labels, pred_labels, average="macro"),
            "precision": precision_score(ref_labels, pred_labels),
            "recall": recall_score(ref_labels, pred_labels),
        }
    elif task == "generation":
        # Use ROUGE for generation (result keys are "rouge-1", "rouge-2", "rouge-l")
        rouge = Rouge()
        scores = rouge.get_scores(predictions, references, avg=True)
        metrics = {
            "rouge1": scores["rouge-1"]["f"],
            "rouge2": scores["rouge-2"]["f"],
            "rougeL": scores["rouge-l"]["f"],
        }
    else:
        raise ValueError(f"Unsupported task: {task}")
    return metrics
# Run comprehensive evaluation
print("Evaluating LoRA model...")
metrics = comprehensive_evaluation(lora_model, test_data, tokenizer, task="classification")
for metric, value in metrics.items():
    print(f"{metric}: {value:.4f}")
Key Takeaways
- Select task-specific metrics: accuracy/F1 for classification, BLEU/ROUGE for generation, Pass@K for code.
- Always compare the base model (zero-shot) with the LoRA model, and add full fine-tuning when available, to contextualize improvements.
- Use established benchmarks (MMLU, GLUE, HumanEval) for reproducible evaluation.
- Test robustness via adversarial examples (paraphrases, typos, out-of-distribution inputs).
- Monitor both performance and fairness; evaluate on demographic subgroups if applicable.
Frequently Asked Questions
My LoRA model is 95% F1 vs. full fine-tuning's 97%. Is that acceptable?
Yes. A 2-point F1 gap is excellent for LoRA. Given the 99%+ reduction in trainable parameters, this trade-off is typically worth it for production. Monitor the gap; if it exceeds 5 points, consider increasing the rank or training longer.
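The size of that reduction follows from LoRA's low-rank factorization: adapting a d_out × d_in weight with rank r trains r × (d_in + d_out) parameters instead of d_out × d_in. The arithmetic below is a sketch for a single weight matrix; the 4096 × 4096 shape and rank 8 are illustrative assumptions, and the model-wide figure depends on which modules you adapt.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (LoRA trainable params, full-matrix params, LoRA fraction) for one weight."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return lora, full, lora / full

# Illustrative values: one 4096 x 4096 projection adapted with rank 8
lora, full, fraction = lora_param_counts(4096, 4096, 8)
print(f"LoRA: {lora:,} params vs full: {full:,} params "
      f"({fraction:.2%} of the matrix, a {1 - fraction:.2%} reduction)")
```

Doubling the rank doubles the trainable parameter count, which is the cost of the "increase rank" remedy mentioned above.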
What does high validation loss but high accuracy mean?
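Loss and accuracy measure different things. Accuracy only checks whether the top prediction is right, while cross-entropy loss also penalizes low confidence in the right answer. Loss can therefore stay high when the model is correct but only barely (e.g., assigning 0.51 rather than 0.99 to the correct class), or when a few wrong predictions are made with very high confidence. For classification this is usually fine; just watch that validation loss doesn't keep rising.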
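A small worked example makes the difference concrete. It assumes a binary task and uses made-up probabilities that the model assigns to the true class; the helper names are illustrative.

```python
import math

def mean_cross_entropy(true_class_probs):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)

def binary_accuracy(true_class_probs):
    """In a binary task the prediction is correct when the true class gets > 0.5."""
    return sum(p > 0.5 for p in true_class_probs) / len(true_class_probs)

cases = {
    "confident and correct": [0.99] * 10,
    "hesitant but correct": [0.55] * 10,
    "one confident miss": [0.99] * 9 + [0.001],
}
for name, probs in cases.items():
    print(f"{name}: accuracy={binary_accuracy(probs):.2f}, "
          f"loss={mean_cross_entropy(probs):.3f}")
```

The first two cases have identical accuracy but very different loss, and a single confidently wrong prediction in the third case dominates its average loss.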
Should I evaluate on multiple seeds?
Yes, if you have time. Train 3 models with different random seeds; report mean and standard deviation. This accounts for variance in the fine-tuning process. For quick validation, one seed is acceptable.
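Reporting the spread across seeds takes only the standard library. The scores below are hypothetical placeholders for the F1 you would measure after training with each seed.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)

# Hypothetical F1 scores from three fine-tuning runs, keyed by seed
f1_by_seed = {0: 0.912, 1: 0.905, 2: 0.921}
avg, spread = summarize_runs(list(f1_by_seed.values()))
print(f"F1 = {avg:.3f} ± {spread:.3f} over {len(f1_by_seed)} seeds")
```

If the gap between two configurations is smaller than this spread, treat the comparison as inconclusive.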
How do I handle imbalanced datasets?
Report macro-averaged F1 (every class counts equally) or weighted F1 (classes weighted by support) instead of plain accuracy. Use stratified sampling for the train/val split so each split preserves the class proportions. In the loss function, apply class weights: loss_weights = [1.0, 5.0] for a 5:1 imbalance.
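Class weights like the ones above can be derived from label counts. This sketch uses inverse-frequency weighting normalized so the most common class has weight 1.0, which is one common convention among several; the helper name and toy labels are assumptions.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by (largest class count / its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in sorted(counts.items())}

# Toy dataset with a 5:1 imbalance between class 0 and class 1
labels = [0] * 500 + [1] * 100
weights = inverse_frequency_weights(labels)
loss_weights = [weights[label] for label in sorted(weights)]
print(loss_weights)
```

The resulting list can be passed to a weighted loss, for example as the `weight` tensor of `torch.nn.CrossEntropyLoss`.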
Further Reading
- Hugging Face Evaluate Library — Metric implementations and documentation.
- GLUE Benchmark — Standard evaluation suite for language understanding.
- SQuAD Leaderboard — Question-answering benchmark and evaluation scripts.
- BLEU and Beyond: A Tutorial on Machine Translation Evaluation — Detailed guide to generation metrics.