
Building Effective Feedback Loops: Closing the Production Loop

A feedback loop connects production observations back to development, enabling continuous improvement. Without feedback loops, you monitor issues but never act on them systematically. This article covers how to collect user feedback (explicit and implicit), correlate it with quality metrics, and use it to retrain models and refine prompts. Done well, feedback loops transform your LLM system from static to continuously improving.

Why feedback loops matter

Monitoring detects problems; feedback loops fix them. Suppose your anomaly detector alerts on a coherence drop, and your investigation traces it to a prompt change that broke a use case. Without a structured feedback loop, that insight stays in an email or a ticket; with one, it feeds back into your evaluators and your prompt. Over weeks, these small improvements accumulate into significant quality gains.

Feedback loops are especially critical for LLMs because:

  • User preferences are diverse: What's "good" varies by user, domain, and use case. Feedback reveals this diversity.
  • Edge cases are common: Your LLM will encounter queries you never anticipated in offline testing. Feedback reveals edge cases.
  • Drift is inevitable: User distributions shift, and models degrade. Feedback lets you adapt quickly.

Collecting user feedback: explicit and implicit

Explicit feedback requires user action: thumbs-up/thumbs-down, star ratings, text comments.

Implicit feedback is inferred from user behavior: did they use the output, edit it, or abandon it?

# Feedback collection interface
# Assumes `db` is a storage client exposing insert(table, record) and query(table, filters).
from datetime import datetime, timezone
from enum import Enum

class FeedbackType(Enum):
    EXPLICIT_THUMBS_UP = "explicit_thumbs_up"
    EXPLICIT_THUMBS_DOWN = "explicit_thumbs_down"
    EXPLICIT_RATING = "explicit_rating"      # 1-5 stars
    EXPLICIT_TEXT = "explicit_text_comment"
    IMPLICIT_USAGE = "implicit_usage"        # User used the output
    IMPLICIT_EDIT = "implicit_edit"          # User edited the output
    IMPLICIT_DISCARD = "implicit_discard"    # User didn't use the output

# Implicit signals reported under "usage_indicators" in the summary
IMPLICIT_TYPES = {
    FeedbackType.IMPLICIT_USAGE.value,
    FeedbackType.IMPLICIT_EDIT.value,
    FeedbackType.IMPLICIT_DISCARD.value,
}

class FeedbackCollector:
    def __init__(self, db):
        self.db = db

    def record_feedback(self, request_id, feedback_type, value, user_id=None, timestamp=None):
        """
        Record a feedback event.
        value: varies by type (True/False for thumbs, 1-5 for rating, text for comment).
        """
        feedback_record = {
            "request_id": request_id,
            "feedback_type": feedback_type.value,
            "value": value,
            "user_id": user_id,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": timestamp or datetime.now(timezone.utc).isoformat()
        }
        self.db.insert("feedback_events", feedback_record)

    def get_feedback_summary(self, request_id):
        """
        Aggregate feedback for a single request.
        """
        feedback_events = self.db.query("feedback_events", {"request_id": request_id})

        summary = {
            "thumbs_up": 0,
            "thumbs_down": 0,
            "avg_rating": None,
            "comments": [],
            "usage_indicators": {}
        }

        ratings = []
        for event in feedback_events:
            feedback_type = event["feedback_type"]
            if feedback_type == FeedbackType.EXPLICIT_THUMBS_UP.value:
                summary["thumbs_up"] += 1
            elif feedback_type == FeedbackType.EXPLICIT_THUMBS_DOWN.value:
                summary["thumbs_down"] += 1
            elif feedback_type == FeedbackType.EXPLICIT_RATING.value:
                ratings.append(event["value"])
            elif feedback_type == FeedbackType.EXPLICIT_TEXT.value:
                summary["comments"].append(event["value"])
            elif feedback_type in IMPLICIT_TYPES:
                # Discards are recorded too, so abandoned outputs are not silently dropped
                summary["usage_indicators"][feedback_type] = True

        if ratings:
            summary["avg_rating"] = sum(ratings) / len(ratings)

        return summary

# Example (db is your storage client, as assumed above)
collector = FeedbackCollector(db)
collector.record_feedback("req_123", FeedbackType.EXPLICIT_THUMBS_UP, True, user_id="user_456")
collector.record_feedback("req_123", FeedbackType.EXPLICIT_TEXT, "Great summary, thanks!", user_id="user_456")
summary = collector.get_feedback_summary("req_123")
print(f"Feedback summary: {summary}")
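
One way to derive implicit feedback is to compare the text the model produced with the text the user finally kept. The sketch below is illustrative only: the function name `classify_implicit_feedback`, the `edit_threshold` of 0.95, and the choice of `difflib` similarity are assumptions, not part of any particular product. The returned labels reuse the implicit feedback type strings defined above.

```python
from difflib import SequenceMatcher

def classify_implicit_feedback(model_output, final_text, edit_threshold=0.95):
    """Map what the user finally kept to an implicit feedback label.

    final_text is None or empty when the user abandoned the output.
    Returns (label, similarity), where similarity is in [0, 1].
    """
    if not final_text:
        return "implicit_discard", 0.0

    # ratio() is 1.0 for identical strings and falls toward 0 as they diverge
    similarity = SequenceMatcher(None, model_output, final_text).ratio()

    if similarity >= edit_threshold:
        # Kept (almost) verbatim: treat as usage
        return "implicit_usage", similarity
    # Kept, but substantially rewritten: treat as an edit
    return "implicit_edit", similarity

label, similarity = classify_implicit_feedback(
    "The meeting is on Tuesday at 3pm.",
    "The meeting is on Tuesday at 3pm.",
)
print(label, similarity)
```

The similarity value is worth logging alongside the label, since a heavily edited output is a stronger negative signal than a lightly edited one.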

Correlating feedback with quality metrics

Link feedback to your automated quality metrics. If a user gives thumbs-down and your evaluator scored it high, your evaluator may be miscalibrated.

# Correlating human feedback with automated metrics
from scipy.stats import spearmanr, pearsonr

class FeedbackQualityAnalyzer:
    def __init__(self, db):
        self.db = db

    def correlate_feedback_with_metrics(self):
        """
        Compute correlation between human feedback and automated quality scores.
        Returns None when fewer than two scored pairs are available.
        """
        # Retrieve pairs of (feedback, quality_score)
        records = self.db.query(
            "eval_logs",
            filters={"feedback_collected": True}
        )

        human_ratings = []
        automated_scores = []

        for record in records:
            # Get human feedback for this request
            feedback = self.db.query("feedback_events", {"request_id": record["request_id"]})

            if not feedback:
                continue

            # Aggregate human feedback to a single score (1-5)
            human_score = None
            if any(f["feedback_type"] == "explicit_thumbs_up" for f in feedback):
                human_score = 5
            elif any(f["feedback_type"] == "explicit_thumbs_down" for f in feedback):
                human_score = 1
            else:
                ratings = [f["value"] for f in feedback if f["feedback_type"] == "explicit_rating"]
                if ratings:
                    human_score = sum(ratings) / len(ratings)

            # Skip requests that only have comments or implicit signals
            if human_score is not None:
                human_ratings.append(human_score)
                automated_scores.append(record["eval_scores"]["overall"])

        # Correlation is undefined for fewer than two pairs
        if len(human_ratings) < 2:
            return None

        # Compute correlations (both return NaN if either input is constant)
        pearson_r, pearson_p = pearsonr(human_ratings, automated_scores)
        spearman_r, spearman_p = spearmanr(human_ratings, automated_scores)

        return {
            "pearson_r": pearson_r,
            "pearson_p": pearson_p,
            "spearman_r": spearman_r,
            "spearman_p": spearman_p,
            "n_samples": len(human_ratings)
        }

# Example
analyzer = FeedbackQualityAnalyzer(db)
correlation = analyzer.correlate_feedback_with_metrics()
print(f"Correlation: {correlation}")
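
A correlation coefficient summarizes the overall relationship but does not say in which direction the evaluator errs. As a complementary sketch, the function below tallies agreement and the two kinds of disagreement between thumbs feedback and a thresholded evaluator score. The name `agreement_report` and the input shape (pairs of a boolean and a score) are assumptions for illustration; the 0.7 cut-off is an arbitrary example threshold that you would tune for your own evaluator.

```python
def agreement_report(pairs, eval_threshold=0.7):
    """Tally agreement between human verdicts and thresholded evaluator scores.

    pairs: iterable of (human_says_good: bool, eval_score: float)
    """
    counts = {
        "both_good": 0,
        "both_bad": 0,
        "eval_only_good": 0,   # evaluator too lenient on these outputs
        "human_only_good": 0,  # evaluator too strict on these outputs
    }
    for human_good, score in pairs:
        eval_good = score > eval_threshold
        if human_good and eval_good:
            counts["both_good"] += 1
        elif not human_good and not eval_good:
            counts["both_bad"] += 1
        elif eval_good:
            counts["eval_only_good"] += 1
        else:
            counts["human_only_good"] += 1

    total = sum(counts.values())
    agreement = (counts["both_good"] + counts["both_bad"]) / total if total else None
    return {**counts, "n": total, "agreement_rate": agreement}

report = agreement_report([(True, 0.9), (False, 0.2), (False, 0.85), (True, 0.4)])
print(report)
```

A high `eval_only_good` count suggests the evaluator is too lenient; a high `human_only_good` count suggests it penalizes outputs that users actually find useful.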

Closing the loop: feedback-driven retraining

Once you've identified issues via feedback, act on them at three levels:

  1. Evaluators: If human feedback consistently disagrees with your evaluator, the evaluator is likely miscalibrated. Retrain it on feedback labels.
  2. Models: If many users give negative feedback on a particular query type, fine-tune your LLM on better examples.
  3. Prompts: If a system prompt change caused degradation (detected via feedback), revert it or refine it.

# Feedback-driven evaluator retraining
class EvaluatorRetrainer:
    def __init__(self, db):
        self.db = db

    @staticmethod
    def _human_polarity(feedback_record):
        """
        Return True/False for clearly positive/negative feedback, None if ambiguous.
        """
        feedback_type = feedback_record["feedback_type"]
        if feedback_type == "explicit_thumbs_up":
            return True
        if feedback_type == "explicit_thumbs_down":
            return False
        if feedback_type == "explicit_rating":
            # Assumption: 4-5 stars is positive, 1-2 is negative, 3 is neutral
            if feedback_record["value"] >= 4:
                return True
            if feedback_record["value"] <= 2:
                return False
        # Comments and implicit signals carry no clear label here
        return None

    def identify_miscalibrated_outputs(self, feedback_sample_size=100):
        """
        Find outputs where human feedback contradicts evaluator scores.
        """
        # Retrieve a sample of feedback events
        feedback_records = self.db.query("feedback_events", {}, limit=feedback_sample_size)

        miscalibrated = []

        for feedback_record in feedback_records:
            human_says_good = self._human_polarity(feedback_record)
            if human_says_good is None:
                continue

            # Get the eval score for this request
            eval_record = self.db.query(
                "eval_logs",
                {"request_id": feedback_record["request_id"]}
            )

            if not eval_record:
                continue

            eval_score = eval_record[0]["eval_scores"]["overall"]

            # Check for disagreement
            evaluator_predicts_good = eval_score > 0.7

            if evaluator_predicts_good != human_says_good:
                miscalibrated.append({
                    "request_id": feedback_record["request_id"],
                    "eval_score": eval_score,
                    "human_says_good": human_says_good,
                    "output": eval_record[0]["output"],
                    "prompt": eval_record[0]["prompt"]
                })

        return miscalibrated

    def retrain_evaluator(self, miscalibrated_samples):
        """
        Retrain evaluator on miscalibrated samples.
        """
        from transformers import AutoModelForSequenceClassification, Trainer

        # Prepare training data: (prompt, output) pairs labeled by the human verdict
        prompts = [s["prompt"] for s in miscalibrated_samples]
        outputs = [s["output"] for s in miscalibrated_samples]
        labels = [1 if s["human_says_good"] else 0 for s in miscalibrated_samples]

        # Load and fine-tune model
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

        # (Trainer setup and training omitted for brevity)
        # trainer = Trainer(model, training_args, train_dataset=...)
        # trainer.train()

        return model
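
For the prompt case in the list above, the decision to revert or keep a change can be framed as comparing the negative-feedback rate under the old and new prompt. The following is a minimal sketch using a two-proportion z-test with only the standard library. The function name `compare_negative_rates` and the counts in the example are hypothetical, and the test assumes feedback events are independent and reasonably numerous.

```python
import math

def compare_negative_rates(neg_a, total_a, neg_b, total_b):
    """Two-proportion z-test on negative-feedback rates.

    A is the old prompt, B is the new prompt. A positive z means
    the new prompt receives negative feedback more often.
    """
    rate_a = neg_a / total_a
    rate_b = neg_b / total_b

    # Pooled rate under the null hypothesis that both prompts behave the same
    pooled = (neg_a + neg_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))

    if std_err == 0:
        # No variation at all (every event negative, or none): nothing to test
        return {"rate_a": rate_a, "rate_b": rate_b, "z": 0.0, "p_value": 1.0}

    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"rate_a": rate_a, "rate_b": rate_b, "z": z, "p_value": p_value}

# Hypothetical counts: 50 of 1000 thumbs-down before, 100 of 1000 after
result = compare_negative_rates(50, 1000, 100, 1000)
print(result)
```

A small p-value with a positive `z` is evidence that the new prompt made things worse, which supports reverting or refining it.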

Handling feedback biases

Feedback from users is biased:

  • Selection bias: Users who leave feedback are not representative of all users.
  • Positivity bias: Users rarely report small issues, so a lack of negative feedback can overstate quality; major failures are reported far more often than minor ones.
  • Demographic bias: Some user groups may provide more feedback than others.

Account for these when using feedback:

# Adjusting for feedback bias
class BiasCorrectedFeedbackAnalyzer:
    def __init__(self, db):
        self.db = db

    def compute_feedback_rate_by_cohort(self):
        """
        Measure who provides feedback; identify underrepresented cohorts.
        """
        # Get all requests with cohort info
        all_requests = self.db.query("eval_logs", {})

        # Count feedback by cohort (requests without a cohort are grouped under None)
        feedback_by_cohort = {}
        for request in all_requests:
            cohort = request.get("metadata", {}).get("user_cohort")
            if cohort not in feedback_by_cohort:
                feedback_by_cohort[cohort] = {"total": 0, "with_feedback": 0}

            feedback_by_cohort[cohort]["total"] += 1

            # Check if feedback exists
            feedback = self.db.query(
                "feedback_events",
                {"request_id": request["request_id"]}
            )
            if feedback:
                feedback_by_cohort[cohort]["with_feedback"] += 1

        # Compute feedback rates (total is at least 1 for every cohort seen)
        feedback_rates = {
            cohort: data["with_feedback"] / data["total"]
            for cohort, data in feedback_by_cohort.items()
        }

        return feedback_rates

    def weight_feedback_for_balance(self, feedback_records, feedback_rates):
        """
        Up-weight feedback from underrepresented cohorts.
        Assumes each feedback record carries a "user_cohort" field.
        """
        weighted_feedback = []

        for record in feedback_records:
            cohort = record.get("user_cohort")
            # Fall back to a neutral rate of 0.5 for cohorts we have no rate for
            rate = feedback_rates.get(cohort, 0.5)

            # Up-weight if underrepresented (low feedback rate)
            weight = 1.0 / (rate + 0.01)  # Avoid division by zero

            weighted_feedback.append({
                **record,
                "weight": weight
            })

        return weighted_feedback
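
To see what the weighting does to an aggregate, the sketch below computes a weighted positive-feedback rate from records that each carry a cohort and a boolean verdict. The function name `weighted_positive_rate`, the `positive` field, and the example cohorts are hypothetical. Unlike the class above, it clamps the rate with a floor instead of adding 0.01, which is a design choice of this sketch rather than a recommendation.

```python
def weighted_positive_rate(feedback_records, feedback_rates, floor=0.01):
    """Inverse-rate weighted share of positive feedback.

    feedback_records: dicts with "user_cohort" and a boolean "positive"
    feedback_rates: cohort -> fraction of that cohort's requests with feedback
    """
    weighted_positive = 0.0
    total_weight = 0.0
    for record in feedback_records:
        rate = feedback_rates.get(record["user_cohort"])
        if rate is None:
            continue  # No rate known for this cohort: leave it out
        # Cohorts that rarely leave feedback count more per event
        weight = 1.0 / max(rate, floor)
        total_weight += weight
        if record["positive"]:
            weighted_positive += weight
    return weighted_positive / total_weight if total_weight else None

# Hypothetical data: power users respond 10x more often than casual users
rates = {"power_users": 0.20, "casual": 0.02}
records = (
    [{"user_cohort": "power_users", "positive": True}] * 6
    + [{"user_cohort": "power_users", "positive": False}] * 2
    + [{"user_cohort": "casual", "positive": False}] * 2
)
unweighted = sum(r["positive"] for r in records) / len(records)
print(unweighted, weighted_positive_rate(records, rates))
```

In this made-up example the unweighted positive rate is 0.6, but after weighting it drops to about 0.21, because the two negative events from the rarely heard casual cohort each stand in for many silent users.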

Measuring the impact of feedback-driven changes

Track whether feedback-driven improvements actually work:

# Measuring impact of feedback-driven changes
import numpy as np
from scipy.stats import ttest_ind

class ImpactMeasurement:
    def __init__(self, db):
        self.db = db

    def measure_improvement_from_change(self, change_description, before_date, after_date):
        """
        Measure quality before and after a feedback-driven change.
        change_description: e.g., "Refined system prompt for edge cases"
        before_date: scores strictly earlier than this form the baseline
        after_date: scores from this date onward form the comparison group
        """

        # Retrieve eval scores before and after
        before_records = self.db.query(
            "eval_logs",
            {"timestamp": {"$lt": before_date}}
        )
        after_records = self.db.query(
            "eval_logs",
            {"timestamp": {"$gte": after_date}}
        )

        before_scores = [r["eval_scores"]["overall"] for r in before_records]
        after_scores = [r["eval_scores"]["overall"] for r in after_records]

        if not before_scores or not after_scores:
            raise ValueError("Need eval scores on both sides of the change")

        # Compute statistics
        before_mean = np.mean(before_scores)
        after_mean = np.mean(after_scores)
        improvement = (after_mean - before_mean) / before_mean * 100

        # Statistical test (Welch's t-test: does not assume equal variances)
        t_stat, p_value = ttest_ind(after_scores, before_scores, equal_var=False)

        return {
            "change": change_description,
            "before_mean": before_mean,
            "after_mean": after_mean,
            "improvement_pct": improvement,
            "p_value": p_value,
            "significant": p_value < 0.05
        }

# Example
measurement = ImpactMeasurement(db)
impact = measurement.measure_improvement_from_change(
    change_description="Refined system prompt",
    before_date="2026-05-15",
    after_date="2026-05-22"
)
print(f"Impact: {impact}")
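
A p-value alone does not say how large the improvement plausibly is. As a complement, the sketch below estimates a percentile bootstrap confidence interval for the difference in mean score, using only the standard library. The function name `bootstrap_mean_diff_ci`, the resample count, and the example scores are assumptions for illustration.

```python
import random
from statistics import mean

def bootstrap_mean_diff_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(after) - mean(before).

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # Seeded so results are reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each period independently, with replacement
        resampled_before = rng.choices(before, k=len(before))
        resampled_after = rng.choices(after, k=len(after))
        diffs.append(mean(resampled_after) - mean(resampled_before))

    diffs.sort()
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(after) - mean(before), diffs[lower_idx], diffs[upper_idx]

# Hypothetical eval scores before and after a prompt change
before_scores = [0.60, 0.62, 0.58, 0.61, 0.59] * 10
after_scores = [0.70, 0.72, 0.68, 0.71, 0.69] * 10
point, lower, upper = bootstrap_mean_diff_ci(before_scores, after_scores)
print(f"diff={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

If the interval excludes zero, the change is unlikely to be noise; its width indicates how precisely the effect has been measured.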

Key Takeaways

  • Feedback loops close the gap between monitoring and improvement: collect feedback, correlate with metrics, retrain.
  • Collect both explicit (thumbs-up/down, comments) and implicit (usage, edits) feedback; both provide signal.
  • Correlate human feedback with automated metrics to identify evaluator miscalibration.
  • Retrain evaluators, models, and prompts based on feedback; measure the impact of changes.
  • Account for feedback biases (selection, positivity) by weighting underrepresented cohorts.

Frequently Asked Questions

How much explicit feedback should I collect?

Aim for 5-10% of requests. Most users won't provide feedback; set a low bar (single thumbs-up/down button). More detailed feedback (ratings, comments) is rare but valuable.

Should I retrain my model on every batch of feedback, or accumulate?

Accumulate feedback for 1-4 weeks, then retrain. Frequent retraining can be destabilizing; infrequent retraining misses recent issues. Weekly is a good cadence.
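
The cadence described here can be expressed as a simple trigger. The sketch below is one possible policy, not a prescribed one: the function name `should_retrain` and the `min_samples` default of 500 are hypothetical, while the one-week minimum and four-week maximum mirror the range given above.

```python
from datetime import datetime, timedelta

def should_retrain(last_retrain, now, new_feedback_count,
                   min_interval=timedelta(weeks=1),
                   max_interval=timedelta(weeks=4),
                   min_samples=500):
    """Decide whether enough time and feedback have accumulated to retrain."""
    elapsed = now - last_retrain
    if elapsed < min_interval:
        return False  # Too soon: frequent retraining can be destabilizing
    if new_feedback_count >= min_samples:
        return True   # Enough new labels since the last run
    # Few labels, but do not wait past the maximum interval if there is any new data
    return elapsed >= max_interval and new_feedback_count > 0

last = datetime(2026, 5, 1)
print(should_retrain(last, datetime(2026, 5, 9), new_feedback_count=600))
```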

What if feedback is contradictory (some users like the output, others dislike it)?

That's normal and informative. It often means the output is good for some use cases/cohorts and bad for others. Segment and analyze by cohort.
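
Segmenting can be as simple as computing the thumbs-up rate per cohort. This is a minimal sketch assuming each feedback record carries a `user_cohort` field, which the logging shown earlier would need to be extended to store; the function name `positive_rate_by_cohort` and the `min_samples` cut-off are hypothetical.

```python
from collections import defaultdict

def positive_rate_by_cohort(feedback_records, min_samples=5):
    """Thumbs-up rate per cohort, skipping cohorts with too few votes."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [thumbs_up, total_votes]
    for record in feedback_records:
        feedback_type = record["feedback_type"]
        if feedback_type not in ("explicit_thumbs_up", "explicit_thumbs_down"):
            continue  # Only thumbs votes are counted in this sketch
        cohort = record.get("user_cohort", "unknown")
        counts[cohort][1] += 1
        if feedback_type == "explicit_thumbs_up":
            counts[cohort][0] += 1

    return {
        cohort: up / total
        for cohort, (up, total) in counts.items()
        if total >= min_samples
    }

# Hypothetical records: the same feature splits two cohorts
records = (
    [{"feedback_type": "explicit_thumbs_up", "user_cohort": "legal"}] * 1
    + [{"feedback_type": "explicit_thumbs_down", "user_cohort": "legal"}] * 4
    + [{"feedback_type": "explicit_thumbs_up", "user_cohort": "marketing"}] * 4
    + [{"feedback_type": "explicit_thumbs_down", "user_cohort": "marketing"}] * 1
)
print(positive_rate_by_cohort(records))
```

An overall rate of 50% in this made-up data hides a 20% rate for one cohort and 80% for the other, which is exactly the pattern that contradictory feedback often indicates.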

How do I incentivize users to provide feedback?

Show that their feedback drives improvements (e.g., "Your feedback led us to improve X"). Gamify (badges, streaks). For enterprise customers, track feedback metrics in SLAs.

Can I use feedback to detect adversarial attacks?

Partially. Adversarial outputs may get negative feedback, but not all negative feedback indicates attacks. Combine feedback signals with semantic anomaly detection.
