Human Review Workflows: Quality Control for Extracted Data
Human review workflows route extracted data to humans for verification, correction, and approval before downstream processing. Without human review, even small extraction errors cascade: a wrong vendor name blocks payment, a wrong total breaks accounting reconciliation. The challenge is designing review workflows that are efficient (reviewers don't waste time on obviously correct extractions) and effective (real errors are caught before they cause damage).
I've built human review systems for financial documents, and the key insight is that automation should handle the easy cases (high-confidence, clean extractions) while humans focus on the ambiguous, high-risk ones. In my experience, a well-designed workflow can reduce manual review to 5-10% of documents while catching 99%+ of the errors in that reviewed set.
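The split between automated handling and human review can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not a definitive implementation: the `Extraction` type, the `route_document` name, the 0.90 confidence threshold, and the 10,000 high-value cutoff are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    document_id: str
    confidence: float  # overall extraction confidence in [0, 1]
    total: float       # extracted invoice total


# Assumed values for illustration; tune them against your own data.
AUTO_APPROVE_CONFIDENCE = 0.90
HIGH_RISK_TOTAL = 10_000


def route_document(extraction: Extraction) -> str:
    """Return "auto_approve" or "human_review" for one extraction."""
    # High-risk documents go to a human regardless of confidence.
    if extraction.total >= HIGH_RISK_TOTAL:
        return "human_review"
    # Clean, high-confidence extractions skip review.
    if extraction.confidence >= AUTO_APPROVE_CONFIDENCE:
        return "auto_approve"
    return "human_review"


routes = {
    e.document_id: route_document(e)
    for e in [
        Extraction("doc_a", confidence=0.97, total=420.0),
        Extraction("doc_b", confidence=0.97, total=25_000.0),
        Extraction("doc_c", confidence=0.62, total=420.0),
    ]
}
print(routes)
```

Checking risk before confidence means a confident extraction on a large invoice still gets a second pair of eyes.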
Workflow Architecture
A complete human review system has several components:
1. Review Queuing System
Route documents to reviewers based on priority and confidence:
from dataclasses import dataclass
from enum import Enum
from datetime import datetime, timedelta

class ReviewPriority(Enum):
    URGENT = 1
    HIGH = 2
    NORMAL = 3
    LOW = 4

class ReviewStatus(Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    APPROVED = "approved"
    REJECTED = "rejected"
    ESCALATED = "escalated"

@dataclass
class ReviewTask:
    document_id: str
    extracted_data: dict
    confidence_score: float
    priority: ReviewPriority
    created_at: datetime
    reviewer_id: str = None
    status: ReviewStatus = ReviewStatus.QUEUED
    notes: str = None
    corrections: dict = None
    approved_at: datetime = None
class ReviewQueue:
    def __init__(self):
        self.queue = []

    def add_task(self, task: ReviewTask):
        """Add a document to the review queue."""
        self.queue.append(task)
        self.queue.sort(
            key=lambda t: (
                t.priority.value,
                -t.created_at.timestamp(),  # Newer first within priority
                t.confidence_score  # Lower confidence first
            )
        )

    def get_next_for_reviewer(self, reviewer_id: str, limit: int = 5) -> list[ReviewTask]:
        """Get next documents for a specific reviewer."""
        available = [
            t for t in self.queue
            if t.status == ReviewStatus.QUEUED
        ]
        # Assign to reviewer and return
        for task in available[:limit]:
            task.reviewer_id = reviewer_id
            task.status = ReviewStatus.IN_PROGRESS
        return available[:limit]

    def submit_review(self, task_id: str, is_approved: bool,
                      corrections: dict = None, notes: str = None):
        """Submit a reviewer's decision."""
        task = next((t for t in self.queue if t.document_id == task_id), None)
        if not task:
            return False
        task.status = ReviewStatus.APPROVED if is_approved else ReviewStatus.REJECTED
        task.corrections = corrections
        task.notes = notes
        task.approved_at = datetime.now()
        return True
# Example usage
queue = ReviewQueue()
# Add low-confidence documents for review
for doc_id, extracted_data, confidence in [
    ("doc_001", {"vendor": "ACME", "total": 1000}, 0.65),
    ("doc_002", {"vendor": "XYZ Corp", "total": 5000}, 0.55)
]:
    task = ReviewTask(
        document_id=doc_id,
        extracted_data=extracted_data,
        confidence_score=confidence,
        priority=ReviewPriority.HIGH if confidence < 0.70 else ReviewPriority.NORMAL,
        created_at=datetime.now()  # Required field: ReviewTask has no default for it
    )
    queue.add_task(task)
# Get tasks for a reviewer
reviewer_tasks = queue.get_next_for_reviewer("reviewer_001", limit=5)
print(f"Reviewer has {len(reviewer_tasks)} tasks")
2. Review Interface Design
A minimal review interface shows the extracted data alongside the original document and lets the reviewer make corrections:
@dataclass
class ReviewInterface:
    """Data structure for a human review interface."""
    document_id: str
    document_image_url: str
    extracted_data: dict
    confidence_breakdown: dict  # Field-level confidence scores
    extraction_reasoning: dict  # Why the model extracted each field

    def to_json(self) -> dict:
        """Convert to JSON for frontend display."""
        return {
            "document_id": self.document_id,
            "document_image_url": self.document_image_url,
            "extracted_data": self.extracted_data,
            "confidence_breakdown": self.confidence_breakdown,
            "extraction_reasoning": self.extraction_reasoning,
            "editable_fields": list(self.extracted_data.keys()),
            "hints": self._generate_hints()
        }

    def _generate_hints(self) -> dict:
        """Generate hints for reviewers based on low-confidence fields."""
        hints = {}
        for field, confidence in self.confidence_breakdown.items():
            if confidence < 0.70:
                hints[field] = "Low confidence: double-check this field carefully"
        return hints
def create_review_interface(task: ReviewTask) -> ReviewInterface:
    """Create a review interface for a single task."""
    return ReviewInterface(
        document_id=task.document_id,
        document_image_url=f"/documents/{task.document_id}/image.jpg",
        extracted_data=task.extracted_data,
        # Placeholder values; in practice these come from the extraction step
        confidence_breakdown={
            "vendor": 0.85,
            "total": 0.72,
            "invoice_date": 0.90
        },
        extraction_reasoning={
            "vendor": "Extracted from document header (large bold text)",
            "total": "Sum of line items may be more reliable",
            "invoice_date": "Clearly labeled 'Invoice Date'"
        }
    )
3. Approval / Rejection Decision
The reviewer records one of four decisions: approve, reject, request more information, or escalate:
class ReviewDecision:
    """Record a reviewer's decision."""
    APPROVE = "approved"
    REJECT = "rejected"
    REQUEST_INFO = "request_info"
    ESCALATE = "escalate"

    def __init__(self, decision_type: str, corrections: dict = None,
                 reason: str = None):
        self.decision_type = decision_type
        self.corrections = corrections or {}
        self.reason = reason
        self.timestamp = datetime.now()

def record_review_decision(task_id: str, decision: ReviewDecision,
                           reviewer_id: str) -> bool:
    """Record a reviewer's decision; return True only if the document was approved."""
    if decision.decision_type == ReviewDecision.APPROVE:
        # Approved: write to database or send to downstream system
        print(f"Document {task_id} approved by {reviewer_id}")
        return True
    elif decision.decision_type == ReviewDecision.REJECT:
        # Rejected: route back to re-extraction or escalation
        print(f"Document {task_id} rejected: {decision.reason}")
        return False
    elif decision.decision_type == ReviewDecision.REQUEST_INFO:
        # Request additional information from data source
        print(f"Requesting info for {task_id}: {decision.reason}")
        return False
    elif decision.decision_type == ReviewDecision.ESCALATE:
        # Escalate to senior reviewer
        print(f"Document {task_id} escalated for review")
        return False
    # Fail loudly instead of silently returning None for an unknown type
    raise ValueError(f"Unknown decision type: {decision.decision_type}")
Feedback Loops for Continuous Improvement
Track reviewer corrections to improve extraction prompts:
class ExtractionFeedback:
    """Track corrections made by human reviewers."""
    def __init__(self):
        self.feedback_log = []

    def log_correction(self, document_id: str, field_name: str,
                       extracted_value: str, corrected_value: str,
                       reason: str, reviewer_id: str):
        """Log a field correction."""
        self.feedback_log.append({
            "document_id": document_id,
            "field": field_name,
            "extracted": extracted_value,
            "corrected": corrected_value,
            "reason": reason,
            "reviewer_id": reviewer_id,
            "timestamp": datetime.now().isoformat()
        })

    def get_common_errors(self, field_name: str, top_n: int = 10) -> list[dict]:
        """Identify most common extraction errors for a field."""
        field_corrections = [
            f for f in self.feedback_log if f["field"] == field_name
        ]
        # Count error patterns
        error_patterns = {}
        for correction in field_corrections:
            key = (correction["extracted"], correction["corrected"])
            error_patterns[key] = error_patterns.get(key, 0) + 1
        # Sort by frequency
        sorted_errors = sorted(
            error_patterns.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return [
            {
                "extracted_value": pattern[0],
                "corrected_value": pattern[1],
                "frequency": count,
                "error_type": self._classify_error(pattern[0], pattern[1])
            }
            for (pattern, count) in sorted_errors[:top_n]
        ]

    def _classify_error(self, extracted: str, corrected: str) -> str:
        """Classify the type of error."""
        if extracted is None or extracted == "":
            return "missing_value"
        # Compare as strings so numeric values do not break len() or indexing
        extracted, corrected = str(extracted), str(corrected)
        if extracted.lower() != corrected.lower():
            # A shared first character suggests a truncated or partial match;
            # the emptiness check avoids indexing an empty correction
            if corrected and extracted[0] == corrected[0]:
                return "partial_match"
            return "wrong_value"
        return "case_mismatch"
# Example usage
feedback = ExtractionFeedback()
# Log some corrections
feedback.log_correction(
    "doc_001", "vendor_name",
    "ACME Corp", "ACME Corporation",
    "Full company name", "reviewer_001"
)
feedback.log_correction(
    "doc_002", "vendor_name",
    "ACME Corp", "ACME Corporation",
    "Full company name", "reviewer_002"
)
# Analyze errors
vendor_errors = feedback.get_common_errors("vendor_name")
print("Top extraction errors for vendor_name:")
for error in vendor_errors:
    print(f"  {error['extracted_value']} → {error['corrected_value']} "
          f"(frequency: {error['frequency']})")
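One way to feed these error patterns back into extraction prompts is to turn the most frequent corrections into short guidance lines. The sketch below assumes input shaped like the output of `get_common_errors`; the `build_prompt_hints` helper, its `min_frequency` parameter, and the wording of the hint are hypothetical, and how the hints are added to a prompt is left open.

```python
def build_prompt_hints(common_errors: list[dict], min_frequency: int = 2) -> list[str]:
    """Turn frequent correction patterns into guidance lines for a prompt."""
    hints = []
    for error in common_errors:
        # Ignore one-off corrections; they may be noise rather than a pattern.
        if error["frequency"] < min_frequency:
            continue
        extracted = error["extracted_value"]
        corrected = error["corrected_value"]
        count = error["frequency"]
        hints.append(
            f'Reviewers changed "{extracted}" to "{corrected}" ({count} corrections).'
        )
    return hints


# Hypothetical input in the same shape get_common_errors returns.
common_errors = [
    {"extracted_value": "ACME Corp", "corrected_value": "ACME Corporation",
     "frequency": 2, "error_type": "partial_match"},
    {"extracted_value": "XYZ", "corrected_value": "XYZ Corp",
     "frequency": 1, "error_type": "partial_match"},
]
hints = build_prompt_hints(common_errors)
print(hints)
```

The frequency cutoff keeps a single reviewer's one-time fix from steering the prompt.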
Sampling and Audit Trails
For compliance, maintain audit trails of all reviewed documents:
from datetime import datetime
import hashlib
import json

class AuditTrail:
    """Maintain an audit trail of document processing."""
    def __init__(self):
        self.entries = []

    def log_extraction(self, document_id: str, extracted_data: dict,
                       model: str, timestamp: datetime):
        """Log extraction event."""
        # Serialize with sorted keys so the same data always yields the same
        # hash; default=str covers values such as dates
        serialized = json.dumps(extracted_data, sort_keys=True, default=str)
        self.entries.append({
            "event": "extraction",
            "document_id": document_id,
            "extracted_data_hash": hashlib.sha256(
                serialized.encode()
            ).hexdigest(),
            "model": model,
            "timestamp": timestamp.isoformat()
        })

    def log_review(self, document_id: str, reviewer_id: str,
                   decision: str, corrections: dict = None):
        """Log review event."""
        self.entries.append({
            "event": "review",
            "document_id": document_id,
            "reviewer_id": reviewer_id,
            "decision": decision,
            "has_corrections": corrections is not None and len(corrections) > 0,
            "timestamp": datetime.now().isoformat()
        })

    def log_approval(self, document_id: str, approved_by: str,
                     final_data_hash: str):
        """Log final approval."""
        self.entries.append({
            "event": "approval",
            "document_id": document_id,
            "approved_by": approved_by,
            "final_data_hash": final_data_hash,
            "timestamp": datetime.now().isoformat()
        })

    def get_document_history(self, document_id: str) -> list[dict]:
        """Retrieve full history for a document."""
        return [e for e in self.entries if e["document_id"] == document_id]
# Sampling for QA
import random

def sample_approved_documents(audit_trail: AuditTrail, sample_rate: float = 0.05):
    """Sample approved documents for spot-check QA."""
    approved_events = [e for e in audit_trail.entries if e["event"] == "approval"]
    # random.sample raises ValueError if asked for more items than exist
    if not approved_events:
        return []
    sample_size = max(1, int(len(approved_events) * sample_rate))
    return random.sample(approved_events, sample_size)
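A hash in the audit trail is only useful if it can be recomputed later and compared. The sketch below shows one way to do that, assuming the hash is taken over a canonical JSON serialization with sorted keys; `stable_hash` and `matches_audit_entry` are hypothetical helpers, not part of the classes above.

```python
import hashlib
import json


def stable_hash(data: dict) -> str:
    """Hash a dict so that key order does not change the digest."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_audit_entry(data: dict, recorded_hash: str) -> bool:
    """Check that `data` is what the audit trail recorded."""
    return stable_hash(data) == recorded_hash


recorded = stable_hash({"vendor": "ACME", "total": 1000})
same_data_reordered = {"total": 1000, "vendor": "ACME"}
tampered = {"vendor": "ACME", "total": 1500}

print(matches_audit_entry(same_data_reordered, recorded))  # True
print(matches_audit_entry(tampered, recorded))             # False
```

Sorting keys before hashing matters because two dicts with the same content but different insertion order would otherwise serialize differently.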
Metrics and Monitoring
Track review workflow performance:
def compute_review_metrics(queue: ReviewQueue, feedback: ExtractionFeedback) -> dict:
    """Compute metrics on the review workflow."""
    approved = [t for t in queue.queue if t.status == ReviewStatus.APPROVED]
    approved_tasks = len(approved)
    rejected_tasks = sum(
        1 for t in queue.queue if t.status == ReviewStatus.REJECTED
    )
    # Only tasks with a recorded decision count as reviewed
    reviewed_tasks = approved_tasks + rejected_tasks
    approval_rate = approved_tasks / reviewed_tasks if reviewed_tasks > 0 else 0
    # Average time from queueing to approval (includes time spent waiting)
    review_times = [
        (t.approved_at - t.created_at).total_seconds()
        for t in approved
        if t.approved_at and t.created_at
    ]
    avg_review_time = (
        sum(review_times) / len(review_times) if review_times else None
    )
    # Correction rate: field-level corrections per approved document
    corrections_made = len(feedback.feedback_log)
    correction_rate = corrections_made / approved_tasks if approved_tasks > 0 else 0
    return {
        "total_documents_reviewed": reviewed_tasks,
        "approved_count": approved_tasks,
        "rejected_count": rejected_tasks,
        "approval_rate": approval_rate,
        "avg_review_time_seconds": avg_review_time,
        "total_corrections_made": corrections_made,
        "correction_rate": correction_rate,
        "estimated_error_rate_if_no_review": 0.15,  # Placeholder
        # Corrections count errors caught in review, not errors that slipped
        # through; measure the post-review rate with spot-check QA instead
        "actual_error_rate_post_review": None  # Placeholder
    }
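Reviewer corrections measure errors caught during review, not errors that slipped through. The post-review error rate can instead be estimated from a spot-check sample of approved documents. The sketch below is one rough way to do that; the `estimate_error_rate` helper is hypothetical, and the normal-approximation interval it uses is an assumption that becomes unreliable for small samples or very low error counts.

```python
import math


def estimate_error_rate(sample_size: int, errors_found: int) -> dict:
    """Estimate the post-review error rate from a spot-check sample.

    Uses a normal-approximation 95% interval, which is rough for
    small samples or very low error counts.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rate = errors_found / sample_size
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    return {
        "error_rate": rate,
        "ci_low": max(0.0, rate - margin),
        "ci_high": min(1.0, rate + margin),
    }


# Hypothetical spot check: 4 errors found in 200 sampled approvals.
result = estimate_error_rate(sample_size=200, errors_found=4)
print(result)
```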
Key Takeaways
- Human review workflows route documents to humans for verification, catching errors before downstream processing.
- Queue documents by priority and confidence; auto-process high-confidence ones; review low-confidence and high-risk documents.
- Provide reviewers with clear interfaces: original document image, extracted data, confidence scores, and easy correction UI.
- Track corrections in a feedback loop; analyze common errors to improve extraction prompts.
- Maintain audit trails for compliance; sample approved documents for spot-check QA.
Frequently Asked Questions
What percentage of documents should go to human review?
It depends on your tolerance for errors. For financial documents (high-risk), 15-25% review is typical. For informational documents, 2-5%. Use confidence thresholds to set the review rate. If you're auto-approving 95%+ of high-risk documents, your thresholds may be too loose.
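To see how a confidence threshold translates into a review rate, you can replay it over confidence scores from a recent batch. The scores below are made up for illustration, and `review_rate` is a hypothetical helper.

```python
def review_rate(confidences: list[float], threshold: float) -> float:
    """Fraction of documents whose confidence falls below the threshold."""
    if not confidences:
        return 0.0
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)


# Hypothetical confidence scores from a recent batch.
scores = [0.99, 0.97, 0.95, 0.93, 0.91, 0.88, 0.84, 0.79, 0.66, 0.52]

for threshold in (0.70, 0.85, 0.90):
    rate = review_rate(scores, threshold)
    print(f"threshold={threshold:.2f} -> review rate {rate:.0%}")
```

Running this over real historical scores shows which threshold lands in the review range you are aiming for.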
How do I handle reviewer disagreement?
If two reviewers disagree on a correction, escalate to a senior reviewer. Track disagreement rates; high disagreement suggests the extraction is genuinely ambiguous and requires clarification (better image, more context) rather than just reviewer differences.
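Disagreement rates can be tracked per field when some documents are reviewed twice. The sketch below assumes each review is stored as a dict with `document_id`, `reviewer_id`, `field`, and `value` keys; that record shape and the `disagreement_rate` helper are hypothetical.

```python
from collections import defaultdict


def disagreement_rate(reviews: list[dict]) -> dict:
    """Per-field share of double-reviewed documents where values differ."""
    # Group submitted values by (document, field).
    values = defaultdict(list)
    for r in reviews:
        values[(r["document_id"], r["field"])].append(r["value"])

    compared = defaultdict(int)
    disagreed = defaultdict(int)
    for (_, field), vals in values.items():
        if len(vals) < 2:
            continue  # only one reviewer saw this field
        compared[field] += 1
        if len(set(vals)) > 1:
            disagreed[field] += 1

    return {field: disagreed[field] / compared[field] for field in compared}


reviews = [
    {"document_id": "doc_001", "reviewer_id": "r1", "field": "vendor", "value": "ACME Corporation"},
    {"document_id": "doc_001", "reviewer_id": "r2", "field": "vendor", "value": "ACME Corp"},
    {"document_id": "doc_002", "reviewer_id": "r1", "field": "vendor", "value": "XYZ Corp"},
    {"document_id": "doc_002", "reviewer_id": "r2", "field": "vendor", "value": "XYZ Corp"},
    {"document_id": "doc_001", "reviewer_id": "r1", "field": "total", "value": 1000},
    {"document_id": "doc_001", "reviewer_id": "r2", "field": "total", "value": 1000},
]
rates = disagreement_rate(reviews)
print(rates)
```

A field with a persistently high rate is a candidate for clearer reviewer guidelines or better source images.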
Can I automate reviewer feedback?
Partially. Common corrections (casing, formatting) can be automated. Substantive corrections (wrong value) should stay with human reviewers, but you can use them to refine extraction prompts. Run A/B tests on prompt changes using your feedback data.
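A simple way to separate the two kinds of correction is to compare values after normalization: if they match once casing, whitespace, and thousands separators are ignored, the correction was formatting-only. The `normalize` rules and the `is_formatting_only` helper below are assumptions for illustration; real data will likely need more rules (dates, currency symbols).

```python
import re


def normalize(value: str) -> str:
    """Collapse case, whitespace, and thousands separators."""
    text = value.strip().lower()
    text = re.sub(r"\s+", " ", text)
    # Drop commas used as thousands separators, e.g. "1,000" -> "1000".
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    return text


def is_formatting_only(extracted: str, corrected: str) -> bool:
    """True when a correction changed only casing or formatting."""
    return extracted != corrected and normalize(extracted) == normalize(corrected)


print(is_formatting_only("acme corp", "ACME Corp"))         # True
print(is_formatting_only("1,000.00", "1000.00"))            # True
print(is_formatting_only("ACME Corp", "ACME Corporation"))  # False
```

Corrections that pass this check are candidates for an automatic post-processing step; the rest stay in the human queue.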
What about reviewer fatigue?
Review fatigue is real. Limit reviewers to 50-100 documents per day. Vary the difficulty; mixing easy, obvious corrections with harder cases keeps reviewers engaged. Provide feedback on their accuracy and productivity.
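The daily cap and the mix of difficulty can both be applied when a reviewer's batch is assembled. The sketch below interleaves hard and easy documents and truncates at the cap; the `build_daily_batch` helper, the split into "easy" and "hard" lists, and the default cap of 75 (chosen from within the 50-100 range above) are assumptions.

```python
from itertools import zip_longest


def build_daily_batch(easy: list[str], hard: list[str], daily_cap: int = 75) -> list[str]:
    """Interleave hard and easy document IDs, capped at daily_cap."""
    mixed = []
    for hard_id, easy_id in zip_longest(hard, easy):
        if hard_id is not None:
            mixed.append(hard_id)
        if easy_id is not None:
            mixed.append(easy_id)
    return mixed[:daily_cap]


batch = build_daily_batch(
    easy=["e1", "e2", "e3"],
    hard=["h1", "h2"],
    daily_cap=4,
)
print(batch)  # ['h1', 'e1', 'h2', 'e2']
```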
How do I handle appeals if a reviewer approves something wrong?
Log all decisions with timestamps and reviewer IDs. If downstream QA catches an error, flag it and re-review with the original reviewer. Use this as a learning moment; don't penalize the reviewer but adjust thresholds if needed.