Deterministic Output Checks and Validation Rules

Deterministic output checks are the first gate in any evaluation pipeline: fast, rule-based validation that catches obvious failures before expensive metrics are computed. A deterministic check can verify that code output is syntactically valid Python, that a JSON response parses correctly, that a safety filter doesn't flag PII, or that a structured output matches a schema. These checks are cheap (milliseconds, not seconds) and reproducible (no randomness), making them ideal for continuous integration and high-volume evaluation.

Production LLM systems typically run a handful of deterministic checks on every output: format validation, schema compliance, safety guardrails, toxicity screening, and domain-specific rules (e.g., "code outputs must be executable"). This article teaches you to layer these checks into a validation harness that rejects invalid outputs fast, so that your expensive LLM-as-judge scoring and metrics are spent on answer quality rather than on basic well-formedness.

Format Validation and Schema Compliance

The first check: does the output match the expected format? If you ask a model for JSON, ensure it's valid JSON. If you ask for code, verify it parses. If you ask for a structured slot-filling task, verify all required fields are present.

import json
import ast
from typing import Any, Tuple

def validate_json_output(output: str) -> Tuple[bool, Any]:
    """
    Deterministic check: does output parse as valid JSON?
    Returns (True, parsed value) or (False, error dict).
    """
    try:
        parsed = json.loads(output)
        return True, parsed
    except json.JSONDecodeError as e:
        return False, {'error': f'Invalid JSON: {e}', 'output': output}

def validate_python_syntax(output: str) -> Tuple[bool, str]:
    """
    Deterministic check: does output parse as valid Python?
    Use for code generation tasks.
    """
    try:
        ast.parse(output)
        return True, "Syntax valid"
    except SyntaxError as e:
        return False, f'Syntax error at line {e.lineno}: {e.msg}'

def validate_schema_compliance(output: dict, schema: dict) -> Tuple[bool, list]:
    """
    Deterministic check: does output conform to a simplified schema?
    schema: dict with a 'required' list and a 'properties' mapping of
    field name -> Python type name ('str', 'int', 'list', ...).
    This is not full JSON Schema validation.
    Returns (is_valid, list of violations).
    """
    violations = []

    # Parsed JSON may be a list or scalar rather than an object
    if not isinstance(output, dict):
        return False, [f'Expected a JSON object, got {type(output).__name__}']

    # Check required fields
    for field in schema.get('required', []):
        if field not in output:
            violations.append(f'Missing required field: {field}')

    # Check field types (compares Python type names, e.g. 'str', not 'string')
    for field, expected_type in schema.get('properties', {}).items():
        if field in output:
            actual_type = type(output[field]).__name__
            if actual_type != expected_type:
                violations.append(
                    f'Field {field}: expected {expected_type}, got {actual_type}'
                )

    return len(violations) == 0, violations

Use schema validation for structured tasks: slot-filling, information extraction, or multi-step reasoning where each step must follow a defined format. This immediately catches format failures such as the model returning free text instead of the JSON you requested, or omitting a required field.
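
The type check above compares Python type names, so a schema written with JSON Schema type names ("string", "integer") would never match it. The following is a minimal sketch of one way to bridge that gap. `JSON_TYPES` and `check_types` are hypothetical names introduced here for illustration and are not part of the original code. Note that Python treats `bool` as a subclass of `int`, so booleans need explicit handling.

```python
from typing import Any, Dict, List

# Hypothetical mapping from JSON Schema type names to Python types.
JSON_TYPES: Dict[str, tuple] = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "array": (list,),
    "object": (dict,),
    "null": (type(None),),
}

def check_types(output: Dict[str, Any], properties: Dict[str, str]) -> List[str]:
    """Return type violations for the fields that are present in output."""
    violations = []
    for field, json_type in properties.items():
        if field not in output:
            continue  # missing fields are the job of the 'required' check
        value = output[field]
        expected = JSON_TYPES[json_type]
        # bool is a subclass of int, so reject True/False for numeric fields.
        bool_as_number = isinstance(value, bool) and json_type in ("integer", "number")
        if bool_as_number or not isinstance(value, expected):
            violations.append(
                f"Field {field}: expected {json_type}, got {type(value).__name__}"
            )
    return violations

slots = {"city": "Paris", "nights": True, "guests": 2}
properties = {"city": "string", "nights": "integer", "guests": "integer"}
print(check_types(slots, properties))
```

For anything beyond flat field types (nested objects, enums, ranges), a dedicated JSON Schema validator is likely a better fit than extending this sketch.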

Safety Guardrails and Content Filtering

LLM outputs can contain harmful content: instructions for dangerous activities, hateful speech, or sensitive information. Deterministic filters catch obvious violations before they reach users.

import re
from typing import Tuple

class SafetyGuardrails:
    """
    Deterministic safety checks: pattern matching + keyword detection.
    Fast; use as a first filter before expensive LLM-as-judge scoring.
    """

    def __init__(self):
        # Substring keywords: cheap, but they over-flag benign text
        # (e.g. 'bomb' also matches 'bath bomb')
        self.danger_keywords = [
            'bomb', 'poison', 'exploit', 'malware', 'ransomware', 'ddos'
        ]
        self.hate_patterns = [
            # Placeholder terms; substitute your own blocklist
            r'\b(slur_1|slur_2|slur_3)\b',
        ]
        # Word boundaries keep the digit patterns from matching inside longer numbers
        self.pii_patterns = {
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'credit_card': r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b',
            'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
        }

    def check_dangerous_content(self, output: str) -> Tuple[bool, list]:
        """
        Detect keywords associated with harmful instructions.
        Returns (is_safe, list of violations).
        """
        violations = []
        text_lower = output.lower()

        for keyword in self.danger_keywords:
            if keyword in text_lower:
                violations.append(f'Dangerous keyword detected: {keyword}')

        return len(violations) == 0, violations

    def check_hate_speech(self, output: str) -> Tuple[bool, list]:
        """Detect hate speech patterns."""
        violations = []

        for pattern in self.hate_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                violations.append('Potential hate speech detected')
                break  # One hit is enough to fail this check

        return len(violations) == 0, violations

    def check_pii(self, output: str) -> Tuple[bool, list]:
        """Detect personally identifiable information."""
        violations = []

        for pii_type, pattern in self.pii_patterns.items():
            matches = re.findall(pattern, output)
            if matches:
                # Report counts only; never log the matched PII itself
                violations.append(
                    f'Potential {pii_type} detected: {len(matches)} match(es)'
                )

        return len(violations) == 0, violations

    def all_checks(self, output: str) -> Tuple[bool, dict]:
        """Run all safety checks; return aggregate result."""
        results = {
            'dangerous': self.check_dangerous_content(output),
            'hate_speech': self.check_hate_speech(output),
            'pii': self.check_pii(output)
        }

        all_safe = all(r[0] for r in results.values())
        # Keep only the checks that failed, keyed by check name
        violations = {k: r[1] for k, r in results.items() if not r[0]}

        return all_safe, violations

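Pair deterministic safety checks with LLM-as-judge scoring for nuance. A deterministic check catches "I'll teach you to make a bomb" immediately. An LLM judge handles what keyword matching misses or misreads: harmful instructions phrased without any trigger word, or a benign mention such as "bath bomb" that a keyword filter would wrongly flag.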
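
The `credit_card` pattern above matches any 16-digit run, including order numbers and internal IDs. One common way to reduce such false positives is to apply the Luhn checksum to each candidate. The following is a minimal sketch of that idea. `luhn_valid` and `find_card_numbers` are hypothetical helper names, not part of the original class.

```python
import re
from typing import List

# Same shape as the credit_card pattern, with word boundaries added.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return bool(digits) and total % 10 == 0

def find_card_numbers(text: str) -> List[str]:
    """Return regex candidates that also pass the Luhn checksum."""
    return [m for m in CARD_PATTERN.findall(text) if luhn_valid(m)]

sample = "Order 1234-5678-1234-5678 was paid with 4111111111111111."
print(find_card_numbers(sample))
```

A checksum pass does not prove the number is a real card, only that it is plausible, so this narrows the regex rather than replacing it.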

Toxicity Detection

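Toxicity differs from safety: a response can be rude or insulting without being dangerous. Use pre-trained toxicity classifiers (Perspective API, detoxify library) as fast checks. These are learned models rather than hand-written rules, but with a pinned model version and a fixed threshold they behave like any other reproducible gate.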

from typing import Dict

def detect_toxicity(text: str) -> Dict:
    """
    Interface sketch for toxicity scoring with Perspective API (Google)
    or the detoxify library. This is a mock: the score is hardcoded to 0.0,
    so replace the placeholder before relying on the result.
    """
    # Placeholder score from 0 (non-toxic) to 1 (toxic)
    toxicity_score = 0.0

    # With the Perspective API, the call would look roughly like:
    # response = perspective_client.comments().analyze(
    #     body={
    #         'comment': {'text': text},
    #         'requestedAttributes': {'TOXICITY': {}}
    #     }
    # ).execute()
    # toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    return {
        'text': text[:100],  # Truncated for logging
        'toxicity': toxicity_score,
        'is_toxic': toxicity_score > 0.7,
        'flags': []
    }

Toxicity detection is useful for dialogue systems, customer support chatbots, and any user-facing application. Set a threshold (e.g., toxicity > 0.7 = reject) and log borderline cases for manual review.
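
The reject-and-review policy described above can be sketched as a three-way router. The 0.7 reject threshold comes from the text; the 0.5 lower bound of the review band and the name `route_by_toxicity` are assumptions made for illustration.

```python
def route_by_toxicity(score: float, reject_at: float = 0.7, review_at: float = 0.5) -> str:
    """Map a toxicity score in [0, 1] to 'reject', 'review', or 'pass'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score > reject_at:
        return "reject"   # clearly toxic: block the output
    if score >= review_at:
        return "review"   # borderline: let it through the gate but log for manual review
    return "pass"

for s in (0.9, 0.6, 0.2):
    print(s, route_by_toxicity(s))
```

Whether borderline outputs are served while awaiting review or held back is a product decision; this sketch only labels them.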

Domain-Specific Validation Rules

Beyond generic safety and format checks, implement domain-specific rules tailored to your task.

import ast
from typing import Tuple

def validate_code_output(code: str) -> Tuple[bool, list]:
    """
    Domain-specific: validate code generation output.
    Check for common failure modes in code gen.
    """
    violations = []

    # Check 1: Python syntax
    try:
        ast.parse(code)
    except SyntaxError as e:
        violations.append(f'Syntax error: {e.msg}')

    # Check 2: No hardcoded test assertions (crude substring heuristic)
    if 'assert' in code and 'test' in code.lower():
        violations.append('Output contains test assertions (check for copy-paste)')

    # Check 3: Last function definition has no colon (output likely truncated)
    if 'def ' in code and ':' not in code.split('def ')[-1]:
        violations.append('Incomplete function definition')

    # Check 4: Reasonable length (not a runaway output)
    if len(code.splitlines()) > 500:
        violations.append('Output exceeds 500 lines (possible copy-paste error)')

    return len(violations) == 0, violations

def validate_qa_output(answer: str, question: str) -> Tuple[bool, list]:
    """
    Domain-specific: validate QA system output.
    Check for common failure modes in QA.
    """
    violations = []
    normalized = answer.strip().lower()

    # Check 1: Short answer that contains the whole question (likely an echo)
    if question.strip().lower() in normalized and len(answer) < 50:
        violations.append('Answer appears to repeat the question')

    # Check 2: Answer is a bare non-response (acceptable but worth flagging)
    if normalized.rstrip('.!') in ["i don't know", 'unknown', 'n/a']:
        violations.append('Answer is a non-response (acceptable but worth noting)')

    # Check 3: Minimum length; relax this for tasks with one-word answers
    if len(answer.split()) < 3:
        violations.append('Answer is too short (< 3 words)')

    return len(violations) == 0, violations

Domain rules encode your understanding of what "obviously wrong" looks like in your specific task. A code generator that outputs code with a syntax error failed obviously. A QA system that repeats the question failed obviously. Catch these fast before spending GPU cycles on LLM-as-judge.
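
String heuristics such as searching for `def ` are brittle. For code generation, a domain rule can instead inspect the parsed syntax tree. The following is a minimal sketch, assuming that "obviously wrong" includes functions left as stubs; `find_stub_functions` is a hypothetical helper, not part of the original checks.

```python
import ast
from typing import List

def _is_placeholder(stmt: ast.stmt) -> bool:
    """True for `pass`, or a bare constant expression such as `...` or a docstring."""
    if isinstance(stmt, ast.Pass):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)

def find_stub_functions(code: str) -> List[str]:
    """Return names of functions whose body contains only placeholders."""
    tree = ast.parse(code)  # raises SyntaxError on invalid code
    stubs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if all(_is_placeholder(stmt) for stmt in node.body):
                stubs.append(node.name)
    return stubs

generated = '''
def todo():
    pass

def add(a, b):
    return a + b

def later():
    """Implement me."""
    ...
'''
print(find_stub_functions(generated))
```

Because the rule works on the syntax tree, it is unaffected by comments or string literals that happen to contain the word `def`.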

Building a Validation Harness

Combine all checks into a single validation pipeline that short-circuits on failure:

from typing import Dict, Optional

def run_validation_pipeline(
    output: str,
    task_type: str = 'qa',
    schema: Optional[dict] = None,
    question: Optional[str] = None
) -> Dict:
    """
    Run deterministic checks in order: format, safety, domain-specific.
    Short-circuits on the first failing stage.
    Return detailed results for logging and debugging.
    """
    results = {
        'passed_all': True,
        'checks': {}
    }

    # Step 1: Format validation
    if task_type == 'code':
        is_valid, msg = validate_python_syntax(output)
        results['checks']['syntax'] = {'passed': is_valid, 'message': msg}
        if not is_valid:
            results['passed_all'] = False
            return results  # Short-circuit

    elif task_type == 'json':
        is_valid, parsed = validate_json_output(output)
        results['checks']['json_parse'] = {
            'passed': is_valid,
            'message': 'JSON valid' if is_valid else parsed['error']
        }
        if not is_valid:
            results['passed_all'] = False
            return results

        if schema:
            # Validate the parsed object, not the raw string
            is_valid, violations = validate_schema_compliance(parsed, schema)
            results['checks']['schema'] = {'passed': is_valid, 'violations': violations}
            if not is_valid:
                results['passed_all'] = False
                return results

    # Step 2: Safety checks
    guardrails = SafetyGuardrails()
    is_safe, violations = guardrails.all_checks(output)
    results['checks']['safety'] = {'passed': is_safe, 'violations': violations}
    if not is_safe:
        results['passed_all'] = False
        return results

    # Step 3: Domain-specific
    if task_type == 'code':
        is_valid, violations = validate_code_output(output)
        results['checks']['domain'] = {'passed': is_valid, 'violations': violations}
    elif task_type == 'qa' and question is not None:
        # The QA rules need the original question to detect echoes
        is_valid, violations = validate_qa_output(output, question)
        results['checks']['domain'] = {'passed': is_valid, 'violations': violations}

    results['passed_all'] = all(
        check.get('passed', True) for check in results['checks'].values()
    )

    return results

Run this validation for every model output. Log each failure with the name of the check that failed, e.g. "output failed on [syntax check]". Aggregated over many outputs, these logs show which failure modes dominate.
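
Once failures are logged, a simple tally by check name shows where outputs break most often. The following is a minimal sketch that assumes result dicts shaped like the ones returned by the pipeline above (`checks` mapping a check name to a dict with a `passed` flag); `count_failures` and the sample batch are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

def count_failures(results: Iterable[Dict]) -> Counter:
    """Count how often each named check failed across a batch of results."""
    counts: Counter = Counter()
    for result in results:
        for name, check in result.get("checks", {}).items():
            if not check.get("passed", True):
                counts[name] += 1
    return counts

# Illustrative batch, not real evaluation data.
batch = [
    {"passed_all": False, "checks": {"syntax": {"passed": False}}},
    {"passed_all": True, "checks": {"syntax": {"passed": True}, "safety": {"passed": True}}},
    {"passed_all": False, "checks": {"json_parse": {"passed": True}, "schema": {"passed": False}}},
    {"passed_all": False, "checks": {"syntax": {"passed": False}}},
]

for name, n in count_failures(batch).most_common():
    print(f"{name}: {n} failure(s)")
```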

Key Takeaways

  • Deterministic checks are the first line of defense: They're fast and reproducible; use them to filter obvious failures.
  • Layer checks in order: format → safety → domain-specific. Short-circuit on first failure.
  • Domain rules encode your expertise: What looks wrong in your task space? Codify it.
  • Validation failures are debugging gold: Every failure is a signal. Log them for root-cause analysis.
  • Pair deterministic + learned: Use fast checks to filter, then apply expensive metrics to the passing set.

Frequently Asked Questions

Should all validation failures be blocking?

Not necessarily. Format failures (invalid JSON, syntax errors) are blockers, because downstream code cannot consume the output at all. Safety failures are usually blockers too (safety > usability). For domain checks, assign a severity level: "Incomplete function" might warn but not block; "function with syntax error" blocks.
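
Severity levels can be sketched as a lookup from check name to a block-or-warn decision. The check names and the `SEVERITY_BY_CHECK` table below are hypothetical; unknown checks default to blocking, which is an assumption chosen to err on the conservative side.

```python
from enum import Enum
from typing import Dict, List

class Severity(Enum):
    WARN = "warn"
    BLOCK = "block"

# Hypothetical severity table; adjust to your own check names.
SEVERITY_BY_CHECK: Dict[str, Severity] = {
    "json_parse": Severity.BLOCK,
    "syntax": Severity.BLOCK,
    "safety": Severity.BLOCK,
    "incomplete_function": Severity.WARN,
    "non_response": Severity.WARN,
}

def decide(failed_checks: List[str]) -> Dict:
    """Split failed checks into blocking and warning groups."""
    blocking = [
        name for name in failed_checks
        if SEVERITY_BY_CHECK.get(name, Severity.BLOCK) is Severity.BLOCK
    ]
    warnings = [name for name in failed_checks if name not in blocking]
    return {"blocked": bool(blocking), "blocking": blocking, "warnings": warnings}

print(decide(["incomplete_function"]))
print(decide(["incomplete_function", "syntax"]))
```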

How do I tune safety thresholds?

Start conservative: if you're unsure, filter more. Run your system with a lower threshold (e.g., toxicity > 0.5) and manually review false positives. Gradually raise the threshold until false positives are acceptable. Document your threshold and review monthly.
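
The review loop above can be supported by a small sweep over candidate thresholds on a manually reviewed sample. The following is a minimal sketch; `sweep_thresholds` is a hypothetical helper and the scores and labels are made up for illustration.

```python
from typing import Dict, List, Tuple

def sweep_thresholds(
    reviewed: List[Tuple[float, bool]],
    thresholds: List[float],
) -> Dict[float, Dict[str, int]]:
    """
    reviewed: (classifier score, human label is_toxic) pairs.
    For each threshold, count benign outputs that would be rejected
    (false positives) and toxic outputs that would pass (false negatives).
    """
    report = {}
    for t in thresholds:
        fp = sum(1 for score, toxic in reviewed if score > t and not toxic)
        fn = sum(1 for score, toxic in reviewed if score <= t and toxic)
        report[t] = {"false_positives": fp, "false_negatives": fn}
    return report

# Illustrative labels only, not measured data.
reviewed = [(0.95, True), (0.80, True), (0.65, False),
            (0.60, True), (0.55, False), (0.20, False)]

for threshold, counts in sweep_thresholds(reviewed, [0.5, 0.7]).items():
    print(threshold, counts)
```

Raising the threshold trades false positives for false negatives, so it helps to track both counts before settling on a value.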

Can deterministic checks replace LLM-as-judge?

No. Deterministic checks catch format and safety failures; LLM-as-judge catches quality failures. A model output might be valid JSON, schema-compliant, and non-toxic but still be a bad answer. Use both.

What if my domain has no obvious "bad" pattern?

Start with format and safety checks. Run your system, collect failures, and look for patterns. After 100–1000 outputs, you'll see what goes wrong. Codify those patterns into domain rules.

How do I maintain these checks as my task evolves?

Version your validation harness alongside your model. If you add a new constraint (e.g., "output must mention the source"), update the rules and re-validate your test set. Document each change.
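
Re-validating a test set after a rule change can be sketched as a diff between two rule versions. Everything named below (`RULESETS`, `mentions_source`, `newly_failing`, and the "source:" convention) is hypothetical and stands in for your own rules.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], bool]  # returns True when the output passes

def not_empty(output: str) -> bool:
    return bool(output.strip())

def mentions_source(output: str) -> bool:
    """Hypothetical new constraint: the output must cite a source."""
    return "source:" in output.lower()

# Each version is an explicit, documented list of rules.
RULESETS: Dict[str, List[Rule]] = {
    "v1": [not_empty],
    "v2": [not_empty, mentions_source],
}

def passes(output: str, version: str) -> bool:
    return all(rule(output) for rule in RULESETS[version])

def newly_failing(test_set: List[str], old: str, new: str) -> List[str]:
    """Outputs that passed under the old version but fail under the new one."""
    return [o for o in test_set if passes(o, old) and not passes(o, new)]

test_set = ["Paris. Source: world atlas", "Paris", ""]
print(newly_failing(test_set, "v1", "v2"))
```

Reviewing the newly failing items before shipping a rule change shows whether the new constraint catches real problems or rejects previously acceptable outputs.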
