Semantic Relation Extraction Patterns
Relation extraction is the NLP task of identifying semantic relationships between entities in text. It transforms raw text such as "Alice has worked at Google since 2020" into structured relations: (Alice, WORKS_FOR, Google, since: 2020). Production systems achieve 70–88% F1 on benchmark tasks, and the technique is critical for building complete knowledge graphs.
Unlike entity extraction, which only identifies entities, relation extraction requires understanding context: "Steve Jobs founded Apple" expresses FOUNDED, whereas "Steve visited Apple today" expresses VISITED. As of 2026, enterprise knowledge graphs reportedly process 50 billion relation extractions daily to fuel LLM retrieval (KG Industry Report 2026).
Relation Extraction Approaches
| Approach | Accuracy | Speed | Complexity | Scale |
|---|---|---|---|---|
| Rule-based (regex) | 60–70% | Very fast | Low | Millions |
| Distance-based (feature engineering) | 70–80% | Fast | Medium | Millions |
| CNN + embeddings | 78–85% | Medium | High | Millions |
| Transformer fine-tuning | 82–88% | Slower | High | Millions |
| LLM prompting | 80–92% | Very slow | None | Thousands |
For knowledge graphs, transformer-based extraction offers the best accuracy-to-cost ratio.
Rule-Based Relation Extraction
Simple patterns match common relations. Useful as a baseline:
```python
import re
from typing import List, Tuple

# An entity is approximated as a run of capitalized words, e.g. "Alice Johnson".
ENTITY = r"([A-Z]\w*(?:\s+[A-Z]\w*)*)"


class RuleBasedRelationExtractor:
    """Extract relations using regex patterns."""

    def __init__(self):
        self.patterns = [
            # Pattern: "X works for/at Y"
            (re.compile(ENTITY + r"\s+(?:works|worked)\s+(?:for|at)\s+" + ENTITY),
             "WORKS_FOR"),
            # Pattern: "X founded Y"
            (re.compile(ENTITY + r"\s+founded\s+" + ENTITY), "FOUNDED"),
            # Pattern: "X is a/an <role> at/of Y" (the role is the middle group)
            (re.compile(ENTITY + r"\s+is\s+(?:an?\s+)?(\w+(?:\s+\w+)?)\s+(?:at|of)\s+" + ENTITY),
             "ROLE"),
            # Pattern: "X, based in Y" or "X is based in Y"
            (re.compile(ENTITY + r",?\s+(?:is\s+)?based\s+in\s+" + ENTITY),
             "LOCATED_IN"),
            # Pattern: "X acquired Y"
            (re.compile(ENTITY + r"\s+acquired\s+" + ENTITY), "ACQUIRED"),
        ]

    def extract(self, text: str) -> List[Tuple[str, str, str]]:
        """
        Extract relations from text.

        Returns: [(source, relation_type, target), ...]
        """
        relations = []
        for pattern, rel_type in self.patterns:
            for match in pattern.finditer(text):
                if rel_type == "ROLE":
                    # Keep the matched role in the label, e.g. "ROLE:engineer"
                    source, role, target = match.groups()
                    relations.append((source, f"{rel_type}:{role.lower()}", target))
                else:
                    source, target = match.groups()
                    relations.append((source, rel_type, target))
        return relations


# Example
extractor = RuleBasedRelationExtractor()
text = "Alice Johnson works at Google. Steve Jobs founded Apple. Both companies are based in California."
relations = extractor.extract(text)
for source, rel, target in relations:
    print(f"{source} --[{rel}]--> {target}")

# Output:
# Alice Johnson --[WORKS_FOR]--> Google
# Steve Jobs --[FOUNDED]--> Apple
#
# The third sentence yields nothing: resolving "Both companies" to Google and
# Apple requires coreference resolution, which regex patterns cannot provide.
```
Limitations: rigid patterns miss paraphrases (passive voice, synonyms) and cannot use wider context such as coreference. Rule-based extraction is a good starting point but is insufficient for production on its own.
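To make the limitation concrete, the sketch below runs one illustrative pattern against three paraphrases of the same fact. The `NAME` and `FOUNDED` patterns and the `match_founded` helper are assumptions made for this example; they are not a complete rule set.

```python
import re

# Illustrative pattern (an assumption for this example):
# "<Capitalized Name> founded <Capitalized Name>"
NAME = r"([A-Z]\w*(?:\s+[A-Z]\w*)*)"
FOUNDED = re.compile(NAME + r"\s+founded\s+" + NAME)

paraphrases = [
    "Steve Jobs founded Apple",          # active voice: matches
    "Apple was founded by Steve Jobs",   # passive voice: missed
    "Steve Jobs started Apple in 1976",  # synonym verb: missed
]


def match_founded(sentence):
    """Return (founder, company) if the pattern fires, else None."""
    m = FOUNDED.search(sentence)
    return m.groups() if m else None


results = [match_founded(s) for s in paraphrases]
for sentence, result in zip(paraphrases, results):
    print(f"{sentence!r} -> {result}")
```

Only the first sentence produces a triple. Covering the other two would need additional hand-written patterns, which is why rule sets tend to grow quickly while still missing variations.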
Distance-Based Relation Extraction
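This approach classifies candidate entity pairs from contextual features such as word distance, entity types, and syntactic path. The sketch below uses only character distance and the TF-IDF of the text between the two mentions: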
```python
from typing import List, Optional, Tuple

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


class DistanceBasedRelationExtractor:
    """Classify entity pairs using hand-crafted features."""

    N_CUSTOM = 3  # distance, between-text length, before-text length

    def __init__(self):
        self.model = RandomForestClassifier(n_estimators=100, random_state=42)
        self.vectorizer = TfidfVectorizer(max_features=1000)

    def _context(self, text: str, entity1: str, entity2: str) -> Optional[Tuple[int, str, str]]:
        """Return (distance, before_text, between_text), or None if an entity is missing."""
        lowered = text.lower()
        pos1 = lowered.find(entity1.lower())
        pos2 = lowered.find(entity2.lower())
        if pos1 == -1 or pos2 == -1:
            return None
        # Order the two mentions by their position in the text
        (first, first_len), (second, _) = sorted(
            [(pos1, len(entity1)), (pos2, len(entity2))]
        )
        before_text = text[:first]
        between_text = text[first + first_len:second]  # excludes both mentions
        return second - first, before_text, between_text

    def extract_features(self, text: str, entity1: str, entity2: str) -> np.ndarray:
        """Build the feature vector for a candidate pair (call train() first)."""
        context = self._context(text, entity1, entity2)
        if context is None:
            # Same length as a real vector: custom features + TF-IDF vocabulary
            return np.zeros(self.N_CUSTOM + len(self.vectorizer.vocabulary_))
        distance, before_text, between_text = context
        tfidf = self.vectorizer.transform([between_text]).toarray()[0]
        return np.concatenate([[distance, len(between_text), len(before_text)], tfidf])

    def train(self, training_data: List[Tuple[str, str, str, str, bool]]):
        """
        Train a binary "relation / no relation" classifier.

        training_data: [(text, entity1, entity2, relation_type, has_relation)]
        """
        # Fit the TF-IDF vocabulary on the text between each entity pair
        contexts = [self._context(text, e1, e2) for text, e1, e2, _, _ in training_data]
        self.vectorizer.fit([c[2] if c else "" for c in contexts])
        X = np.array([
            self.extract_features(text, e1, e2)
            for text, e1, e2, _, _ in training_data
        ])
        y = np.array([has_rel for _, _, _, _, has_rel in training_data])
        self.model.fit(X, y)

    def predict(self, text: str, entity1: str, entity2: str) -> float:
        """Probability that a relation exists."""
        features = self.extract_features(text, entity1, entity2)
        return self.model.predict_proba([features])[0][1]


# Example (toy data; a real model needs far more examples and richer features)
extractor = DistanceBasedRelationExtractor()
train_data = [
    ("Alice works at Google", "Alice", "Google", "WORKS_FOR", True),
    ("Google is in Mountain View", "Google", "Mountain View", "LOCATED_IN", True),
    ("Alice bought a coffee today", "Alice", "coffee", "BOUGHT", False),
]
extractor.train(train_data)
print(extractor.predict("Bob works at Acme", "Bob", "Acme"))
```
Transformer-Based Relation Extraction
State-of-the-art: fine-tune a transformer (BERT, RoBERTa) to classify entity pairs. This is now standard in production:
```python
from typing import List, Tuple

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class TransformerRelationExtractor:
    """Extract relations using transformer models."""

    ENTITY_MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]

    def __init__(self, model_name: str = "bert-base-uncased"):
        self.relation_labels = [
            "NONE", "WORKS_FOR", "FOUNDED", "LOCATED_IN", "ACQUIRED", "EMPLOYEE_OF"
        ]
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=len(self.relation_labels)  # 5 relation types + NONE
        )
        # Register the markers as special tokens so they are not split into
        # word pieces, then resize the embeddings to cover the new token ids
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": self.ENTITY_MARKERS}
        )
        self.model.resize_token_embeddings(len(self.tokenizer))
        self.model.eval()  # inference mode (disables dropout)

    def encode_with_entity_markers(self, text: str, entity1: str, entity2: str) -> dict:
        """
        Encode text with special tokens marking entity boundaries.

        Example: "Alice works at Google" becomes
        "[E1] Alice [/E1] works at [E2] Google [/E2]"
        """
        # Insert markers around the first occurrence of each entity.
        # Assumes neither entity string is contained in the other.
        marked_text = text.replace(entity1, f"[E1] {entity1} [/E1]", 1)
        marked_text = marked_text.replace(entity2, f"[E2] {entity2} [/E2]", 1)
        # Tokenize
        encoding = self.tokenizer(
            marked_text,
            max_length=256,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )
        return encoding

    def predict(self, text: str, entity1: str, entity2: str) -> Tuple[str, float]:
        """
        Predict the relation between two entities.

        Returns: (relation_type, confidence)
        """
        encoding = self.encode_with_entity_markers(text, entity1, entity2)
        with torch.no_grad():
            outputs = self.model(**encoding)
        probs = torch.softmax(outputs.logits, dim=-1)[0]
        pred_idx = torch.argmax(probs).item()
        confidence = probs[pred_idx].item()
        return self.relation_labels[pred_idx], confidence

    def extract_all_pairs(self, text: str, entities: List[str]) -> List[Tuple[str, str, str, float]]:
        """
        Extract relations for all entity pairs in text.

        Each pair is scored once, in list order (e1 before e2).
        Returns: [(entity1, relation, entity2, confidence), ...]
        """
        relations = []
        for i, e1 in enumerate(entities):
            for e2 in entities[i + 1:]:
                if e1.lower() != e2.lower():
                    rel, conf = self.predict(text, e1, e2)
                    if rel != "NONE" and conf > 0.5:
                        relations.append((e1, rel, e2, conf))
        return relations


# Example (requires a fine-tuned model; here we show the interface)
# extractor = TransformerRelationExtractor()
# text = "Alice Johnson works at Google, which is located in Mountain View."
# entities = ["Alice Johnson", "Google", "Mountain View"]
# relations = extractor.extract_all_pairs(text, entities)
# for e1, rel, e2, conf in relations:
#     print(f"{e1} --[{rel}]--> {e2} ({conf:.2f})")
```
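Marking entities with `str.replace` can go wrong when one entity string occurs inside the other (for example "Acme Labs" and "Acme"), because the second replacement may land inside the first marker pair. A possible alternative, sketched here under the assumption that an upstream entity extractor supplies character offsets, is to insert the markers by position. The `mark_entities` helper and the example company names are hypothetical.

```python
def mark_entities(text, span1, span2):
    """Wrap two non-overlapping (start, end) character spans in entity markers.

    span1 is tagged [E1] ... [/E1] and span2 is tagged [E2] ... [/E2],
    regardless of which one comes first in the text.
    """
    # Sort the tagged spans by position so the text can be rebuilt left to right
    (tag_a, (a0, a1)), (tag_b, (b0, b1)) = sorted(
        [("E1", span1), ("E2", span2)], key=lambda item: item[1]
    )
    if a1 > b0:
        raise ValueError("entity spans must not overlap")
    return (
        text[:a0]
        + f"[{tag_a}] {text[a0:a1]} [/{tag_a}]"
        + text[a1:b0]
        + f"[{tag_b}] {text[b0:b1]} [/{tag_b}]"
        + text[b1:]
    )


text = "Acme Labs is a subsidiary of Acme."
start2 = text.rfind("Acme")  # offset of the second, shorter mention
marked = mark_entities(text, (0, len("Acme Labs")), (start2, start2 + len("Acme")))
print(marked)
# [E1] Acme Labs [/E1] is a subsidiary of [E2] Acme [/E2].
```

Working from offsets also means that repeated mentions of the same entity can be marked individually instead of always marking the first occurrence.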
Fine-Tuning a Relation Extraction Model
To adapt a pre-trained model to your domain, fine-tune on labeled data:
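```python
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


class RelationDataset(Dataset):
    """Custom dataset for relation extraction."""

    def __init__(self, texts, entity1s, entity2s, labels, tokenizer, max_len=256):
        self.texts = texts
        self.entity1s = entity1s
        self.entity2s = entity2s
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        e1 = self.entity1s[idx]
        e2 = self.entity2s[idx]
        # Mark entities (assumes neither entity string contains the other)
        marked = text.replace(e1, f"[E1] {e1} [/E1]", 1)
        marked = marked.replace(e2, f"[E2] {e2} [/E2]", 1)
        encoding = self.tokenizer(
            marked,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),  # drop the batch dimension
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long)
        }


def fine_tune(texts, entity1s, entity2s, labels, num_labels=6,
              model_name="bert-base-uncased", output_dir="relation-model"):
    """Fine-tune a sequence classifier on marked entity pairs."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    # Keep the entity markers as single tokens
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
    )
    model.resize_token_embeddings(len(tokenizer))

    dataset = RelationDataset(texts, entity1s, entity2s, labels, tokenizer)
    training_args = TrainingArguments(
        output_dir=output_dir,           # placeholder path
        num_train_epochs=3,              # illustrative hyperparameters
        per_device_train_batch_size=16,
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
    trainer.train()
    return trainer


# Training example (the data is elided; labels index into relation_labels)
# train_texts = ["Alice works at Google", "Google acquired DeepMind", ...]
# train_e1s = ["Alice", "Google", ...]
# train_e2s = ["Google", "DeepMind", ...]
# train_labels = [1, 4, ...]  # WORKS_FOR=1, ACQUIRED=4, etc.
#
# trainer = fine_tune(train_texts, train_e1s, train_e2s, train_labels)
```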
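The integer labels in the training example (WORKS_FOR=1, ACQUIRED=4) are positions in the relation label list used by the transformer extractor. Writing those numbers by hand is error-prone, so one option is to derive them from the label names. The sketch below assumes the same six-label list; `LABEL2ID`, `ID2LABEL`, and `encode_labels` are names introduced here for illustration.

```python
RELATION_LABELS = [
    "NONE", "WORKS_FOR", "FOUNDED", "LOCATED_IN", "ACQUIRED", "EMPLOYEE_OF"
]
LABEL2ID = {label: idx for idx, label in enumerate(RELATION_LABELS)}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}


def encode_labels(names):
    """Map relation names to integer class ids, failing loudly on unknown names."""
    unknown = sorted(set(names) - set(LABEL2ID))
    if unknown:
        raise ValueError(f"unknown relation labels: {unknown}")
    return [LABEL2ID[name] for name in names]


train_labels = encode_labels(["WORKS_FOR", "ACQUIRED"])
print(train_labels)  # [1, 4]
```

The reverse mapping `ID2LABEL` turns a predicted class index back into a relation name at inference time.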
Production Relation Extraction Pipeline
Combine rule-based patterns (fast, high precision) with transformer models (accurate):
```python
from typing import List, Tuple


class HybridRelationExtractor:
    """Use rules for common patterns; transformers for difficult cases."""

    def __init__(self, transformer=None, min_confidence: float = 0.6):
        self.rule_extractor = RuleBasedRelationExtractor()
        # In practice, pass a fine-tuned TransformerRelationExtractor here.
        # With transformer=None, only the rules run.
        self.transformer = transformer
        self.min_confidence = min_confidence

    def extract(self, text: str, entities: List[str]) -> List[Tuple[str, str, str, float]]:
        """
        Extract relations: first try rules, then transformer for unknowns.
        """
        # Step 1: Fast rule-based extraction
        rule_relations = self.rule_extractor.extract(text)

        # Step 2: Collect entity pairs not covered by rules
        covered_pairs = {(source, target) for source, _, target in rule_relations}
        uncovered_pairs = [
            (e1, e2)
            for i, e1 in enumerate(entities)
            for e2 in entities[i + 1:]
            if (e1, e2) not in covered_pairs and (e2, e1) not in covered_pairs
        ]

        # Step 3: Use transformer for uncovered pairs (skipped if none was provided)
        transformer_relations = []
        if self.transformer is not None:
            for e1, e2 in uncovered_pairs:
                rel, conf = self.transformer.predict(text, e1, e2)
                if rel != "NONE" and conf > self.min_confidence:
                    transformer_relations.append((e1, rel, e2, conf))

        # Combine results; rule matches get a fixed, assumed confidence of 0.95
        all_relations = [
            (source, rel, target, 0.95)
            for source, rel, target in rule_relations
        ] + transformer_relations
        return all_relations


# Example (rules only, since no transformer is passed in)
extractor = HybridRelationExtractor()
text = "Alice Johnson is an engineer at Google. She was hired in 2020."
entities = ["Alice Johnson", "Google", "2020"]
relations = extractor.extract(text, entities)
for e1, rel, e2, conf in relations:
    print(f"{e1} --[{rel}]--> {e2} ({conf:.2f})")
```
Key Takeaways
- Relation extraction identifies semantic connections between entities; it's essential for building knowledge graphs.
- Rule-based patterns are fast and precise for common relations but miss variations.
- Transformer-based models achieve 82–88% accuracy and handle complex contexts.
- A hybrid approach (rules for common patterns, transformers for unknowns) balances speed and accuracy.
- Fine-tuning on domain data improves accuracy by 5–15% when shifting to specialized text.
Frequently Asked Questions
What if the same entity pair can have multiple relations?
Some entity pairs hold several relations at once (e.g., "Alice works at Google" and "Alice founded Google"). Transformer models can be extended to multi-label classification, although such pairs are rare in practice. For knowledge graphs, store all relations; at query time, the LLM chooses the appropriate relation type.
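One common way to extend a classifier to multi-label output is to score each relation type independently with a sigmoid and keep every label above a threshold, instead of taking a single softmax argmax. The sketch below shows only that decoding step in plain Python. The label list, the logits, and the 0.5 threshold are assumptions for illustration, not values from a trained model.

```python
import math

RELATION_LABELS = ["WORKS_FOR", "FOUNDED", "LOCATED_IN", "ACQUIRED"]


def sigmoid(x):
    """Squash a logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Return (label, probability) for every label that clears the threshold."""
    scored = [(label, sigmoid(z)) for label, z in zip(RELATION_LABELS, logits)]
    return [(label, p) for label, p in scored if p >= threshold]


# Hypothetical logits for the pair (Alice, Google)
logits = [2.0, 1.0, -3.0, -2.0]
for label, prob in decode_multilabel(logits):
    print(f"{label}: {prob:.2f}")
```

Here both WORKS_FOR and FOUNDED clear the threshold, so both relations would be stored for the pair. An empty result plays the role of the NONE class.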
How do I handle relations that span multiple sentences?
Standard relation extractors work within sentence boundaries. For longer-range relations, use paragraph-level encoding or coreference resolution to link pronouns to earlier entities. For example, "Alice joined Google in 2020. She leads the AI ethics team" requires coreference resolution to link "She" to "Alice".
Can I use LLMs (Claude, GPT-4) for relation extraction?
Yes. Prompt the LLM with a few labeled examples and ask it to return relations in a structured format. LLMs are flexible and handle complex reasoning, but they are slower and more expensive than fine-tuned transformer classifiers. Use LLMs for ambiguous or nuanced cases; use fine-tuned transformers for high-throughput extraction.
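A minimal sketch of the prompting side is shown below: it builds a few-shot prompt and validates the reply, leaving the actual model call out because that depends on the provider's client library. The prompt wording, the JSON schema, and the helper names `build_prompt` and `parse_relations` are all assumptions made for this example.

```python
import json

RELATION_TYPES = ["WORKS_FOR", "FOUNDED", "LOCATED_IN", "ACQUIRED"]

# One worked example to show the model the expected output format
FEW_SHOT = [
    ("Steve Jobs founded Apple.",
     [{"source": "Steve Jobs", "relation": "FOUNDED", "target": "Apple"}]),
]


def build_prompt(text):
    """Assemble instructions, few-shot examples, and the text to analyze."""
    lines = [
        "Extract relations from the text as a JSON list of objects "
        'with keys "source", "relation", "target".',
        "Allowed relation types: " + ", ".join(RELATION_TYPES) + ".",
        "",
    ]
    for example_text, example_relations in FEW_SHOT:
        lines += [f"Text: {example_text}",
                  f"Relations: {json.dumps(example_relations)}", ""]
    lines += [f"Text: {text}", "Relations:"]
    return "\n".join(lines)


def parse_relations(reply):
    """Parse a model reply; drop malformed items and unknown relation types."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    relations = []
    for item in items:
        if not isinstance(item, dict):
            continue
        triple = (item.get("source"), item.get("relation"), item.get("target"))
        if all(isinstance(v, str) for v in triple) and triple[1] in RELATION_TYPES:
            relations.append(triple)
    return relations


prompt = build_prompt("Alice Johnson works at Google.")
# Hand-written stand-in for a model reply (not real model output)
reply = ('[{"source": "Alice Johnson", "relation": "WORKS_FOR", "target": "Google"}, '
         '{"source": "Alice Johnson", "relation": "LIKES", "target": "coffee"}]')
print(parse_relations(reply))
# [('Alice Johnson', 'WORKS_FOR', 'Google')]
```

Validating the reply against a closed list of relation types keeps unexpected labels out of the knowledge graph, at the cost of discarding relations the schema does not cover.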
How do I evaluate relation extraction accuracy?
Use precision, recall, and F1-score on a manually annotated test set. Precision = correct relations / predicted relations. Recall = correct relations / actual relations. F1 is the harmonic mean of the two. For knowledge graphs, high recall is crucial (missing relations = incomplete graphs).
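These definitions translate directly into set operations over (source, relation, target) triples. The sketch below assumes exact-match scoring, where a predicted triple counts as correct only if all three elements match the annotation. The `evaluate` helper and the sample triples are illustrative.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact-match relation triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    ("Alice", "WORKS_FOR", "Google"),
    ("Google", "LOCATED_IN", "Mountain View"),
    ("Steve Jobs", "FOUNDED", "Apple"),
}
predicted = {
    ("Alice", "WORKS_FOR", "Google"),   # correct
    ("Steve Jobs", "FOUNDED", "Apple"), # correct
    ("Alice", "FOUNDED", "Google"),     # wrong relation type
    ("Apple", "ACQUIRED", "Google"),    # spurious
}

precision, recall, f1 = evaluate(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
# P=0.50 R=0.67 F1=0.57
```

Two of four predictions are correct (precision 0.50) and two of three annotated relations are found (recall 0.67).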
What's the computational cost of extracting relations at scale?
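Transformer-based extraction takes roughly 10–50 ms per entity pair on a GPU. For 1 million documents with 10 entities each (about 50 candidate pairs per document), that is 50 million pairs, so expect on the order of 500 GPU-hours (roughly 140–700 at the latencies above when pairs are scored one at a time). Cost: roughly USD 500–5,000 on cloud infrastructure, depending on GPU type and pricing.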
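The arithmetic behind these figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes pairs are scored one at a time with no batching. The function names are introduced here, and the USD 1–10 per GPU-hour range is simply what the figures in the text imply (USD 500–5,000 over 500 GPU-hours), not a quoted price.

```python
from math import comb


def estimate_gpu_hours(num_docs, pairs_per_doc, ms_per_pair):
    """GPU-hours needed when entity pairs are scored one at a time."""
    total_pairs = num_docs * pairs_per_doc
    return total_pairs * ms_per_pair / 1000 / 3600


def estimate_cost(gpu_hours, usd_per_gpu_hour):
    """Cost in USD at an assumed hourly GPU price."""
    return gpu_hours * usd_per_gpu_hour


# 10 entities give comb(10, 2) = 45 unordered pairs, close to the ~50 in the text
pairs_from_entities = comb(10, 2)

low = estimate_gpu_hours(1_000_000, 50, 10)   # 10 ms per pair
high = estimate_gpu_hours(1_000_000, 50, 50)  # 50 ms per pair
print(f"{low:.0f}-{high:.0f} GPU-hours")
# 139-694 GPU-hours

print(estimate_cost(500, 1.0), estimate_cost(500, 10.0))
# 500.0 5000.0
```

Because the number of pairs grows quadratically with the number of entities per document, filtering candidate pairs before scoring is usually the most effective way to reduce this cost.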