Privacy-Preserving ML: Federated Learning & DP
Privacy-preserving machine learning (PPML) shifts the paradigm: instead of collecting all data centrally to train a model, training happens at the edge (on users' devices or within local organizations) and only model updates are shared. Federated learning trains a global model by averaging updates from many participants, from a handful of organizations to thousands of devices, without centralizing raw data. Differential privacy adds calibrated noise to updates so that the model output reveals provably little about any single individual. Secure aggregation uses cryptographic protocols so the server aggregating updates learns only the aggregate, not individual updates. Together, these techniques enable AI on sensitive data (health, finance, personal) while reducing breach risk and easing compliance with privacy regulations.
Federated Learning: Training Without Centralizing Data
Federated learning addresses a fundamental problem: organizations want to collaborate on AI (for example, building a better fraud detection model by pooling bank data) but cannot share raw data due to privacy, regulation, or competition. Each round of federated learning proceeds as follows:
- Each participant trains a local model on its own data (no sharing).
- Each participant sends only its model weights or weight updates (not the data) to a central server.
- The server averages the weights from all participants to create a global model.
- The server sends the global model back to the participants.
- The process repeats until the model converges.
Example: Five hospitals build a shared diagnostic model. Hospital A trains locally on their patients, generating weight updates. They send updates (not patient data) to a coordinator. Hospitals B–E do the same. The coordinator averages all updates and sends the global model back. Each hospital retains all their data; the coordinator never sees raw data.
Advantages: No data sharing, regulatory compliance (data stays in-country), reduced breach risk (no central data store). Disadvantages: Slower convergence (fewer data samples per round), statistical heterogeneity (data distributions vary across participants), communication overhead (frequent network transfers of large models).
Practical challenge: If a participant has only 10 samples and a model has 10 million parameters, the local update is very noisy and may hurt the global model. Federated learning works best with diverse participants contributing substantial data.
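One common way to limit the damage from very small participants is to weight each update by the number of samples behind it, as in the FedAvg algorithm. The sketch below is a minimal illustration under that assumption; the function name `federated_average` and the toy numbers are hypothetical and not taken from any framework.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its sample count."""
    stacked = np.stack(client_weights)               # shape: (num_clients, num_params)
    sizes = np.asarray(client_sizes, dtype=float)    # one weight per client
    return np.average(stacked, axis=0, weights=sizes)

# Two large clients agree on 1.0; a 10-sample client reports a noisy 5.0.
weights = [np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.array([5.0, 5.0])]
sizes = [1000, 1000, 10]

weighted = federated_average(weights, sizes)   # stays close to 1.0
unweighted = np.mean(weights, axis=0)          # pulled up to about 2.33
print("weighted:  ", weighted)
print("unweighted:", unweighted)
```

With sample-count weighting, the 10-sample participant contributes in proportion to its data, so its noisy update barely moves the global model.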
Differential Privacy: Formal Privacy Guarantees
Differential privacy (Article 3) adds calibrated noise so the model output is almost identical whether a specific individual's data is included or not. Formally: an algorithm is DP if the probability distribution of its output is nearly the same for datasets D and D' (differing by one record).
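Written out, the standard (ε, δ) form of this definition is shown below. The symbols M (the randomized algorithm), S (any set of possible outputs), and δ (a small failure probability) are notation introduced here for illustration; setting δ = 0 gives pure ε-differential privacy.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S
```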
The privacy budget epsilon (ε) quantifies privacy loss. Smaller epsilon = stricter privacy (less information leakage). There is no universal standard, but rough rules of thumb are:
- ε < 0.5: Very strong privacy (minimal information leakage)
- ε < 1.0: Strong privacy (high bar for academic/regulatory use)
- ε < 3.0: Reasonable privacy (often sufficient for business use)
- ε > 3.0: Weak privacy (marginal benefit over non-private model)
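In federated learning, DP is applied per round: before sending a local update to the server, each participant clips the update to a bounded norm and adds noise calibrated to that bound and to ε. The server then aggregates the noisy updates. Privacy loss accumulates across rounds: under basic sequential composition, K rounds cost at most K × the per-round epsilon, and tighter accounting methods can give a smaller total.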
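The K × ε figure comes from basic sequential composition, which is an upper bound. As a rough sketch of how the total can be tightened, the code below compares it with the advanced composition theorem, which trades a small extra failure probability (called `delta_prime` here, a parameter not present in the text above) for a total that grows roughly with the square root of K. The function names are illustrative only.

```python
import math

def basic_composition(eps_per_round, num_rounds):
    """Total epsilon under basic sequential composition: K * epsilon."""
    return eps_per_round * num_rounds

def advanced_composition(eps_per_round, num_rounds, delta_prime):
    """Total epsilon under advanced composition, valid up to an extra
    failure probability of delta_prime."""
    sqrt_term = math.sqrt(2 * num_rounds * math.log(1 / delta_prime)) * eps_per_round
    linear_term = num_rounds * eps_per_round * (math.exp(eps_per_round) - 1)
    return sqrt_term + linear_term

for eps, rounds in [(0.5, 10), (0.1, 100)]:
    basic = basic_composition(eps, rounds)
    advanced = advanced_composition(eps, rounds, delta_prime=1e-5)
    print(f"eps={eps}, rounds={rounds}: basic={basic:.2f}, advanced={advanced:.2f}")
```

For a few rounds with a large per-round epsilon, the basic bound is the smaller of the two; for many rounds with a small per-round epsilon, advanced composition gives the smaller total. Both bounds are valid, so in practice one takes the minimum.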
Trade-off: Higher privacy (lower epsilon) requires more noise, reducing model accuracy. You must balance privacy and utility. A medical diagnostic model with 60% accuracy is useless, but so is a model that leaks patient data. Determining the right epsilon requires collaboration between privacy researchers and domain experts.
Code Example: Federated Learning with Differential Privacy
Below is a simplified federated learning simulation written in plain NumPy. It mirrors the structure of frameworks such as TensorFlow Federated (TFF) but does not use them:
import numpy as np
from typing import List, Optional, Tuple
from dataclasses import dataclass


@dataclass
class FederatedParticipant:
    """A participant (device, organization) in federated learning."""
    participant_id: str
    local_data_size: int
    data: np.ndarray  # Local training data (features only, no labels)
    local_weights: Optional[np.ndarray] = None  # Copy of the global model for this round


class FederatedLearningServer:
    """Coordinate federated learning rounds."""

    def __init__(self, num_features: int, learning_rate: float = 0.01):
        self.num_features = num_features
        self.learning_rate = learning_rate
        self.global_weights = np.random.randn(num_features) * 0.01
        self.round = 0

    def train_round(
        self,
        participants: List[FederatedParticipant],
        epsilon: float = 1.0
    ) -> Tuple[np.ndarray, float]:
        """
        Execute one federated learning round.
        Returns: (global_weights, privacy_loss_this_round)
        """
        self.round += 1
        # Step 1: Distribute global model
        for participant in participants:
            participant.local_weights = self.global_weights.copy()
        # Step 2: Local training (simplified: gradient update)
        clip_norm = 1.0  # L1 clipping bound (an assumption of this sketch)
        local_updates = []
        for participant in participants:
            # Simulate local training: compute gradient
            local_gradient = self._compute_local_gradient(participant)
            # Clip to L1 norm <= clip_norm so the sensitivity is bounded
            l1_norm = np.sum(np.abs(local_gradient))
            clipped = local_gradient * min(1.0, clip_norm / max(l1_norm, 1e-12))
            # Apply differential privacy: Laplace noise with scale = sensitivity / epsilon.
            # Two clipped updates differ by at most 2 * clip_norm in L1 norm.
            noise = np.random.laplace(0, 2.0 * clip_norm / epsilon, size=clipped.shape)
            private_gradient = clipped + noise
            local_updates.append(private_gradient)
        # Step 3: Aggregation (simplified: plain average, not secure)
        avg_update = np.mean(local_updates, axis=0)
        # Step 4: Update global model
        self.global_weights -= self.learning_rate * avg_update
        return self.global_weights, epsilon

    def _compute_local_gradient(self, participant: FederatedParticipant) -> np.ndarray:
        """Simulate gradient computation on local data."""
        # Simplified objective: 0.5 * ||weights - mean(data)||^2,
        # whose gradient is weights - mean(data)
        local_mean = np.mean(participant.data, axis=0)
        gradient = self.global_weights - local_mean
        return gradient


class DifferentialPrivacyAnalyzer:
    """Analyze privacy budget consumption (basic sequential composition)."""

    def __init__(self, total_epsilon_budget: float = 10.0):
        self.total_budget = total_epsilon_budget
        self.used = 0.0

    def consume_epsilon(self, epsilon_per_round: float, num_rounds: int) -> bool:
        """
        Check if we can run num_rounds with epsilon_per_round.
        Returns True (and records the spend) if within budget.
        """
        total_epsilon = epsilon_per_round * num_rounds
        if self.used + total_epsilon > self.total_budget:
            return False
        self.used += total_epsilon
        return True

    def report_status(self) -> dict:
        """Report privacy budget status."""
        remaining = self.total_budget - self.used
        percentage = (self.used / self.total_budget) * 100
        return {
            "total_budget": self.total_budget,
            "used": self.used,
            "remaining": remaining,
            "usage_percent": percentage
        }


# Example: Federated learning with 5 hospitals
participants = [
    FederatedParticipant(
        participant_id=f"hospital_{i}",
        local_data_size=100,
        data=np.random.randn(100, 10)  # 100 samples, 10 features
    )
    for i in range(5)
]
server = FederatedLearningServer(num_features=10, learning_rate=0.01)
privacy_analyzer = DifferentialPrivacyAnalyzer(total_epsilon_budget=10.0)

# Run federated learning rounds
epsilon_per_round = 0.5
num_rounds = 10
if privacy_analyzer.consume_epsilon(epsilon_per_round, num_rounds):
    print(f"Running {num_rounds} federated rounds with ε={epsilon_per_round}/round")
    for round_num in range(num_rounds):
        weights, privacy_loss = server.train_round(participants, epsilon=epsilon_per_round)
        print(f"Round {round_num + 1}: global weights updated, privacy loss={privacy_loss}")
else:
    print("Insufficient privacy budget for requested training")

# Report privacy budget status
status = privacy_analyzer.report_status()
print("\nPrivacy Budget Status:")
print(f"Used: {status['used']:.2f} / {status['total_budget']:.2f} ({status['usage_percent']:.1f}%)")
print(f"Remaining: {status['remaining']:.2f}")
This simplified example shows the core concept: local training + noise injection + aggregation. Production systems (TensorFlow Federated, PySyft) add complexity: compression (to reduce communication), asynchronous updates (participants train at different speeds), and more sophisticated DP accounting (such as Rényi differential privacy).
Code Example: Secure Aggregation
In the federated learning example above, the server sees individual updates. A compromised server could attempt to invert those updates and reconstruct training data. Secure aggregation uses cryptographic protocols so the server learns only the aggregate (average), not individual updates.
Below is a simplified stand-in for secure aggregation. It exposes the interface of a secure aggregator but averages in plaintext; a real implementation would use secret sharing:
import numpy as np
from typing import Dict


class SecureAggregation:
    """Stand-in for secure aggregation (e.g. via Shamir's secret sharing).

    This simplified version performs no cryptography.
    """

    def __init__(self, num_participants: int):
        self.num_participants = num_participants

    def aggregate_securely(self, updates: Dict[str, np.ndarray]) -> np.ndarray:
        """
        Return the average of all updates.

        In a real protocol, participants split their update into secret shares
        and send each share to a different server (for example, 3 aggregation
        servers), so that only the aggregate is ever reconstructed.
        This simplified version averages in plaintext instead.
        In practice, use multi-party computation (MPC) libraries.
        """
        # For simplicity, just average directly
        # In production, use proper MPC: https://github.com/OpenMined/PySyft
        aggregated = np.zeros_like(next(iter(updates.values())), dtype=float)
        for update in updates.values():
            aggregated += update
        aggregated /= len(updates)
        return aggregated

    def demo_secure_aggregation(self):
        """Demonstrate the aggregation interface."""
        updates = {
            "hospital_1": np.array([0.1, 0.2, 0.3]),
            "hospital_2": np.array([0.15, 0.25, 0.35]),
            "hospital_3": np.array([0.12, 0.22, 0.32])
        }
        aggregate = self.aggregate_securely(updates)
        print(f"Securely aggregated update: {aggregate}")
        # Output: approximately [0.123 0.223 0.323] (average of individual updates)


# Example
agg = SecureAggregation(num_participants=3)
agg.demo_secure_aggregation()
Real secure aggregation uses multi-party computation (MPC), available in libraries like PySyft or CrypTen. This is where federated learning becomes truly privacy-preserving: the server never sees an individual update in the clear.
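To make the idea concrete, here is a minimal single-process sketch of one well-known approach, pairwise additive masking: every pair of participants agrees on a random mask, one adds it and the other subtracts it, so each masked update looks random but the masks cancel in the sum. Several things here are simplifying assumptions: a shared random generator stands in for pairwise key agreement, the arithmetic is floating point rather than over a finite field, and participant dropout is not handled. The function names are hypothetical.

```python
import itertools
import numpy as np

def mask_updates(updates, seed=0):
    """Return masked copies of each update; the masks cancel when all are summed."""
    rng = np.random.default_rng(seed)  # stands in for pairwise shared secrets
    masked = {pid: np.array(u, dtype=float) for pid, u in updates.items()}
    for a, b in itertools.combinations(sorted(updates), 2):
        mask = rng.normal(0.0, 100.0, size=masked[a].shape)
        masked[a] += mask  # participant a adds the shared mask
        masked[b] -= mask  # participant b subtracts the same mask
    return masked

def aggregate(masked):
    """Server side: sum the masked updates and divide by the participant count."""
    return sum(masked.values()) / len(masked)

updates = {
    "hospital_1": np.array([0.1, 0.2, 0.3]),
    "hospital_2": np.array([0.15, 0.25, 0.35]),
    "hospital_3": np.array([0.12, 0.22, 0.32]),
}
masked = mask_updates(updates)
print("what the server sees from hospital_1:", masked["hospital_1"])
print("aggregate recovered by the server:   ", aggregate(masked))
```

The server only ever handles the masked vectors, yet the recovered aggregate matches the plain average of the three updates up to floating-point rounding.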
PPML Trade-offs and Real-World Challenges
Privacy-Utility Trade-off: Stronger privacy (lower epsilon, more noise) reduces model accuracy. For illustration, a medical model that reaches 99% accuracy on centralized data might reach only 85% under strong privacy. Determining an acceptable epsilon requires domain expertise and user acceptance.
Communication Cost: In federated learning, models (often millions of parameters) are transmitted in each round. Bandwidth becomes the bottleneck. Compression techniques (quantization, sketching) reduce communication but introduce additional approximation error.
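As a small illustration of the compression trade-off, the sketch below applies uniform 8-bit quantization to a float32 update. It is a minimal example under simple assumptions (min/max scaling, one scale per update); the helper names `quantize_uint8` and `dequantize` are hypothetical, and production systems use more elaborate schemes.

```python
import numpy as np

def quantize_uint8(update):
    """Map a float array to 8-bit codes plus the offset and step needed to decode."""
    lo, hi = float(update.min()), float(update.max())
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant update: any step size works
    codes = np.clip(np.round((update - lo) / scale), 0, 255).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct an approximate float32 update from the 8-bit codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
update = rng.normal(size=1000).astype(np.float32)

codes, lo, scale = quantize_uint8(update)
restored = dequantize(codes, lo, scale)
max_error = float(np.max(np.abs(restored - update)))

print("bytes before:", update.nbytes, "bytes after:", codes.nbytes)
print("max error:", max_error, "half step:", scale / 2)
```

The payload shrinks by a factor of four, and the price is a reconstruction error of up to about half a quantization step per parameter.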
Statistical Heterogeneity: Hospitals' patient data distributions differ significantly. A model trained by averaging updates may not perform well on any individual hospital's data. Personalization techniques (per-participant fine-tuning) help but require careful design.
Deployment Complexity: Federated learning requires infrastructure: coordinating distributed participants, handling dropped connections, managing versioning. Most companies start with simpler approaches (differential privacy on centralized data) before investing in federated learning.
Production Deployment Considerations
For 2026, PPML is production-ready in these scenarios:
- Internal federated learning: Federate across company offices or business units (no external parties). Use TensorFlow Federated or a similar framework. Focus on data locality, not privacy guarantees (privacy is secondary to operational benefits).
- Differential privacy on sensitive data: Train centrally on sensitive data (health, finance) using DP. Add noise calibrated to epsilon tolerance. Validate utility on representative test sets. Deploy with proper audit logging.
- Third-party federated platforms: Use services like Google's Federated Learning as a Service (FL@H), Apple's on-device ML, or OpenMined's PySyft for small experiments. These are mature for specific domains (mobile keyboards, health monitoring) but less mature for arbitrary ML tasks.
- Hybrid approach: Use federated learning for raw data (stays on-device), centralize model updates (sent to server), apply differential privacy to aggregated statistics. Best of both worlds: data privacy + simpler deployment.
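For the centralized scenario (differential privacy on sensitive data), training typically follows the DP-SGD pattern: clip each example's gradient, sum, add Gaussian noise scaled to the clipping bound, then average. The sketch below shows one such step in NumPy. The function name and parameters (`clip_norm`, `noise_multiplier`) are illustrative assumptions, and converting a noise multiplier into an epsilon requires a privacy accountant, which is not shown here; libraries such as TensorFlow Privacy or Opacus provide one.

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD-style update: per-example clipping plus Gaussian noise."""
    # Clip each example's gradient to L2 norm <= clip_norm
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    # Add noise proportional to the clipping bound to the summed gradient
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return weights - lr * noisy_mean

rng = np.random.default_rng(0)
weights = np.zeros(3)
grads = np.array([[3.0, 4.0, 0.0],    # norm 5.0, will be scaled down to norm 1.0
                  [0.3, 0.4, 0.0]])   # norm 0.5, left unchanged

no_noise = dp_sgd_step(weights, grads, clip_norm=1.0, noise_multiplier=0.0, lr=1.0, rng=rng)
with_noise = dp_sgd_step(weights, grads, clip_norm=1.0, noise_multiplier=1.1, lr=1.0, rng=rng)
print("clipping only:     ", no_noise)
print("clipping and noise:", with_noise)
```

Clipping bounds how much any single record can influence the step, which is what makes the added noise meaningful as a privacy mechanism.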
Key Takeaways
- Federated learning trains models without centralizing data: participants train locally and send only weight updates to a server. This helps with regulatory compliance and reduces breach risk.
- Differential privacy adds noise to updates so model outputs leak only a bounded amount of information about specific individuals. It provides formal privacy guarantees and requires managing a privacy budget.
- Secure aggregation uses cryptography so aggregation servers never see individual updates, only the aggregate. Federated learning alone is not enough: strong protection combines it with secure aggregation and differential privacy.
- PPML has costs: Reduced model accuracy (privacy-utility trade-off), increased communication overhead, deployment complexity. Use when privacy/compliance requirements justify the investment.
- Practical PPML in 2026: Start with differential privacy on sensitive data (simpler, immediate value). Graduate to federated learning for strategic collaborations or regulated domains (healthcare, finance).
Frequently Asked Questions
Can differential privacy really prevent re-identification?
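It sharply limits re-identification risk, but it does not eliminate it. DP bounds how much the presence or absence of any one individual can change the distribution of the model's outputs: by a factor of at most e^ε. With ε < 1 that factor is below about 2.72, so membership inference and gradient attacks gain only a limited advantage. This bound holds even when the attacker has auxiliary information. What DP does not prevent is inference from population-level patterns: if a model reveals that a condition is common among a hospital's patients, an attacker who already knows your friend is a patient there can still draw conclusions about them. In practice, combine DP with other mitigations (access controls, data redaction).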
How do I choose an epsilon value?
There is no universal standard. Commonly cited targets range from ε < 1 for healthcare to ε ≈ 8 for advertising, but these are conventions rather than rules. Start from whatever guidance exists for your domain, then measure utility at different epsilon values (train multiple models with varying noise levels and evaluate each on a representative test set). Find the smallest epsilon at which utility is still acceptable and use that value rather than anything larger. For high-stakes decisions (medical, financial), favor privacy (lower epsilon) over utility. For lower-stakes tasks (recommendations), accept a higher epsilon. Iterate with users and regulators.
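The sweep described above can be sketched on a toy problem. Instead of training models, the example below releases a differentially private mean with the Laplace mechanism and measures the average error at several epsilon values. The data, the bounds (0 to 100), and the function name `private_mean` are assumptions chosen for illustration; a real sweep would substitute model training and a task-specific metric.

```python
import numpy as np

def private_mean(values, epsilon, lo, hi, rng):
    """Release the mean of bounded values with Laplace noise."""
    clipped = np.clip(values, lo, hi)
    # Changing one record moves the mean by at most (hi - lo) / n
    sensitivity = (hi - lo) / len(clipped)
    return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
values = rng.uniform(0, 100, size=1000)
true_mean = values.mean()

mean_abs_error = {}
for epsilon in [0.1, 0.5, 1.0, 3.0, 8.0]:
    errors = [abs(private_mean(values, epsilon, 0, 100, rng) - true_mean)
              for _ in range(2000)]
    mean_abs_error[epsilon] = float(np.mean(errors))
    print(f"epsilon={epsilon}: mean absolute error={mean_abs_error[epsilon]:.4f}")
```

The error falls as epsilon grows; the chosen epsilon is the smallest one whose error the application can tolerate.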
Is federated learning a replacement for anonymization?
No. Federated learning keeps data distributed (not centralized), which reduces breach risk. Anonymization aims to make data permanently unidentifiable (no personal data remains). Use both: federate across organizations, then anonymize/aggregate for long-term storage or research. Federated learning alone doesn't prevent inference attacks if a server is compromised; add differential privacy for formal guarantees.
What's the difference between federated learning and local differential privacy?
They address different things and are often combined. Federated learning is about where training happens: participants train locally and send updates, so raw data never leaves local devices or organizations. Local differential privacy is about what the server can learn: noise is added at the source (on-device) so the server never sees true values. In federated learning with local DP, participants add noise to their updates before sending them, which bounds what the server can learn about any individual even if aggregation is compromised.
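A classic example of local differential privacy, independent of federated learning, is randomized response: each user reports their true yes/no answer with probability e^ε / (1 + e^ε) and the opposite answer otherwise, and the server corrects for the known flip rate. The sketch below simulates this in NumPy; the population size, the true rate of 0.3, and the function names are illustrative assumptions.

```python
import numpy as np

def randomize(bits, epsilon, rng):
    """Each user keeps their true bit with probability e^eps / (1 + e^eps)."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    keep = rng.random(bits.shape) < p_truth
    return np.where(keep, bits, 1 - bits), p_truth

def estimate_rate(reports, p_truth):
    """Debias the observed rate using the known flip probability."""
    return (reports.mean() - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = np.random.default_rng(0)
true_bits = (rng.random(100_000) < 0.3).astype(int)  # about 30% of users answer "yes"

reports, p_truth = randomize(true_bits, epsilon=1.0, rng=rng)
estimate = estimate_rate(reports, p_truth)
print("true rate:     ", true_bits.mean())
print("raw reports:   ", reports.mean())
print("debiased value:", estimate)
```

No individual report can be trusted, since each may have been flipped, yet the population-level rate is recovered to within sampling error.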
Further Reading
- TensorFlow Federated Documentation: Production-grade federated learning framework with DP support.
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014): Seminal textbook on differential privacy theory.
- PySyft: Decentralized Machine Learning: Open-source framework for federated learning and privacy-preserving ML.
- NIST Privacy-Enhancing Technologies: Government guidance on privacy-preserving techniques for data engineering.