Data Residency: Where Your Data Lives

Data residency is the requirement that certain types of personal or sensitive data be stored, and often processed, within specific geographic boundaries. The rules differ in kind. The GDPR does not strictly mandate EU storage, but it restricts transfers of personal data outside the EU/EEA unless an adequacy decision or another approved safeguard applies, which in practice pushes many companies to keep EU customer data on EU servers. Brazil's LGPD takes a similar transfer-restriction approach. India's central bank requires payment system data to be stored within Indian borders. These regulations create significant operational challenges for AI teams: you cannot simply deploy a model to any cloud region; you must understand where your training data is stored, where inference happens, and whether both comply with the applicable law.

Understanding Data Residency Regulations

Data residency rules are not uniform; they vary by jurisdiction, data type, and use case. Below are the major regulatory frameworks:

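GDPR (EU): Personal data of individuals in the EU may be transferred outside the EU/EEA only if the destination has an "adequacy decision" (the European Commission has found its privacy protections essentially equivalent) or the transfer is covered by appropriate safeguards such as Standard Contractual Clauses (SCCs) or Binding Corporate Rules (BCRs). The US has no blanket adequacy decision: since July 2023, the EU-US Data Privacy Framework covers only US organizations that have self-certified under it, and transfers to other US recipients still need SCCs or BCRs. Explicit consent and contract necessity exist as narrow derogations for occasional transfers, not as a general basis for routine ones.

CCPA (California): California residents have the right to know what personal information a business collects and which categories of third parties receive it, and the right to opt out of its sale or sharing. Unlike GDPR, CCPA imposes no storage-location mandate and no cross-border transfer restriction; its obligations center on disclosure and opt-out rights.

LGPD (Brazil): LGPD does not require personal data of individuals in Brazil to be stored in Brazil. It restricts international transfers to cases such as an adequacy finding by the national authority (ANPD), contractual safeguards, or specific consent from the data subject. Sensitive personal data (health, racial origin) is subject to stricter processing conditions, so transfers involving it deserve extra scrutiny.

China's Data Localization: Under the PIPL and related laws, operators of critical information infrastructure, and processors handling personal information above regulator-set volume thresholds, must store personal information collected in China inside China. Cross-border transfers require a mechanism such as a government security assessment, certification, or a standard contract, plus separate consent from the individual. This is one reason many Western services either do not operate in mainland China or run separate, locally hosted infrastructure there.

India's Localization Rules: The Reserve Bank of India requires payment system data to be stored in India. Other personal data can generally be transferred abroad, but the government can restrict transfers to specific countries, and certain categories (government IDs, financial records) face additional restrictions.

PIPEDA (Canada): Doesn't mandate in-country storage but requires safeguards proportionate to the data's sensitivity. Transfers out of Canada are allowed, but the transferring organization remains accountable and must use contractual or other means to ensure a comparable level of protection while the data is with the recipient.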

Jurisdiction | Rule | Strictness | Exceptions / transfer mechanisms
EU (GDPR) | Transfers outside the EU/EEA restricted | Very High | Adequacy decisions, SCCs, BCRs, narrow derogations (e.g., explicit consent)
Brazil (LGPD) | No storage mandate; international transfers restricted | High | Adequacy finding, contractual safeguards, specific consent
China | Local storage for covered processors; isolated infrastructure is common | Very High | Security assessment, certification, or standard contract
India | Payment/financial data in India; other personal data can transfer | Medium | Restrictions on sensitive categories
California (CCPA) | No storage mandate; disclosure and opt-out rights | Medium | De-identified data
Canada (PIPEDA) | No storage mandate; proportionate safeguards | Low-Medium | Comparable protection via contract

Implications for AI Pipelines

Data residency requirements force distributed AI systems. Instead of one global training pipeline, you might need:

  • EU pipeline: Trains on EU customer data, runs inference within EU cloud regions.
  • US pipeline: Trains on US customer data, runs on US cloud providers.
  • China pipeline: Entirely separate infrastructure, separate models.

This increases operational complexity and cost. A multinational AI platform might need to train separate models (or use federated learning, Article 10) because unified training on centrally pooled data can violate residency rules.
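The per-jurisdiction split above can be sketched as a small routing table for inference requests. This is only an illustration: the endpoint URLs, the `REGIONAL_ENDPOINTS` mapping, and `pick_endpoint` are hypothetical names, not part of any real service. The one design choice worth noting is that the lookup fails closed when no pipeline exists for a jurisdiction instead of falling back to a default region.

```python
from typing import Dict

# Hypothetical mapping from user jurisdiction to an in-region inference
# endpoint. The URLs are placeholders, not real services.
REGIONAL_ENDPOINTS: Dict[str, str] = {
    "eu": "https://inference.eu.example.com",
    "us": "https://inference.us.example.com",
    "china": "https://inference.cn.example.com",
}


class NoRegionalPipelineError(Exception):
    """Raised when no in-region pipeline exists for a jurisdiction."""


def pick_endpoint(user_jurisdiction: str) -> str:
    """Return the in-region endpoint, failing closed on unknown jurisdictions."""
    key = user_jurisdiction.strip().lower()
    if key not in REGIONAL_ENDPOINTS:
        # Deliberately no default region: a silent fallback could move
        # regulated data across a border.
        raise NoRegionalPipelineError(f"No pipeline configured for {key!r}")
    return REGIONAL_ENDPOINTS[key]


print(pick_endpoint("EU"))
```

Failing closed means a new market cannot be served until someone explicitly decides where its data may go, which is usually the safer default.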

The Role of Standard Contractual Clauses (SCCs)

When you must transfer data out of the EU to a country without an adequacy decision, you can use Standard Contractual Clauses (SCCs): contract templates approved by the European Commission under which the recipient commits to apply equivalent data protection. However, SCCs don't provide the legal certainty they once did. After the Court of Justice of the EU's Schrems II ruling (July 2020), the EDPB (European Data Protection Board) issued recommendations, finalized in June 2021, stating that SCCs alone may not be enough if the recipient country's government can legally compel data disclosure (for example, US authorities under FISA). Many companies now supplement SCCs with technical measures (encryption, access controls).
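One way to make the "SCCs plus technical measures" idea concrete is a gate that approves a transfer only when a legal mechanism is recorded and, for contractual mechanisms, supplementary safeguards are in place. The sketch below is a simplification under stated assumptions: `TransferRequest`, `TransferMechanism`, and `evaluate_transfer` are hypothetical names, and the policy (requiring encryption with EU-held keys for SCCs and BCRs) is this example's own rule of thumb, not a legal test.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TransferMechanism(Enum):
    ADEQUACY_DECISION = "adequacy_decision"
    SCC = "scc"
    BCR = "bcr"
    NONE = "none"


@dataclass(frozen=True)
class TransferRequest:
    """Hypothetical record describing a proposed transfer out of the EU."""
    dataset_id: str
    destination_country: str
    mechanism: TransferMechanism
    encrypted: bool = False        # encrypted in transit and at rest
    keys_held_in_eu: bool = False  # decryption keys never leave the EU


def evaluate_transfer(req: TransferRequest) -> Tuple[bool, str]:
    """Approve or deny a transfer under this sketch's assumed policy."""
    if req.mechanism is TransferMechanism.NONE:
        return False, "Denied: no transfer mechanism recorded"
    if req.mechanism is TransferMechanism.ADEQUACY_DECISION:
        return True, "Approved: destination covered by an adequacy decision"
    # SCC or BCR: this sketch also requires supplementary technical measures.
    if req.encrypted and req.keys_held_in_eu:
        return True, f"Approved: {req.mechanism.value} with supplementary measures"
    return False, f"Denied: {req.mechanism.value} without supplementary measures"


request = TransferRequest("ds_eu_001", "US", TransferMechanism.SCC, encrypted=True)
print(evaluate_transfer(request))
```

The example request is denied because the keys are not held in the EU; whether a given set of measures is actually sufficient remains a question for your legal team.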

Code Example: Data Residency Policy Enforcement

Below is a system that enforces data residency by validating dataset location against allowed regions:

from enum import Enum
from typing import List, Set
from dataclasses import dataclass
from datetime import datetime

class DataClassification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    SENSITIVE = "sensitive"
    RESTRICTED = "restricted"  # PII, health, financial

class Region(Enum):
    US_EAST = "us-east-1"
    EU_WEST = "eu-west-1"
    CHINA = "cn-north-1"
    INDIA = "ap-south-1"
    CANADA = "ca-central-1"

class Jurisdiction(Enum):
    EU = "eu"
    US = "us"
    CHINA = "china"
    INDIA = "india"
    CANADA = "canada"

@dataclass
class ResidencyRequirement:
    """Define residency rules for a jurisdiction."""
    jurisdiction: Jurisdiction
    allowed_regions: Set[Region]
    description: str
    sensitive_data_only: bool = False  # True if rule applies only to sensitive data

# Define residency rules per jurisdiction.
# These are simplified policies for illustration, not legal advice.
# Jurisdiction.CANADA has no entry, so the validator reports it as unknown.
RESIDENCY_RULES = {
    Jurisdiction.EU: ResidencyRequirement(
        jurisdiction=Jurisdiction.EU,
        allowed_regions={Region.EU_WEST},  # Only EU regions
        description="GDPR: keep personal data in EU regions (simplified policy)",
        sensitive_data_only=False
    ),
    Jurisdiction.US: ResidencyRequirement(
        jurisdiction=Jurisdiction.US,
        allowed_regions={Region.US_EAST, Region.EU_WEST},  # No storage mandate; either region is acceptable here
        description="CCPA: no storage mandate; disclosure and opt-out obligations apply",
        sensitive_data_only=False
    ),
    Jurisdiction.CHINA: ResidencyRequirement(
        jurisdiction=Jurisdiction.CHINA,
        allowed_regions={Region.CHINA},  # Only China
        description="China Data Localization: Must remain in China",
        sensitive_data_only=False
    ),
    Jurisdiction.INDIA: ResidencyRequirement(
        jurisdiction=Jurisdiction.INDIA,
        allowed_regions={Region.INDIA},  # Sensitive (e.g., payment) data stays in India
        description="India payment data localization: payment data in India",
        sensitive_data_only=True  # Rule applies only to sensitive data
    ),
}

@dataclass
class Dataset:
    """A dataset with residency constraints."""
    dataset_id: str
    name: str
    jurisdiction: Jurisdiction
    classification: DataClassification
    current_region: Region
    created_at: str

class ResidencyValidator:
    """Enforce data residency requirements."""

    # Classifications that sensitive_data_only rules apply to
    SENSITIVE_CLASSES = (DataClassification.SENSITIVE, DataClassification.RESTRICTED)

    def __init__(self):
        self.rules = RESIDENCY_RULES
        self.audit_log = []

    def _rule_applies(self, rule: ResidencyRequirement, dataset: Dataset) -> bool:
        """A sensitive_data_only rule does not restrict non-sensitive data."""
        if not rule.sensitive_data_only:
            return True
        return dataset.classification in self.SENSITIVE_CLASSES

    def validate_dataset_location(self, dataset: Dataset) -> tuple[bool, str]:
        """
        Check if dataset is in a compliant region.
        Returns: (compliant, message)
        """
        rule = self.rules.get(dataset.jurisdiction)
        if rule is None:
            return False, f"Unknown jurisdiction: {dataset.jurisdiction.value}"

        # Check if rule applies to this data classification
        if not self._rule_applies(rule, dataset):
            return True, "Non-sensitive data; residency rule doesn't apply"

        # Check if current region is allowed
        if dataset.current_region in rule.allowed_regions:
            message = f"Dataset {dataset.name} compliant in region {dataset.current_region.value}"
            self._log_validation(dataset.dataset_id, True, message)
            return True, message

        allowed = sorted(r.value for r in rule.allowed_regions)
        message = (
            f"Dataset {dataset.name} in non-compliant region {dataset.current_region.value}. "
            f"Allowed: {allowed}"
        )
        self._log_validation(dataset.dataset_id, False, message)
        return False, message

    def can_transfer_to_region(self, dataset: Dataset, target_region: Region) -> tuple[bool, str]:
        """Check if dataset can be transferred to a target region."""
        rule = self.rules.get(dataset.jurisdiction)
        if rule is None:
            return False, "Unknown jurisdiction"

        # Allowed if the rule doesn't cover this data or the target region is permitted
        if not self._rule_applies(rule, dataset) or target_region in rule.allowed_regions:
            message = f"Transfer approved: {dataset.name} can be stored in {target_region.value}"
            self._log_validation(dataset.dataset_id, True, message)
            return True, message

        message = f"Transfer denied: {target_region.value} not allowed for {dataset.jurisdiction.value} data"
        self._log_validation(dataset.dataset_id, False, message)
        return False, message

    def get_allowed_regions(self, jurisdiction: Jurisdiction) -> List[str]:
        """List allowed regions for a jurisdiction."""
        rule = self.rules.get(jurisdiction)
        if rule:
            return sorted(r.value for r in rule.allowed_regions)
        return []

    def _log_validation(self, dataset_id: str, compliant: bool, message: str) -> None:
        """Append a compliance check to the in-memory audit log."""
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "dataset_id": dataset_id,
            "compliant": compliant,
            "message": message
        }
        self.audit_log.append(log_entry)

# Example
validator = ResidencyValidator()

# Check EU customer data
eu_dataset = Dataset(
    dataset_id="ds_eu_001",
    name="EU Customer Database",
    jurisdiction=Jurisdiction.EU,
    classification=DataClassification.SENSITIVE,
    current_region=Region.EU_WEST,
    created_at="2026-01-15"
)

compliant, msg = validator.validate_dataset_location(eu_dataset)
print(f"EU Dataset Check: {msg}")

# Attempt to transfer to US (not allowed)
allowed, msg = validator.can_transfer_to_region(eu_dataset, Region.US_EAST)
print(f"Transfer to US: {msg}")

# Check China data
china_dataset = Dataset(
    dataset_id="ds_cn_001",
    name="China User Data",
    jurisdiction=Jurisdiction.CHINA,
    classification=DataClassification.RESTRICTED,
    current_region=Region.CHINA,
    created_at="2026-02-01"
)

compliant, msg = validator.validate_dataset_location(china_dataset)
print(f"China Dataset Check: {msg}")

This validator checks that a dataset's recorded region is permitted for its jurisdiction; it is only as reliable as the region metadata it is given. You'd integrate it into your data ingestion pipeline: when uploading a dataset, check its jurisdiction and reject the upload if the target region is not allowed.
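The ingestion-time check described above might look like the following guard. To keep the sketch runnable on its own it uses a plain-string stand-in for the rule table instead of the classes from the validator; `ALLOWED_REGIONS`, `ResidencyViolation`, and `guarded_upload` are hypothetical names, and the `upload` callable stands in for whatever storage client you actually use.

```python
from typing import Callable, Dict, List, Set

# Simplified stand-in for the residency rule table, keyed by plain strings
# so this sketch runs on its own. Values are illustrative.
ALLOWED_REGIONS: Dict[str, Set[str]] = {
    "eu": {"eu-west-1"},
    "china": {"cn-north-1"},
}


class ResidencyViolation(Exception):
    """Raised when an upload targets a region the jurisdiction does not allow."""


def guarded_upload(
    jurisdiction: str, target_region: str, upload: Callable[[str], None]
) -> None:
    """Run `upload` only if the target region is allowed for the jurisdiction."""
    allowed = ALLOWED_REGIONS.get(jurisdiction)
    if allowed is None or target_region not in allowed:
        # Reject before any bytes leave the source system.
        raise ResidencyViolation(
            f"{target_region} is not allowed for {jurisdiction} data"
        )
    upload(target_region)


uploaded: List[str] = []  # records the regions that actually received data
guarded_upload("eu", "eu-west-1", uploaded.append)

try:
    guarded_upload("eu", "us-east-1", uploaded.append)
except ResidencyViolation as exc:
    print(f"Upload rejected: {exc}")
```

Raising an exception, instead of returning a flag, makes it harder for a caller to ignore a failed check by accident.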

Code Example: Multi-Region Data Pipeline

For multinational platforms, you need separate data pipelines per region. Below is a simplified orchestrator:

from typing import Dict
from dataclasses import dataclass

# Reuses Region, Jurisdiction, Dataset, DataClassification, and
# ResidencyValidator from the previous example.

@dataclass
class RegionalPipeline:
    """A data pipeline for a specific region."""
    region: Region
    jurisdiction: Jurisdiction
    model_id: str  # Model trained on this region's data
    last_trained: str

class MultiRegionalOrchestrator:
    """Manage data pipelines across multiple jurisdictions."""

    def __init__(self):
        self.pipelines: Dict[Jurisdiction, RegionalPipeline] = {}
        self.validator = ResidencyValidator()

    def create_regional_pipeline(
        self,
        region: Region,
        jurisdiction: Jurisdiction,
        model_id: str
    ) -> None:
        """Set up a data pipeline for a region."""
        pipeline = RegionalPipeline(
            region=region,
            jurisdiction=jurisdiction,
            model_id=model_id,
            last_trained="2026-01-01"  # Placeholder date for the example
        )
        self.pipelines[jurisdiction] = pipeline
        print(f"Created pipeline for {jurisdiction.value} in {region.value}")

    def route_dataset(self, dataset: Dataset) -> str:
        """Route dataset to appropriate region."""
        if dataset.jurisdiction not in self.pipelines:
            return f"No pipeline for {dataset.jurisdiction.value}"

        pipeline = self.pipelines[dataset.jurisdiction]

        # Validate where the dataset currently lives
        compliant, msg = self.validator.validate_dataset_location(dataset)
        if not compliant:
            return f"Cannot route: {msg}"

        # Routing may move data, so the pipeline's region must be allowed too
        allowed, msg = self.validator.can_transfer_to_region(dataset, pipeline.region)
        if not allowed:
            return f"Cannot route: {msg}"

        return f"Routing to {pipeline.region.value} pipeline (model: {pipeline.model_id})"

    def get_pipeline_status(self) -> Dict:
        """Get status of all regional pipelines."""
        status = {}
        for jurisdiction, pipeline in self.pipelines.items():
            status[jurisdiction.value] = {
                "region": pipeline.region.value,
                "model_id": pipeline.model_id,
                "last_trained": pipeline.last_trained
            }
        return status

# Example: Multinational AI platform
orchestrator = MultiRegionalOrchestrator()

# Create regional pipelines
orchestrator.create_regional_pipeline(Region.EU_WEST, Jurisdiction.EU, "model_eu_v1")
orchestrator.create_regional_pipeline(Region.US_EAST, Jurisdiction.US, "model_us_v1")
orchestrator.create_regional_pipeline(Region.CHINA, Jurisdiction.CHINA, "model_cn_v1")

# Route datasets
eu_data = Dataset(
    dataset_id="ds_eu_001",
    name="EU Customer Data",
    jurisdiction=Jurisdiction.EU,
    classification=DataClassification.SENSITIVE,
    current_region=Region.EU_WEST,
    created_at="2026-01-15"
)

result = orchestrator.route_dataset(eu_data)
print(f"Routing result: {result}")

# Show pipeline status
print("\nPipeline Status:")
print(orchestrator.get_pipeline_status())

This approach keeps each jurisdiction's data on its own regional pipeline, which makes compliance with regional laws easier to enforce and audit.

Data Residency Pitfalls and Best Practices

Pitfall 1: Assuming all cloud regions are equivalent. AWS eu-west-1 (Ireland) and eu-west-2 (London) both carry an "eu" prefix, but only the first is in the EU; since Brexit, the UK is a third country under GDPR, and transfers there rely on an adequacy decision. Consult your cloud provider's data residency documentation and your legal team.

Pitfall 2: Misunderstanding Standard Contractual Clauses. SCCs don't change where your data sits; they contractually commit the recipient to apply equivalent privacy protections. But if that recipient (e.g., a US cloud provider) is compelled by US law to disclose data, the SCC may not protect you. The EDPB's recommendations on supplementary measures (finalized June 2021) advise adding technical measures (encryption, access controls) to SCCs.

Pitfall 3: Forgetting about backups and replication. You store data in eu-west-1, which is compliant. But if backups or replicas are configured to copy to a US region for disaster recovery, that copy is an international transfer and may breach GDPR. Check your cloud provider's backup policies and disable cross-region replication if needed.
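A replication check of this kind can be sketched as a scan over a storage inventory. The inventory structure, store names, and `EU_ALLOWED` set below are hypothetical; in practice you would populate the inventory from your cloud provider's configuration instead of hard-coding it.

```python
from typing import Dict, List, Set

# Hypothetical storage inventory: each store lists its primary region and
# every region that receives backups or replicas.
STORAGE_INVENTORY: Dict[str, Dict[str, List[str]]] = {
    "eu-customer-db": {
        "primary": ["eu-west-1"],
        "replicas": ["eu-central-1", "us-east-1"],  # us-east-1 is the problem
    },
    "eu-feature-store": {
        "primary": ["eu-west-1"],
        "replicas": ["eu-central-1"],
    },
}

# Assumed policy for this sketch: EU data may only live in these regions.
EU_ALLOWED: Set[str] = {"eu-west-1", "eu-central-1"}


def find_offending_copies(
    inventory: Dict[str, Dict[str, List[str]]], allowed: Set[str]
) -> Dict[str, List[str]]:
    """Return, per store, the regions holding a copy outside the allowed set."""
    offenders: Dict[str, List[str]] = {}
    for store, copies in inventory.items():
        regions = set(copies["primary"]) | set(copies["replicas"])
        outside = sorted(regions - allowed)
        if outside:
            offenders[store] = outside
    return offenders


print(find_offending_copies(STORAGE_INVENTORY, EU_ALLOWED))
# prints {'eu-customer-db': ['us-east-1']}
```

Running a check like this on a schedule catches replication settings that drift after the initial deployment.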

Pitfall 4: Model inference as data transfer. You train in the EU, export the model, and run inference in the US. Model weights are usually not treated as personal data, so exporting them is often acceptable, although models can memorize and leak training examples, so confirm this with your legal team. Separately, if your inference pipeline logs or caches user inputs, that's a data transfer and may violate residency rules.
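One way to reduce the logging risk is to build log records from an allowlist of metadata fields, so prompts and user identifiers never reach a central log store. The field names and `build_log_record` below are assumptions made for this sketch; the right allowlist depends on what your own events contain.

```python
from typing import Any, Dict

# Fields assumed safe to ship to a central (possibly out-of-region) log
# store. This allowlist is illustrative; review it against your own schema.
SAFE_LOG_FIELDS = frozenset({"request_id", "model_id", "latency_ms", "status"})


def build_log_record(event: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted metadata; drop prompts, outputs, and user IDs."""
    return {key: value for key, value in event.items() if key in SAFE_LOG_FIELDS}


event = {
    "request_id": "req-123",
    "model_id": "model_eu_v1",
    "latency_ms": 87,
    "status": "ok",
    "user_id": "u-42",              # personal data: must not be logged centrally
    "prompt": "My address is ...",  # user input: must not be logged centrally
}

record = build_log_record(event)
print(record)
```

An allowlist is preferable to a blocklist here: a newly added field stays out of the logs until someone decides it is safe.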

Best practice: Map all data flows: where data is ingested, where it's stored, where it's processed, where models are trained/served, and where user inputs for inference are processed. Audit this map quarterly, especially after infrastructure changes.
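A data-flow map can start as a simple list of records that a script audits. The sketch below assumes a hypothetical `DataFlow` record and an `ALLOWED` policy table with illustrative values; unknown jurisdictions are flagged, not ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DataFlow:
    """One hop in the data-flow map (hypothetical structure)."""
    stage: str         # e.g. "ingestion", "storage", "training", "inference", "backup"
    jurisdiction: str  # whose data this is
    region: str        # where this stage runs


# Assumed policy table for the sketch.
ALLOWED: Dict[str, Set[str]] = {
    "eu": {"eu-west-1"},
    "us": {"us-east-1", "eu-west-1"},
}

FLOWS: List[DataFlow] = [
    DataFlow("ingestion", "eu", "eu-west-1"),
    DataFlow("training", "eu", "eu-west-1"),
    DataFlow("inference", "eu", "us-east-1"),  # EU inputs processed in the US
    DataFlow("backup", "us", "us-east-1"),
]


def audit_flows(flows: List[DataFlow], allowed: Dict[str, Set[str]]) -> List[DataFlow]:
    """Return flows whose region is not allowed for their jurisdiction."""
    # A jurisdiction missing from the policy yields an empty set, so its
    # flows are always reported.
    return [f for f in flows if f.region not in allowed.get(f.jurisdiction, set())]


for violation in audit_flows(FLOWS, ALLOWED):
    print(f"{violation.stage}: {violation.jurisdiction} data in {violation.region}")
```

Keeping the map in version control alongside infrastructure code makes the quarterly audit a diff review instead of a fresh investigation.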

Key Takeaways

  • Data residency rules govern where personal or sensitive data may be stored and transferred: GDPR and LGPD restrict cross-border transfers, China and India require local storage for certain data, and other jurisdictions have varying rules.
  • Different jurisdictions have different rules: some (China, India for payment data) require in-country storage; the EU and Brazil restrict transfers; others (California, Canada) focus on disclosure and safeguards.
  • Standard Contractual Clauses (SCCs) enable EU-to-US transfers but don't guarantee protection against government compulsion; add technical measures (encryption, access controls).
  • For multinational AI platforms, implement separate data pipelines per jurisdiction; a unified global pipeline can violate several laws at once.
  • Audit data flows regularly: ingestion, storage, processing, training, inference, and backup locations must all comply with residency rules.

Frequently Asked Questions

Can I use a US cloud provider for EU customer data?

Yes, if you have Standard Contractual Clauses in place (or the provider is certified under the EU-US Data Privacy Framework) and add technical measures (encryption with EU-held keys, restricted access). However, EDPB guidance suggests this still carries risk due to US government access (FISA). Many companies now use EU-based cloud providers (Scaleway, OVHcloud) or engage data processors certified under EU data protection standards.

What happens if I accidentally store data in a non-compliant region?

You're likely in violation. Move the data to a compliant region or delete it promptly, and document what happened. If the incident amounts to a personal data breach, GDPR requires notifying your data protection authority within 72 hours of becoming aware of it, and notifying affected individuals when the risk to them is high. The penalties can be severe: up to €20 million or 4% of global annual turnover, whichever is higher, under GDPR. Always test your data residency enforcement in development before deploying to production.

Do I need a separate AI model for each jurisdiction?

Not necessarily. A single global model can be used if training data is anonymized and the applicable residency rules govern storage, not processing. However, if data practices differ between jurisdictions (e.g., the data you collect in China differs from the data you collect in the EU), separate models may perform better and simplify compliance. Federated learning (Article 10) offers a middle ground: train a global model without centralizing data.

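When can personal data be transferred outside the EU?

If the data subject explicitly consents, if the transfer is necessary for contract performance, or if there's a legal obligation; these are narrow derogations meant for occasional transfers. For routine transfers to countries without an adequacy decision, Standard Contractual Clauses are the usual mechanism. However, the Schrems II ruling and subsequent EDPB guidance have made SCCs harder to rely on alone; many companies now use Binding Corporate Rules (BCRs) or seek coverage under new adequacy frameworks.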

Further Reading