What is MongoDB Data Federation?
MongoDB Data Federation is a feature of MongoDB Atlas that lets you query data across multiple data sources as if they were one MongoDB database, without moving or duplicating the data.
Think of it as a virtual MongoDB layer on top of different storage systems.
What problem does it solve?
Normally, MongoDB queries only work on data stored inside MongoDB clusters.
Data Federation lets you run MongoDB Query Language (MQL) on:
- MongoDB Atlas clusters
- Data stored in cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage)
- Multiple clusters and storage sources at once
👉 No ETL (Extract – Transform – Load). No data replication.
Key features
- Join Atlas collections with S3 data
- Combine historical (cold) data with hot operational data
- No need to predefine a schema
- MongoDB infers structure at query time
- Keep old data in cheap object storage
- Query it only when needed
- Data Federation is query-only
- You cannot insert, update, or delete documents in the federated sources through it; results can only be written out with an aggregation stage such as $out or $merge
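To make the "join Atlas collections with S3 data" point more concrete, here is a minimal sketch of a $lookup pipeline that could be run against a federated database. The collection names (`orders`, `archived_customers`) and field names are hypothetical, and the sketch assumes both collections are mapped in the same federated database.

```javascript
// Hypothetical names: "archived_customers" is assumed to be a federated
// collection backed by S3, joined from an Atlas-backed collection that is
// mapped in the same federated database.
const joinPipeline = [
  // Narrow the hot (Atlas) side first so the join touches fewer documents.
  { $match: { status: "active" } },
  {
    $lookup: {
      from: "archived_customers", // cold data stored in S3
      localField: "customerId",
      foreignField: "_id",
      as: "customerHistory",
    },
  },
  { $limit: 10 },
];

// In mongosh, connected to the federated database instance:
// db.getCollection("orders").aggregate(joinPipeline);
```

The pipeline is plain MQL; the federated database decides where each collection's data actually lives.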
Import data from an S3 bucket into a MongoDB Atlas cluster, step by step
Goal: import multiple .json / .json.gz files from S3 → Atlas cluster.
STEP 1: Create a Federated Database Instance
- Open your project in the Atlas console.
- In the left menu, select Data Federation.
- Click "Create a Federated Database Instance".
- Select "Set up manually".
- Under "Cloud provider & Data Source", choose AWS as the provider, enter an instance name (for example, "FederatedDatabaseInstance0"), and add a data source.
- Click "Next" to add Atlas to the trust relationships of your AWS IAM role. You will see two fields:
Atlas AWS account ARN: #yourAccountARN
Your unique external ID: #yourExternalID
We will use these two values to create an AWS IAM role. The role's trust relationship allows Atlas to assume it, and the policy attached to the role grants access to S3.
Create IAM Policy
File: s3-mongo-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<your bucket>"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::<your bucket>/*"]
    }
  ]
}
Create the policy using the AWS CLI:
aws iam create-policy \
  --policy-name MongoAtlasDataFederationS3 \
  --policy-document file://s3-mongo-policy.json
Create IAM Role for Atlas
File: atlas-trust.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "#yourAccountARN"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "#yourExternalID"
        }
      }
    }
  ]
}
Create the IAM role using the AWS CLI:
aws iam create-role \
  --role-name MongoAtlasFederationRole \
  --assume-role-policy-document file://atlas-trust.json
Attach the policy to the role:
aws iam attach-role-policy \
  --role-name MongoAtlasFederationRole \
  --policy-arn arn:aws:iam::<YOUR_AWS_ID>:policy/MongoAtlasDataFederationS3
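If you set this up for several buckets, the two JSON documents can be generated instead of edited by hand. The following is a minimal sketch, not part of the Atlas or AWS tooling: the function names `buildS3ReadPolicy` and `buildAtlasTrustPolicy` are hypothetical helpers, and the bucket name, ARN, and external ID are placeholders.

```javascript
// Hypothetical helper: read-only S3 policy for a single bucket
// (same statements as s3-mongo-policy.json).
function buildS3ReadPolicy(bucket) {
  return {
    Version: "2012-10-17",
    Statement: [
      { Effect: "Allow", Action: ["s3:ListBucket"], Resource: [`arn:aws:s3:::${bucket}`] },
      { Effect: "Allow", Action: ["s3:GetObject"], Resource: [`arn:aws:s3:::${bucket}/*`] },
    ],
  };
}

// Hypothetical helper: trust policy that lets the Atlas AWS account assume
// the role, restricted to your external ID (same shape as atlas-trust.json).
function buildAtlasTrustPolicy(atlasAccountArn, externalId) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { AWS: atlasAccountArn },
        Action: "sts:AssumeRole",
        Condition: { StringEquals: { "sts:ExternalId": externalId } },
      },
    ],
  };
}

// Placeholder values; use the ones shown in the Atlas console.
console.log(JSON.stringify(buildS3ReadPolicy("import-bucket"), null, 2));
console.log(JSON.stringify(buildAtlasTrustPolicy("#yourAccountARN", "#yourExternalID"), null, 2));
```

Each printed document can be saved under the file name used above and passed to the same `aws iam` commands.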
After the role is created successfully, you get a role ARN. Atlas uses this ARN to access AWS S3 in the next step.
Enter your bucket name and prefix path. The resulting storage configuration JSON looks like this:
{
  "databases": [
    {
      "collections": [
        {
          "name": "user-devices",
          "dataSources": [
            {
              "path": "/",
              "storeName": "s3_store"
            }
          ]
        }
      ],
      "name": "your-federation-db",
      "views": []
    }
  ],
  "stores": [
    {
      "bucket": "import-bucket",
      "delimiter": "/",
      "name": "s3_store",
      "prefix": "data/user-devices/",
      "provider": "s3",
      "region": "us-east-1"
    }
  ]
}
Here "your-federation-db" is the name of your federated database, and "prefix" is the path prefix of the .json / .json.gz files inside the bucket.
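One easy mistake in this configuration is a `storeName` that does not match any entry in `stores`. The sketch below is a hypothetical local check, not an Atlas API: `findConfigProblems` is a made-up helper that only verifies that every data source points at a defined store.

```javascript
// Hypothetical helper: list data sources whose storeName has no matching store.
function findConfigProblems(config) {
  const problems = [];
  const storeNames = new Set((config.stores || []).map((store) => store.name));
  for (const database of config.databases || []) {
    for (const collection of database.collections || []) {
      for (const source of collection.dataSources || []) {
        if (!storeNames.has(source.storeName)) {
          problems.push(
            `${database.name}.${collection.name}: unknown store "${source.storeName}"`
          );
        }
      }
    }
  }
  return problems;
}

// Trimmed copy of the configuration above.
const storageConfig = {
  databases: [
    {
      name: "your-federation-db",
      collections: [
        { name: "user-devices", dataSources: [{ path: "/", storeName: "s3_store" }] },
      ],
      views: [],
    },
  ],
  stores: [
    { name: "s3_store", provider: "s3", bucket: "import-bucket", prefix: "data/user-devices/" },
  ],
};

console.log(findConfigProblems(storageConfig)); // prints [] because s3_store is defined
```

An empty result only means the store references line up; it says nothing about whether the bucket, prefix, or role permissions are correct.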
With the configuration saved, connect to the federated database instance using its connection URI and run a query to verify the data:
db.getCollection("user-devices").countDocuments();
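STEP 2: Run a pipeline to import the federated collection (S3) into the cluster database
Connect to the federated database instance and run the import query. There are two options:
- Import with $out: replaces all data in the target collection
- Import with $merge: merges / upserts data into the target collection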
db.getCollection("user-devices").aggregate([
  // drop any _id that comes from the source files
  { $project: { _id: 0 } },
  // use customId as the new _id
  { $set: { _id: "$customId" } },
  // remove customId now that _id carries its value
  { $unset: "customId" },
  // merge into a collection on another (Atlas) cluster
  {
    $merge: {
      into: {
        atlas: {
          clusterName: "your-cluster-name",
          db: "your-real-db",
          coll: "your-real-collection"
        }
      },
      on: "_id", // must match on _id
      whenMatched: "replace", // replace the whole document
      whenNotMatched: "insert"
    }
  }
],
{
  pipelineOptions: { batchSize: 100 }
});
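The two import options differ only in the final stage, so the pipeline can be built from a small helper. This is a sketch under assumptions: `buildImportPipeline` is a hypothetical function, and the `$out` form with an `atlas` target is written to mirror the `$merge` target above, so check the Atlas Data Federation documentation for the exact fields your deployment needs (for example, a project ID).

```javascript
// Hypothetical helper: build the import pipeline with either $out or $merge
// as the final stage. The reshaping stages are the same as in the query above.
function buildImportPipeline(mode, target) {
  const reshape = [
    { $project: { _id: 0 } },        // drop any _id from the source files
    { $set: { _id: "$customId" } },  // use customId as the new _id
    { $unset: "customId" },          // remove the now-redundant field
  ];
  const atlas = { clusterName: target.clusterName, db: target.db, coll: target.coll };

  if (mode === "out") {
    // Replaces all data in the target collection.
    return [...reshape, { $out: { atlas } }];
  }
  if (mode === "merge") {
    // Upserts by _id into the target collection.
    return [
      ...reshape,
      { $merge: { into: { atlas }, on: "_id", whenMatched: "replace", whenNotMatched: "insert" } },
    ];
  }
  throw new Error(`unknown import mode: ${mode}`);
}

const target = { clusterName: "your-cluster-name", db: "your-real-db", coll: "your-real-collection" };
const mergePipeline = buildImportPipeline("merge", target);

// In mongosh, connected to the federated database instance:
// db.getCollection("user-devices").aggregate(mergePipeline);
```

Using $merge keeps documents that already exist in the target and are absent from S3, while $out discards them, so $merge is the safer choice for repeated or incremental imports.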


