What Is MongoDB Data Federation? How to Import Data from an AWS S3 Bucket to a MongoDB Atlas Cluster with Data Federation


What Is MongoDB Data Federation?

MongoDB Data Federation is a feature of MongoDB Atlas that lets you query data across multiple data sources as if they were one MongoDB database, without moving or duplicating the data.

Think of it as a virtual MongoDB layer on top of different storage systems.


What problem does it solve?

Normally, MongoDB queries only work on data stored inside MongoDB clusters.
Data Federation lets you run MongoDB Query Language (MQL) on:

  • MongoDB Atlas clusters

  • Data stored in cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage)

  • Multiple clusters and storage sources at once

    👉 No ETL (Extract – Transform – Load). No data replication.


Key features

1. Query across multiple sources
  • Join Atlas collections with S3 data

  • Combine historical cold data + hot operational data

2. Schema-on-read
  • No need to predefine schema

  • MongoDB infers structure at query time

3. Cost-efficient analytics
  • Keep old data in cheap object storage

  • Query only when needed

4. Read-only (important)
  • Data Federation is query-only

  • You cannot write/update/delete data through it

Import Data from an S3 Bucket to a MongoDB Atlas Cluster, Step by Step


Goal: import multiple .json / .json.gz files from S3 → Atlas cluster.

Case study: import the files s3://import-bucket/data/user-devices/filexxxx.json.gz into the MongoDB Atlas collection "user-devices".
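Before wiring anything up, it helps to see how Data Federation maps S3 objects to a collection: every object under the configured prefix whose name looks like JSON (here .json or .json.gz) backs the collection. A minimal sketch of that selection rule, in plain JavaScript with made-up object keys for illustration:

```javascript
// Sketch: which S3 object keys would back the "user-devices" collection,
// given the prefix from this case study. The key list below is hypothetical.
const prefix = "data/user-devices/";

function backsCollection(key) {
  // An object backs the collection when it sits under the prefix
  // and is a JSON or gzipped JSON file.
  return key.startsWith(prefix) &&
    (key.endsWith(".json") || key.endsWith(".json.gz"));
}

const keys = [
  "data/user-devices/file0001.json.gz",
  "data/user-devices/file0002.json",
  "data/orders/file0001.json.gz",   // different prefix → different data source
  "data/user-devices/readme.txt"    // not JSON → ignored
];

console.log(keys.filter(backsCollection));
```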

STEP 1 — Create a Federated Database Instance




  1. Go to your project in the Atlas console.

  2. In the left menu, select Data Federation.

  3. Click “Create a Federated Database Instance”.

  4. Select “Set up manually”.

  5. Under “Cloud provider & data source”, choose AWS as the provider, enter an instance name
    (for example "FederatedDatabaseInstance0"), and add a data source.



Click "Next" to add Atlas to the trust relationship of your AWS IAM role.
You will see two fields:

Atlas AWS account ARN: #yourAccountARN
Your unique external ID: #yourExternalID

We will use these values to create an AWS IAM role: its trust policy allows Atlas to assume the role, and an attached policy grants it S3 access.

Create IAM Policy

File: s3-mongo-policy.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<your bucket>"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::<your bucket>/*"]
    }
  ]
} 

Create the policy using the AWS CLI:

aws iam create-policy \
  --policy-name MongoAtlasDataFederationS3 \
  --policy-document file://s3-mongo-policy.json

Create IAM Role for Atlas

File: atlas-trust.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "#yourAccountARN"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "#yourExternalID"
        }
      }
    }
  ]
}
Create the IAM role using the AWS CLI:

aws iam create-role \
  --role-name MongoAtlasFederationRole \
  --assume-role-policy-document file://atlas-trust.json

Attach policy:


aws iam attach-role-policy \
  --role-name MongoAtlasFederationRole \
  --policy-arn arn:aws:iam::<YOUR_AWS_ID>:policy/MongoAtlasDataFederationS3

After the role is created successfully, you will get a role ARN; Atlas will use this ARN to access AWS S3 in the next step.

Configure the S3 data source of the Federated Database Instance

Enter your bucket name and prefix path.

Finally, the storage configuration JSON looks like this:

{
  "databases": [
    {
      "name": "your-federation-db",
      "collections": [
        {
          "name": "user-devices",
          "dataSources": [
            {
              "storeName": "s3_store",
              "path": "/"
            }
          ]
        }
      ],
      "views": []
    }
  ],
  "stores": [
    {
      "name": "s3_store",
      "provider": "s3",
      "bucket": "import-bucket",
      "region": "us-east-1",
      "prefix": "data/user-devices/",
      "delimiter": "/"
    }
  ]
}

Here "your-federation-db" is your federated database name, and "prefix" is the path to the .json / .json.gz files.
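If you keep the storage configuration in version control, it can be convenient to generate it from a few parameters instead of editing JSON by hand. A minimal sketch in plain JavaScript; the bucket, prefix, and names mirror the case-study values, not anything Atlas requires:

```javascript
// Sketch: build the federated storage configuration programmatically.
// All names here (s3_store, your-federation-db, ...) are the example values.
function buildStorageConfig({ bucket, region, prefix, db, collection }) {
  const storeName = "s3_store";
  return {
    databases: [
      {
        name: db,
        collections: [
          { name: collection, dataSources: [{ storeName, path: "/" }] }
        ],
        views: []
      }
    ],
    stores: [
      { name: storeName, provider: "s3", bucket, region, prefix, delimiter: "/" }
    ]
  };
}

const config = buildStorageConfig({
  bucket: "import-bucket",
  region: "us-east-1",
  prefix: "data/user-devices/",
  db: "your-federation-db",
  collection: "user-devices"
});

console.log(JSON.stringify(config, null, 2));
```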
Once the configuration is saved, we can connect to the Federated Database Instance through its connection string and run a query to verify the data:

db.getCollection("user-devices").countDocuments();

STEP 2 — Run a pipeline to import from the federated collection (S3) into the cluster database

Connect to the Federated Database Instance and run the following pipelines.

Import with $out: replaces all data in the target collection

db.getCollection("user-devices").aggregate([
  {
    $out: {
      atlas: {
        clusterName: "your-cluster-name",
        db: "your-real-db",
        coll: "your-real-collection"
      }
    }
  }
]);
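To make the $out semantics concrete: the target collection ends up containing exactly the pipeline output, and whatever was there before is gone. In plain JavaScript, treating collections as arrays purely for illustration, the effect is roughly:

```javascript
// Sketch: $out replaces the target collection with the pipeline output.
// "Collections as arrays" is a simplification for illustration only.
function outStage(pipelineOutput, targetCollection) {
  // Previous contents of the target are discarded entirely.
  return [...pipelineOutput];
}

const target = [{ _id: 1, stale: true }];          // pre-existing data
const federated = [{ _id: "a" }, { _id: "b" }];    // documents read from S3

const result = outStage(federated, target);
console.log(result); // only the federated documents remain
```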

Import with $merge: merges / upserts into the target collection

  
  db.getCollection("user-devices").aggregate([
  { $project: { _id: 0 } },

  // set a customId is _id
  { $set: { _id: "$customId" }},

  // remove customId use _id 
  { $unset: "customId" },

  // merge vào cluster khác
  {
    $merge: {
      into: {
        atlas: {
          clusterName: "your-cluster-name",
          db: "your-real-db",
          coll: "your-real-collection"
        }
      },
      on: "_id",              // phải dùng _id
      whenMatched: "replace", // replace toàn document
      whenNotMatched: "insert"
    }
  }
],
{
  pipelineOptions: { batchSize: 100 }
});
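The reshaping stages above ($project, $set, $unset) plus the $merge upsert can be mimicked in plain JavaScript to see what happens to each document. This illustrates the semantics only; the field names (customId, os) are made up for the example:

```javascript
// Sketch: what the pipeline does to one document, and how $merge upserts.
function reshape(doc) {
  const { _id, customId, ...rest } = doc; // $project: drop the federated _id
  return { _id: customId, ...rest };      // $set _id from customId, $unset customId
}

// $merge on _id: replace when matched, insert when not matched.
function mergeInto(target, docs) {
  const byId = new Map(target.map(d => [d._id, d]));
  for (const d of docs) byId.set(d._id, d); // replace or insert
  return [...byId.values()];
}

const fromS3 = [{ _id: "objid1", customId: "dev-1", os: "ios" }];
const cluster = [
  { _id: "dev-1", os: "android" }, // will be replaced
  { _id: "dev-2", os: "ios" }      // untouched
];

const merged = mergeInto(cluster, fromS3.map(reshape));
console.log(merged);
```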
