Building a serverless ETL pipeline that efficiently loads structured data into Amazon Aurora is a common requirement. AWS Lambda, combined with DuckDB, provides a powerful way to perform in-memory analytics and bulk insert data into Aurora.
This guide demonstrates how to:
- Read a CSV file from Amazon S3
- Process the data in-memory using DuckDB
- Securely connect to Aurora using IAM authentication
- Perform efficient batch inserts into Aurora
This solution is optimized for performance, scalability, and security.
The pipeline uses three AWS services:
- Amazon S3: Stores the source CSV file.
- AWS Lambda: Executes the ETL logic.
- Amazon Aurora (MySQL- or PostgreSQL-compatible; this guide uses the MySQL-compatible edition): Stores the transformed data.
Since AWS Lambda does not include DuckDB by default, it must be packaged as a Lambda layer.
To create the layer:
```bash
mkdir python
pip install duckdb -t python/
zip -r duckdb_layer.zip python
```

Upload `duckdb_layer.zip` as an AWS Lambda layer and attach it to your Lambda function. Because the DuckDB package includes a compiled native library, build the layer on Linux x86_64 (or install a manylinux wheel matching the Lambda runtime and architecture); otherwise the import will fail inside Lambda.
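The layer can also be published and attached with boto3 instead of the console. A minimal sketch, assuming the zip built above and a hypothetical function name `duckdb-aurora-etl`:

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish duckdb_layer.zip as a new layer version
with open("duckdb_layer.zip", "rb") as f:
    layer = lambda_client.publish_layer_version(
        LayerName="duckdb",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.12"],  # match your function's runtime
    )

# Attach the new layer version to the function
# (note: Layers replaces the function's existing layer list)
lambda_client.update_function_configuration(
    FunctionName="duckdb-aurora-etl",
    Layers=[layer["LayerVersionArn"]],
)
```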
To allow Lambda to access S3 and Aurora, attach the following IAM policy to the Lambda function's execution role. Note that the resource for `rds-db:connect` uses the DB cluster's resource ID (shown here as the placeholder `cluster-ABCDEFGHIJKLMNO`), not the database name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "rds-db:connect"],
      "Resource": [
        "arn:aws:s3:::my-bucket/*",
        "arn:aws:rds-db:us-east-1:123456789012:dbuser:cluster-ABCDEFGHIJKLMNO/iam_user"
      ]
    }
  ]
}
```
The following Python code for AWS Lambda reads a CSV file from S3, processes it using DuckDB, and bulk inserts it into Amazon Aurora.
```python
import boto3
import duckdb
import os

# AWS configuration
S3_BUCKET = "my-bucket"
S3_FILE_KEY = "data.csv"
DB_HOST = "aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com"
DB_NAME = "mydb"
DB_USER = "iam_user"
DB_PORT = 3306

# AWS clients (created once so they are reused across warm invocations)
s3_client = boto3.client("s3")
rds_client = boto3.client("rds")


def lambda_handler(event, context):
    try:
        # Generate a temporary IAM authentication token for Aurora
        token = rds_client.generate_db_auth_token(
            DBHostname=DB_HOST,
            Port=DB_PORT,
            DBUsername=DB_USER,
        )

        # Download the CSV file from S3 into Lambda's writable /tmp storage
        file_path = f"/tmp/{os.path.basename(S3_FILE_KEY)}"
        s3_client.download_file(S3_BUCKET, S3_FILE_KEY, file_path)

        # Load the CSV data into an in-memory DuckDB table
        con = duckdb.connect(database=":memory:")
        # Lambda's filesystem is read-only outside /tmp, so point DuckDB's home
        # directory there before installing the MySQL extension
        con.execute("SET home_directory='/tmp'")
        con.execute(f"CREATE TABLE temp AS SELECT * FROM read_csv_auto('{file_path}')")

        # Process the data (example: keep only active users) and count the result
        active_rows = con.execute(
            "SELECT COUNT(*) FROM temp WHERE active = 1"
        ).fetchone()[0]

        # Attach Aurora through DuckDB's MySQL extension, authenticating with the
        # IAM token (IAM authentication requires SSL on the MySQL connection)
        con.execute("INSTALL mysql")
        con.execute("LOAD mysql")
        con.execute(
            f"ATTACH 'host={DB_HOST} port={DB_PORT} user={DB_USER} "
            f"password={token} database={DB_NAME} ssl_mode=required' "
            "AS aurora (TYPE mysql)"
        )

        # Perform a single batch insert into Aurora
        con.execute("INSERT INTO aurora.users SELECT name, email FROM temp WHERE active = 1")

        return {"statusCode": 200, "body": f"Inserted {active_rows} rows into Aurora"}
    except Exception as e:
        return {"statusCode": 500, "body": str(e)}
```
How the function works:

- Uses `generate_db_auth_token()` to obtain a temporary authentication token for Aurora, so no credentials are hardcoded (a standalone connectivity check is sketched after this list).
- Downloads the file from Amazon S3 into Lambda's writable `/tmp` storage.
- Uses `read_csv_auto()` to load the CSV directly into an in-memory DuckDB table.
- Runs SQL transformations, such as filtering records.
- Establishes a direct connection to Aurora through DuckDB's MySQL extension: `INSTALL mysql; LOAD mysql; ATTACH 'host=... user=... password=<IAM token> database=... ssl_mode=required' AS aurora (TYPE mysql);`
- Uses a single query to insert all records: `INSERT INTO aurora.users SELECT name, email FROM temp WHERE active = 1;` This is far more efficient than inserting row by row.
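If the `ATTACH` step fails, it can help to confirm that IAM authentication works independently of DuckDB. A minimal sketch using PyMySQL (an extra dependency not used by the solution above); the CA-bundle path is hypothetical, and IAM authentication requires an SSL connection:

```python
import boto3
import pymysql  # assumed extra dependency, used only for this check

DB_HOST = "aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com"
DB_PORT = 3306
DB_USER = "iam_user"
DB_NAME = "mydb"

# Generate the same temporary IAM token the Lambda function uses
token = boto3.client("rds").generate_db_auth_token(
    DBHostname=DB_HOST, Port=DB_PORT, DBUsername=DB_USER
)

# Connect over SSL using the Amazon RDS CA bundle (hypothetical local path)
conn = pymysql.connect(
    host=DB_HOST,
    port=DB_PORT,
    user=DB_USER,
    password=token,
    database=DB_NAME,
    ssl={"ca": "/tmp/global-bundle.pem"},
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```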
The design relies on a few key optimizations:

| Optimization | Benefit |
| --- | --- |
| DuckDB in-memory processing | Faster than using Pandas |
| Batch insert into Aurora | Reduces network overhead |
| Using Parquet instead of CSV (optional; see the sketch below) | Faster to read and smaller file size |
| IAM authentication for Aurora | No need for hardcoded passwords |
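If the source data is Parquet rather than CSV, only the load step changes: DuckDB reads Parquet natively with `read_parquet()`. A minimal sketch, assuming a file already downloaded to the hypothetical path `/tmp/data.parquet`:

```python
import duckdb

con = duckdb.connect(database=":memory:")

# Load a Parquet file instead of a CSV; the rest of the pipeline is unchanged
con.execute("CREATE TABLE temp AS SELECT * FROM read_parquet('/tmp/data.parquet')")
print(con.execute("SELECT COUNT(*) FROM temp").fetchone()[0])
```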
To deploy the solution:

- Build the DuckDB layer with the commands shown earlier and upload `duckdb_layer.zip` to AWS Lambda Layers.
- Attach the layer to your Lambda function.
- Apply the IAM policy for S3 and Aurora access described earlier.
- Configure at least 512 MB of memory for DuckDB processing.
- Set the timeout to at least 30 seconds to handle large inserts (a scripted version of these settings is sketched below).
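Both settings can be applied in the console or scripted. A minimal boto3 sketch, again using the hypothetical function name `duckdb-aurora-etl`:

```python
import boto3

lambda_client = boto3.client("lambda")

# Give DuckDB enough memory and time to process larger files
lambda_client.update_function_configuration(
    FunctionName="duckdb-aurora-etl",
    MemorySize=512,  # MB
    Timeout=30,      # seconds
)
```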
In summary, the pipeline performs the following steps:

| Step | Action |
| --- | --- |
| 1 | Fetch CSV from S3 and load into DuckDB |
| 2 | Process and filter data in DuckDB |
| 3 | Securely authenticate with Aurora using IAM |
| 4 | Bulk insert data into Aurora |
| 5 | Optimize performance using batch operations |
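With everything deployed, the pipeline can be exercised end to end by invoking the function manually. A minimal boto3 sketch, assuming the hypothetical function name `duckdb-aurora-etl`:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Synchronously invoke the ETL function and print its response payload
response = lambda_client.invoke(
    FunctionName="duckdb-aurora-etl",
    InvocationType="RequestResponse",
    Payload=json.dumps({}).encode("utf-8"),
)
print(json.loads(response["Payload"].read()))
```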
This fully serverless approach provides a scalable, cost-effective, and secure ETL pipeline with no infrastructure to manage. Combining AWS Lambda with DuckDB is an efficient way to process structured data and load it into Amazon Aurora: IAM authentication keeps credentials out of the code, and batch inserts keep the load fast. The pattern is well suited to use cases such as near-real-time analytics, serverless ETL, and log processing, allowing organizations to run complex transformations at scale without standing up dedicated data processing infrastructure.