Rafał Mituła (rmitula) - GitHub Gists

SELECT changelog.product_id,
       changelog.name,
       changelog.price,
       changelog._change_type,
       changelog._commit_snapshot_id,
       snapshots.committed_at
FROM "apache_iceberg_showcase"."products_changelog" AS changelog
INNER JOIN "apache_iceberg_showcase"."products$snapshots" AS snapshots
    ON changelog._commit_snapshot_id = snapshots.snapshot_id
WHERE changelog.name = 'Product A';
rmitula / job.py
Created July 28, 2023 07:27
Listing 6. Sample Python script in the AWS Glue Job that utilizes Apache Spark to run an Apache Iceberg procedure, creating a changelog table on Amazon S3 and updating the products_changelog table in the AWS Glue Data Catalog
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Initialize Spark and Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
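# --- Hypothetical continuation sketch (not part of the original gist) ---
# Assumes an Iceberg-enabled Spark session and a Glue Data Catalog-backed
# Iceberg catalog named "glue_catalog" configured on the job; the database,
# table, and view names below are taken from the other listings or assumed.
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job.init(args["JOB_NAME"], args)

# Build a changelog view over the Iceberg products table
# (create_changelog_view is an Iceberg 1.2+ Spark procedure).
spark.sql("""
    CALL glue_catalog.system.create_changelog_view(
        table => 'apache_iceberg_showcase.products',
        changelog_view => 'products_changelog_view'
    )
""")

# Materialize the view as an Iceberg table on Amazon S3 and register
# products_changelog in the AWS Glue Data Catalog, so the joined changelog
# query above can resolve each _commit_snapshot_id to a commit timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.apache_iceberg_showcase.products_changelog
    USING iceberg
    AS SELECT * FROM products_changelog_view
""")

job.commit()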
-- Athena time travel on the Iceberg table: state as of a given point in time
SELECT *
FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-14 13:51:00 UTC'
WHERE name = 'Product A';

-- The same query against an earlier point in time
SELECT *
FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-14 13:49:00 UTC'
WHERE name = 'Product A';

-- Snapshot history from the Iceberg $history metadata table
SELECT * FROM "apache_iceberg_showcase"."products$history";
rmitula / data.json
Created July 28, 2023 07:26
Listing 5. Updated products record for Product A, stored in the Raw Data Zone under the /day=02 partition.
{
    "product_id": "29e17633-8d1e-4d63-8291-7a34fd79a4e5",
    "name": "Product A",
    "category": "Electronics",
    "variants": [
        {
            "color": "black",
            "size": "M",
            "stock": 100
        },
rmitula / gist:8958982dd7c5c54b425ac6012b5695e8
Created July 28, 2023 07:25
Listing 4. Amazon S3 Bucket structure under Curated Data Zone
curated-zone/
└── products/
    ├── data/
    │   └── 00000-(...)-00001.parquet
    └── metadata/
        ├── 00000-(...)-7c99bf9d1216.metadata.json
        ├── 5b9bf671-(...)-06c3dd2fe777.avro
rmitula / job.py
Created July 28, 2023 07:23
Listing 3. Sample Python script in the AWS Glue Job that leverages Apache Spark to transform JSON data from the Raw Data Zone into Apache Iceberg format in the Curated Data Zone, simultaneously updating the AWS Glue Data Catalog
import sys
import boto3
from pyspark.sql.functions import concat_ws, lpad
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Initialize Spark and Glue context
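# --- Hypothetical continuation sketch (not part of the original gist) ---
# Assumes the raw-zone layout from Listing 2, an Iceberg catalog named
# "glue_catalog" configured on the job, and an existing target table; the real
# job may create the table or MERGE updates rather than append, and the boto3
# import above hints at extra S3/Glue interaction not reproduced here.
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job.init(args["JOB_NAME"], args)

# Read the JSON drops from the Raw Data Zone; Spark discovers the
# year/month/day partition columns from the directory layout.
products_df = spark.read.json("s3://raw-zone/products/")

# Derive a single ingestion_date column from the partition values,
# zero-padding month and day.
products_df = products_df.withColumn(
    "ingestion_date",
    concat_ws(
        "-",
        products_df["year"].cast("string"),
        lpad(products_df["month"].cast("string"), 2, "0"),
        lpad(products_df["day"].cast("string"), 2, "0"),
    ),
)

# Append into the Iceberg table in the Curated Data Zone; the table is tracked
# in the AWS Glue Data Catalog through the configured Iceberg catalog.
products_df.writeTo("glue_catalog.apache_iceberg_showcase.products").append()

job.commit()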
rmitula / gist:ca4ef3c80ffcd61043901309291dd966
Created July 28, 2023 07:22
Listing 2. Amazon S3 Bucket structure under Raw Data Zone
raw-zone/
└── products/
    └── year=2023/
        └── month=05/
            ├── day=01/
            │   └── data.json
            ├── day=02/
            │   └── data.json
            └── (...)
        └── (...)
rmitula / data.json
Last active July 28, 2023 07:20
Listing 1. Sample products records stored in the Raw Data Zone under the /day=01 partition
{
    "product_id": "29e17633-8d1e-4d63-8291-7a34fd79a4e5",
    "name": "Product A",
    "category": "Electronics",
    "variants": [
        {
            "color": "black",
            "size": "M",
            "stock": 100
        },