@tommydangerous
Last active May 13, 2021 17:27
PySpark load data from S3
from pyspark.sql import SparkSession


def load_data(spark, s3_location):
    """
    spark:
        Spark session
    s3_location:
        S3 bucket name and object prefix
    """
    return (
        spark
        .read
        .options(
            delimiter=',',
            header=True,
            inferSchema=False,
        )
        .csv(s3_location)
    )
with SparkSession.builder.appName('Mage').getOrCreate() as spark:
    # 1. Load data from S3 files
    df = load_data(spark, 's3://feature-sets/users/profiles/v1/*')

    # 2. Group data by 'user_id' column
    grouped = df.groupby('user_id')

    # 3. Apply the function named 'custom_transformation_function';
    #    we will define this function later in this article
    df_transformed = grouped.apply(custom_transformation_function)
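
Note: custom_transformation_function is defined later in the original article and is not included in this gist. As a rough, hypothetical sketch only: GroupedData.apply expects a grouped-map pandas UDF, so a compatible definition could look like the following. The output schema and the per-user aggregation here are placeholder assumptions, not the article's actual logic.

import pandas as pd

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType, StringType, StructField, StructType


# Assumed output schema; the real article may use different columns.
output_schema = StructType([
    StructField('user_id', StringType()),
    StructField('feature_value', DoubleType()),
])


@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def custom_transformation_function(pdf):
    # Each call receives every row for a single 'user_id' as a pandas DataFrame
    # and must return a pandas DataFrame matching 'output_schema'.
    return pd.DataFrame({
        'user_id': [pdf['user_id'].iloc[0]],
        'feature_value': [float(len(pdf))],  # placeholder aggregation
    })

On Spark 3.0+, the same idea is usually expressed with grouped.applyInPandas(custom_transformation_function, schema=output_schema), since GroupedData.apply with PandasUDFType.GROUPED_MAP is deprecated there.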