@jakechen
Last active October 5, 2021 03:40
Creating PySpark DataFrame from CSV in AWS S3 in EMR
# Example uses GDELT dataset found here: https://aws.amazon.com/public-datasets/gdelt/
# Column headers found here: http://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt
# Load RDD
lines = sc.textFile("s3://gdelt-open-data/events/2016*") # Loads 73,385,698 records from 2016
# Split lines into columns; change the split() argument to match the delimiter, e.g. '\t'
parts = lines.map(lambda l: l.split('\t'))
# Convert RDD into DataFrame
from urllib.request import urlopen  # Python 3; under Python 2 this was `from urllib import urlopen`
html = urlopen("http://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt").read().decode('utf-8').rstrip()
columns = html.split('\t')
df = spark.createDataFrame(parts, columns)
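# Note: building the DataFrame from an RDD of split strings leaves every column typed as string.
# A minimal sketch of casting selected columns to numeric types (assuming the GDELT header
# includes GLOBALEVENTID and AvgTone, as in the published column list):
df_typed = df.withColumn('GLOBALEVENTID', df['GLOBALEVENTID'].cast('long')) \
             .withColumn('AvgTone', df['AvgTone'].cast('double'))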
# Load RDD
lines = sc.textFile("s3://jakechenaws/tutorials/sample_data/iris/iris.csv")
# Split lines into columns; change the split() argument to match the delimiter, e.g. '\t'
parts = lines.map(lambda l: l.split(','))
# Convert RDD into DataFrame
df = spark.createDataFrame(parts, ['sepal_length','sepal_width','petal_length','petal_width','class'])
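For comparison, a minimal sketch of the same iris load using Spark's built-in CSV reader (spark.read.csv, available since Spark 2.0), which handles splitting and typing without the manual RDD step; the S3 path and the assumption that the file has no header row are carried over from the example above:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
# Explicit schema so the numeric columns arrive as doubles instead of strings
schema = StructType([
    StructField('sepal_length', DoubleType()),
    StructField('sepal_width', DoubleType()),
    StructField('petal_length', DoubleType()),
    StructField('petal_width', DoubleType()),
    StructField('class', StringType()),
])
df = spark.read.csv("s3://jakechenaws/tutorials/sample_data/iris/iris.csv", schema=schema, header=False)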
@RohitJain88

Is this Spark application running locally or on EMR?

@angadsingh

It's not running anywhere. It's just lying in a gist here.
