Converts the GDELT dataset in S3 to Parquet using PySpark.
# Get the column names from the GDELT daily-updates header file
from urllib.request import urlopen
header = urlopen("http://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt").read().decode("utf-8").rstrip()
columns = header.split('\t')
# Load 73,385,698 records from 2016
df1 = spark.read.option("delimiter", "\t").csv("s3://gdelt-open-data/events/2016*")
# Apply the schema by naming the columns
df2 = df1.toDF(*columns)
# Derive Month and Day from SQLDATE (YYYYMMDD); Year already exists as a column in the GDELT schema
from pyspark.sql.functions import expr
df3 = df2.withColumn("Month", expr("substring(SQLDATE, 5, 2)")).withColumn("Day", expr("substring(SQLDATE, 7, 2)"))
# Write to Parquet in S3, partitioned by Year/Month/Day
cols = ["Year", "Month", "Day"]
df3.repartition(*cols).write.mode("append").partitionBy(*cols).parquet("s3://<bucket>/gdelt/")
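
As a quick sanity check, the partitioned output can be read back with a filter on the partition columns, so Spark prunes S3 prefixes instead of scanning the full dataset. A minimal sketch, assuming the same spark session and the same placeholder bucket path as above (with Spark's default partition type inference, the Year/Month/Day directory values come back as integers):

# Read the Parquet output back and filter on partition columns
parquet_df = spark.read.parquet("s3://<bucket>/gdelt/")
# Partition pruning: only the Year=2016/Month=1 prefixes are scanned
jan_2016 = parquet_df.where("Year = 2016 AND Month = 1")
print(jan_2016.count())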