@pm-hwks
Last active January 9, 2020 03:17
[pyspark - Access S3 data] Access S3 data from pyspark #spark #pyspark #s3
## Reference: https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html
## PySpark / Python
## Step 1 : Generate a Hadoop AWS credential file (run these on a cluster node)
# hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/centos/awskeyfile.jceks -value AKI*****************
# hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/centos/awskeyfile.jceks -value kd8**********************************
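## To confirm the two keys were stored, the provider's entries can be listed with
## the same CLI (this prints only the alias names, never the secret values):

```shell
# List the credential aliases stored in the jceks file created above.
hadoop credential list -provider jceks://hdfs/user/centos/awskeyfile.jceks
```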
## Step 2 : Simple PySpark program that reads a CSV file from S3 (s3_access.py)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

if __name__ == '__main__':
    conf = SparkConf().setAppName('s3_access_py')
    sc = SparkContext(conf=conf)
    sql = SparkSession(sc)

    # Read the CSV from S3 via the s3a:// connector and print its contents.
    csv_df = sql.read.csv('s3a://prms-s3/data/s1.csv')
    print("***********************************************************************")
    csv_df.show()
    print("***********************************************************************")
## Step 3 : Submit and run the program with spark-submit, pointing Spark at the credential file
# spark-submit --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/centos/awskeyfile.jceks s3_access.py