Last active: January 9, 2020 03:17
[pyspark - Access S3 data] Access S3 data from pyspark #spark #pyspark #s3
## Reference :- https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html
## Pyspark / Python:

## Step 1 : Generate a Hadoop AWS credential file (run this on a cluster node)
# hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/centos/awskeyfile.jceks -value AKI*****************
# hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/centos/awskeyfile.jceks -value kd8**********************************
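Before wiring the file into a Spark job, it can be worth confirming that both aliases actually landed in the credential store; the Hadoop credential CLI has a `list` subcommand for this (provider path assumed to match Step 1, runnable only on a node with the Hadoop client configured):

```shell
# List the aliases stored in the JCEKS file created above;
# fs.s3a.access.key and fs.s3a.secret.key should appear in the output.
hadoop credential list -provider jceks://hdfs/user/centos/awskeyfile.jceks
```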
## Step 2 : Simple pyspark program to access an S3 file (s3_access.py)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

if __name__ == '__main__':
    conf = SparkConf().setAppName('s3_access_py')
    sc = SparkContext(conf=conf)
    sql = SparkSession(sc)
    # Read the CSV directly from S3; the s3a connector resolves the
    # credentials from the JCEKS provider configured at submit time.
    csv_df = sql.read.csv('s3a://prms-s3/data/s1.csv')
    print("***********************************************************************")
    csv_df.show()
    print("***********************************************************************")
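A side note on the `s3a://` URI passed to `read.csv` above: it follows the usual bucket/key anatomy, which Python's standard-library `urlparse` decomposes without needing Spark at all (quick illustration only, not part of the gist's flow):

```python
from urllib.parse import urlparse

# The netloc component is the S3 bucket; the path is the object key.
uri = urlparse("s3a://prms-s3/data/s1.csv")
print(uri.netloc)  # prms-s3  (bucket)
print(uri.path)    # /data/s1.csv  (object key)
```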
## Step 3 : spark-submit and run the program
# spark-submit --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/centos/awskeyfile.jceks s3_access.py
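The same `--conf` flag also works for an interactive session, which is handy for debugging the S3 read before submitting the script; a sketch assuming the same JCEKS path as above:

```shell
# Interactive equivalent: start a pyspark shell with the same
# credential provider, then run the read.csv call from the REPL.
pyspark --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/centos/awskeyfile.jceks
```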