@samueltc
Created December 25, 2016 13:34
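from pyspark import SparkContext, SparkConf
from boto.s3.connection import S3Connection


def process(key):
    # Runs on the executors: map each boto Key to its object name.
    return key.name


if __name__ == '__main__':
    bucket_name = 'test-bucket'
    # Credentials come from the environment or the boto config file.
    conn = S3Connection()
    bucket = conn.get_bucket(bucket_name)
    keys = bucket.list()  # lazy iterator over every key in the bucket

    conf = SparkConf()
    sc = SparkContext(conf=conf)
    # parallelize materializes the listing on the driver before distributing it.
    # saveAsTextFile returns None, so the result is not assigned.
    sc.parallelize(keys) \
        .map(process) \
        .saveAsTextFile('list')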
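
If only the object names are needed, one option is to extract them on the driver and hand Spark plain strings. The sketch below is not part of the original script: it assumes the newer boto3 SDK instead of boto, and `iter_key_names` is a hypothetical helper. It relies only on the shape of `list_objects_v2` response pages, where each page may hold a `Contents` list of dicts with a `Key` entry, so it runs offline against hand-written pages.

```python
def iter_key_names(pages):
    """Yield object names from list_objects_v2-style response pages."""
    for page in pages:
        # A page for an empty listing carries no 'Contents' entry.
        for obj in page.get('Contents', []):
            yield obj['Key']


if __name__ == '__main__':
    # Hand-written pages standing in for a real paginator (assumption).
    fake_pages = [
        {'Contents': [{'Key': 'a.txt'}, {'Key': 'logs/b.txt'}]},
        {},
        {'Contents': [{'Key': 'c.txt'}]},
    ]
    print(list(iter_key_names(fake_pages)))
```

With real credentials, the pages could come from `boto3.client('s3').get_paginator('list_objects_v2').paginate(Bucket='test-bucket')`, and the resulting list of names could be passed to `sc.parallelize` in place of `keys`.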