Roy Firestein (@r0yfire)
r0yfire / gist:d82f4f0a1b604db3b05e8f9e346a6459
Created March 29, 2017 19:18
Massively parallel copy of an S3 bucket using PySpark.
from operator import add
import concurrent
from concurrent.futures import ThreadPoolExecutor
from boto.s3.connection import S3Connection
from pyspark import SparkContext


def computeTargets(bucketName, prefix=""):
    # Enumerate every key under the prefix; this list becomes the
    # set of work items that Spark distributes across the cluster.
    s3 = S3Connection()
    bucket = s3.get_bucket(bucketName)
    return [key.name for key in bucket.list(prefix=prefix)]
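
The excerpt ends before the copy stage, so what follows is a minimal sketch of how the key list could drive the parallel copy. The copyPartition helper, the bucket names source-bucket and dest-bucket, and the worker/partition counts are all assumptions not present in the original; only computeTargets and the imports above come from the gist.

def copyPartition(keys, srcBucket="source-bucket", dstBucket="dest-bucket"):
    # Hypothetical helper: copy one Spark partition's keys. A thread
    # pool lets each Spark task overlap many S3 copy requests, since
    # server-side copies are I/O-bound.
    s3 = S3Connection()
    dest = s3.get_bucket(dstBucket)
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(dest.copy_key, key, srcBucket, key)
                   for key in keys]
        concurrent.futures.wait(futures)
    return len(keys)


if __name__ == "__main__":
    sc = SparkContext(appName="s3-bucket-copy")
    keys = computeTargets("source-bucket")
    # Fan the key list out across the cluster; each partition copies
    # its keys concurrently, and reduce(add) totals the copied count.
    copied = (sc.parallelize(keys, 256)
                .mapPartitions(lambda part: [copyPartition(list(part))])
                .reduce(add))
    print("Copied %d keys" % copied)

The "massively parallel" claim in the title follows from stacking the two levels of concurrency: Spark spreads partitions across executor processes, and each task's thread pool multiplies that by its worker count.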