@VeylanSolmira
Forked from msukmanowsky/spark_gzip.py
Last active August 29, 2015 14:18
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")

    # saveAsHadoopFile requires an RDD of (key, value) pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])

    # Write the pairs as text, gzip-compressing each output part file
    data.saveAsHadoopFile("/tmp/spark_compressed",
                          "org.apache.hadoop.mapred.TextOutputFormat",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
    sc.stop()


if __name__ == "__main__":
    main()
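
To check that the output really was gzip-compressed, the part files can be read back without Spark. The sketch below is not part of the original gist. It assumes the job wrote to the local filesystem, that the part files are named `part-*.gz`, and that each line is a key and a value separated by a tab (the default for `TextOutputFormat`). The helper name `read_compressed_pairs` is hypothetical.

```python
import glob
import gzip
import os


def read_compressed_pairs(output_dir):
    """Return (key, value) tuples from every gzip part file in output_dir.

    Hypothetical helper: assumes local files named part-*.gz that hold
    tab-separated key/value lines.
    """
    pairs = []
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*.gz"))):
        # gzip.open raises an error on read if the file is not valid gzip
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                # Split on the first tab only, so values may contain tabs
                key, _, value = line.partition("\t")
                pairs.append((key, value))
    return pairs
```

After running the job, `read_compressed_pairs("/tmp/spark_compressed")` should return the three pairs written above if those assumptions hold. The order depends on how the RDD was partitioned.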