Created
November 14, 2014 01:32
Example of how to save Spark RDDs to disk using GZip compression in response to https://twitter.com/rjurney/status/533061960128929793.
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")

    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])
    data.saveAsHadoopFile("/tmp/spark_compressed",
                          "org.apache.hadoop.mapred.TextOutputFormat",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
    sc.stop()


if __name__ == "__main__":
    main()
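The job above writes a directory of gzipped part files (e.g. /tmp/spark_compressed/part-00000.gz), one "key<TAB>value" line per record. As a quick sanity check that needs no Spark cluster, here is a plain-Python sketch of that layout round-tripped through the standard-library gzip module; the pairs list and file path are stand-ins, not Spark output:

```python
import gzip
import os
import tempfile

# Stand-in for one Spark output partition; a real job writes files
# like /tmp/spark_compressed/part-00000.gz.
pairs = [("key1", "value1"), ("key2", "value2"), ("key3", "value3")]
path = os.path.join(tempfile.mkdtemp(), "part-00000.gz")

# TextOutputFormat emits one "key<TAB>value" line per record.
with gzip.open(path, "wt") as f:
    for key, value in pairs:
        f.write(f"{key}\t{value}\n")

# sc.textFile() decompresses .gz input transparently; here we read it
# back with gzip directly to show the decoded lines.
with gzip.open(path, "rt") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)  # ['key1\tvalue1', 'key2\tvalue2', 'key3\tvalue3']
```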
Jayson,
Building on what gshen commented, you might be able to use:
before calling saveAsHadoopFile(...). It's not obvious from the documentation, but it looks like None in Python gets mapped to NullWritable when saveAsHadoopFile creates the underlying TextOutputFormat<K,V>. That causes TextOutputFormat to skip writing the value entirely, leaving just the key text with no trailing tab or whitespace. You might want to try it and see if it works for you.
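To illustrate the difference, here is a plain-Python sketch of how TextOutputFormat lays out a record depending on whether the value is null; format_record is a hypothetical helper for demonstration, not Spark or Hadoop code:

```python
def format_record(key, value):
    # Hypothetical illustration of TextOutputFormat's line layout,
    # not an actual Spark or Hadoop API.
    if value is not None:
        # A real value produces "key<TAB>value".
        return f"{key}\t{value}"
    # A null value (Python None -> NullWritable) produces the key
    # alone, with no trailing tab.
    return key

print(format_record("key1", "value1"))  # key1	value1
print(format_record("key1", None))      # key1
```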