from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")
    # saveAsHadoopFile requires an RDD of key/value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])
    # Write gzip-compressed text output via the old (mapred) Hadoop API;
    # TextOutputFormat emits one "key<TAB>value" line per record.
    data.saveAsHadoopFile(
        "/tmp/spark_compressed",
        "org.apache.hadoop.mapred.TextOutputFormat",
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
    )
    sc.stop()


if __name__ == "__main__":
    main()
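One quick way to verify the output: `sc.textFile` reads the gzipped part files back transparently, since Hadoop infers the codec from the `.gz` extension. A minimal sketch, assuming the output path from the gist above:

```python
from pyspark import SparkContext

sc = SparkContext(appName="Read Compressed Output")

# No codec argument needed on read; Hadoop picks GzipCodec from the
# part files' .gz extension.
read_back = sc.textFile("/tmp/spark_compressed")

# TextOutputFormat wrote "key<TAB>value" lines, so expect e.g.
# ['key1\tvalue1', 'key2\tvalue2', 'key3\tvalue3']
print(read_back.collect())

sc.stop()
```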
Looks like this will work for me: saveAsTextFile(String path, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
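In PySpark, the equivalent is the `compressionCodecClass` keyword argument on `saveAsTextFile`, which works on a plain (non-pair) RDD. A minimal sketch; the output path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext(appName="Save Text Compressed")

# A plain RDD of strings -- no key/value structure required.
rdd = sc.parallelize(["line1", "line2", "line3"])

# PySpark takes the codec as a fully-qualified class name string
# rather than a Class object as in the Scala/Java API.
rdd.saveAsTextFile(
    "/tmp/spark_text_gzip",  # placeholder path
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

sc.stop()
```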
The parameter types of saveAsHadoopFile require the RDD to be a pair RDD, and you explicitly made data a set of key-value pairs. Is it possible to compress Spark output that isn't in key-value form? My research suggests it isn't, short of writing your own save method; the Spark API doesn't seem to support it, which seems strange.
Jayson, you can use

rdd.map(lambda line: (line, ""))

to turn it into a pair RDD.
Jayson,
Building on what gshen commented, you might be able to use:
rdd.map(lambda line: (line, None))
before calling saveAsHadoopFile(...). It's not obvious from the documentation, but it looks like None in Python gets mapped to NullWritable when saveAsHadoopFile creates the underlying TextOutputFormat<K,V>. This causes the TextOutputFormat to effectively skip writing the value, leaving just the key text -- no extra whitespace tacked onto the end. You might want to try it and see if it works for you.
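Putting the two comments together, a minimal sketch, assuming the `None`-to-`NullWritable` mapping described above holds on your Spark version (the output path is a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext(appName="Compress Non-Pair RDD")

lines = sc.parallelize(["line1", "line2", "line3"])

# Pair each line with None; if PySpark maps None to NullWritable,
# TextOutputFormat writes only the key text, with no trailing separator.
pairs = lines.map(lambda line: (line, None))

pairs.saveAsHadoopFile(
    "/tmp/spark_lines_gzip",  # placeholder path
    "org.apache.hadoop.mapred.TextOutputFormat",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

sc.stop()
```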
You can use any of the Hadoop-supported compression codecs, for example:

- org.apache.hadoop.io.compress.GzipCodec
- org.apache.hadoop.io.compress.BZip2Codec
- com.hadoop.compression.lzo.LzopCodec
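Switching codecs is just a matter of changing the `compressionCodecClass` string, as long as the codec class is on the executors' classpath. A hedged sketch reusing the `data` pair RDD from the gist above; note that `LzopCodec` comes from the separate hadoop-lzo package and needs its native library installed:

```python
# Same call as the gist, different codec; the output path is a placeholder.
data.saveAsHadoopFile(
    "/tmp/spark_compressed_bz2",
    "org.apache.hadoop.mapred.TextOutputFormat",
    compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec",
)
```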