Gist ottomata/bd40928a17f80356ee9c590c6d9ae8c3
Created September 11, 2017 19:29
// Files written with Spark 2.1
val input = "..."
val output = "hdfs://analytics-hadoop/tmp/otto/webrequest_sampled_1000_export_test"
val parquet = spark.read.parquet(input)

// Parquet
parquet.write.format("parquet").option("compression", "uncompressed").mode("overwrite").save(output + "/parquet")
parquet.write.format("parquet").option("compression", "snappy").mode("overwrite").save(output + "/parquet_snappy")
parquet.write.format("parquet").option("compression", "gzip").mode("overwrite").save(output + "/parquet_gzip")

// JSON
parquet.write.format("json").mode("overwrite").save(output + "/json")
parquet.write.format("json").option("compression", "snappy").mode("overwrite").save(output + "/json_snappy")
parquet.write.format("json").option("compression", "gzip").mode("overwrite").save(output + "/json_gzip")

// Avro
import com.databricks.spark.avro._ // https://search.maven.org/#artifactdetails%7Ccom.databricks%7Cspark-avro_2.11%7C3.2.0%7Cjar
spark.conf.set("spark.sql.avro.compression.codec", "uncompressed")
parquet.write.format("com.databricks.spark.avro").mode("overwrite").save(output + "/avro")
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
parquet.write.format("com.databricks.spark.avro").option("compression", "snappy").mode("overwrite").save(output + "/avro_snappy")
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "9")
parquet.write.format("com.databricks.spark.avro").mode("overwrite").save(output + "/avro_deflate")
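The Parquet and JSON writes above differ only in format, codec, and directory name, so they could be driven from a small table instead of being spelled out one by one. The following is a minimal sketch of that idea, not part of the original gist: `ExportJob`, `ExportPlan`, and `path` are hypothetical names introduced here. The Avro writes are left out because they set their codec through `spark.conf` rather than through a writer option.

```scala
// Hypothetical helper, not part of the original gist: one row per export.
final case class ExportJob(format: String, codec: Option[String], dir: String)

object ExportPlan {
  // Formats, codecs, and directory names mirror the Parquet and JSON writes above.
  // A codec of None means no "compression" option is passed to the writer.
  val jobs: Seq[ExportJob] = Seq(
    ExportJob("parquet", Some("uncompressed"), "parquet"),
    ExportJob("parquet", Some("snappy"), "parquet_snappy"),
    ExportJob("parquet", Some("gzip"), "parquet_gzip"),
    ExportJob("json", None, "json"),
    ExportJob("json", Some("snappy"), "json_snappy"),
    ExportJob("json", Some("gzip"), "json_gzip")
  )

  // Full output path for a job under a base directory.
  def path(base: String, job: ExportJob): String = s"$base/${job.dir}"

  // In spark-shell, with `parquet` and `output` defined as above, each job
  // could then be applied roughly like this:
  //   jobs.foreach { job =>
  //     val writer = parquet.write.format(job.format).mode("overwrite")
  //     job.codec.fold(writer)(c => writer.option("compression", c))
  //       .save(path(output, job))
  //   }
}
```

Keeping the plan as plain data makes it easy to check the list of outputs without a Spark session, and the Spark-specific part stays a single loop.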
Sizes of one hour of webrequest text data, sampled 1/1000, exported in different formats with different compression codecs.

72M   parquet (uncompressed)
16M   parquet_gzip
26M   parquet_snappy
234M  json (uncompressed)
29M   json_gzip
58M   json_snappy
152M  avro (uncompressed)
32M   avro_deflate (level 9)
46M   avro_snappy
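To compare the codecs more directly, each size can be expressed as a fraction of the uncompressed output in the same format. The sketch below does only that arithmetic on the numbers listed above. `CompressionRatios` and `ratioToUncompressed` are hypothetical names introduced for illustration, and the sizes are treated as whole megabytes, which is an assumption about the rounding of the listing.

```scala
object CompressionRatios {
  // Sizes in MB, copied from the listing above.
  val sizesMb: Seq[(String, Int)] = Seq(
    "parquet" -> 72, "parquet_gzip" -> 16, "parquet_snappy" -> 26,
    "json" -> 234, "json_gzip" -> 29, "json_snappy" -> 58,
    "avro" -> 152, "avro_deflate" -> 32, "avro_snappy" -> 46
  )

  private val byName: Map[String, Int] = sizesMb.toMap

  // The format is the part of the directory name before the first underscore,
  // e.g. "json_gzip" -> "json"; the bare format name is the uncompressed output.
  def formatOf(name: String): String = name.takeWhile(_ != '_')

  // Size of an output relative to the uncompressed output of the same format.
  def ratioToUncompressed(name: String, sizeMb: Int): Double =
    sizeMb.toDouble / byName(formatOf(name))

  def main(args: Array[String]): Unit = {
    // Print outputs from smallest to largest with their relative size.
    sizesMb.sortBy(_._2).foreach { case (name, mb) =>
      val percent = ratioToUncompressed(name, mb) * 100
      println(f"$name%-16s $mb%4d MB  $percent%5.1f%% of uncompressed ${formatOf(name)}")
    }
  }
}
```

On these numbers, gzip-compressed Parquet is the smallest output in absolute terms (16M), while gzip shrinks JSON the most relative to its own uncompressed size (29M of 234M).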