@ottomata
Created September 11, 2017 19:29
// Files written with Spark 2.1
val input = "..."
val output = "hdfs://analytics-hadoop/tmp/otto/webrequest_sampled_1000_export_test"
val parquet = spark.read.parquet(input)
// Parquet
parquet.write.format("parquet").option("compression", "uncompressed").mode("overwrite").save(output + "/parquet")
parquet.write.format("parquet").option("compression", "snappy").mode("overwrite").save(output + "/parquet_snappy")
parquet.write.format("parquet").option("compression", "gzip").mode("overwrite").save(output + "/parquet_gzip")
// JSON
parquet.write.format("json").mode("overwrite").save(output + "/json")
parquet.write.format("json").option("compression", "snappy").mode("overwrite").save(output + "/json_snappy")
parquet.write.format("json").option("compression", "gzip").mode("overwrite").save(output + "/json_gzip")
// Avro
// spark-avro 3.2.0 for Scala 2.11: https://search.maven.org/#artifactdetails%7Ccom.databricks%7Cspark-avro_2.11%7C3.2.0%7Cjar
import com.databricks.spark.avro._
// spark-avro reads its compression codec from the session configuration, so set it before each write.
spark.conf.set("spark.sql.avro.compression.codec", "uncompressed")
parquet.write.format("com.databricks.spark.avro").mode("overwrite").save(output + "/avro")
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
parquet.write.format("com.databricks.spark.avro").mode("overwrite").save(output + "/avro_snappy")
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
// Deflate level 9 is the highest (slowest, smallest) setting.
spark.conf.set("spark.sql.avro.deflate.level", "9")
parquet.write.format("com.databricks.spark.avro").mode("overwrite").save(output + "/avro_deflate")
Sizes of one hour of webrequest text data, sampled at 1/1000, exported in different formats with different compression codecs:
72M parquet
16M parquet_gzip
26M parquet_snappy
234M json
29M json_gzip
58M json_snappy
152M avro
32M avro_deflate (level 9)
46M avro_snappy
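
To make the sizes above easier to compare, here is a minimal sketch that expresses each output as a fraction of the uncompressed JSON export. The object name `SizeRatios` and the choice of JSON as the baseline are assumptions made for illustration; the numbers are the megabyte figures listed above.

```scala
object SizeRatios {
  // Sizes in MB, copied from the listing above.
  val sizesMb: Seq[(String, Int)] = Seq(
    "parquet" -> 72, "parquet_gzip" -> 16, "parquet_snappy" -> 26,
    "json" -> 234, "json_gzip" -> 29, "json_snappy" -> 58,
    "avro" -> 152, "avro_deflate" -> 32, "avro_snappy" -> 46
  )

  // Baseline: uncompressed JSON (an assumption of this sketch, not part of the original test).
  val baselineMb: Double = sizesMb.toMap.apply("json").toDouble

  // Each output's size as a fraction of the baseline, smallest first.
  val relative: Seq[(String, Double)] =
    sizesMb.map { case (name, mb) => name -> mb / baselineMb }.sortBy(_._2)

  def main(args: Array[String]): Unit =
    relative.foreach { case (name, frac) =>
      println(f"$name%-15s ${frac * 100}%5.1f%% of uncompressed JSON")
    }
}
```

Running `SizeRatios.main(Array.empty)` prints one line per output directory, ordered from smallest to largest. On these figures, Parquet with gzip is the smallest at roughly 7% of the uncompressed JSON size.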