Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between steps of an application, Parquet can provide more advanced capabilities:
- Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON only supports strings, numbers, booleans and null, plus arrays and objects of these).
- Columnar storage - more efficient when only some of the columns are used or when filtering the data, since the unused columns do not have to be read.
- Partitioning - a dataset is written as multiple part files out of the box, so it can be read and written in parallel.
- Compression - pages can be compressed with Snappy or Gzip. Because compression is applied to pages inside each file rather than to the file as a whole, this preserves the partitioning.
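
To make the columnar storage and compression points more concrete, here is a minimal toy sketch in plain Python. It is not Parquet's actual file format: the sample rows, the field names and the helper functions are all hypothetical, and `zlib` stands in for Gzip. It only illustrates why reading one column from a column-oriented layout touches far fewer bytes than scanning a row-oriented text file.

```python
import json
import zlib

# Hypothetical sample data; the field names are made up for illustration.
rows = [
    {"id": i, "country": "FR" if i % 2 else "US", "amount": i * 1.5}
    for i in range(1000)
]


def to_row_layout(rows):
    """Row-oriented layout: one JSON line per record, as in a text file."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


def to_column_layout(rows):
    """Column-oriented layout: each column is serialized and compressed on its own."""
    columns = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        columns[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return columns


def read_column(columns, name):
    """Decode a single column without touching the bytes of the other columns."""
    return json.loads(zlib.decompress(columns[name]).decode("utf-8"))


row_bytes = to_row_layout(rows)
col_chunks = to_column_layout(rows)

# Summing one column: the columnar layout only needs that column's chunk,
# while the row layout has to scan every record in full.
amounts = read_column(col_chunks, "amount")
bytes_read_columnar = len(col_chunks["amount"])
bytes_read_row = len(row_bytes)

print(sum(amounts), bytes_read_columnar, bytes_read_row)
```

Each column is compressed independently here, which loosely mirrors the idea that compression happens on small units inside the file, so one part of the data can be decoded without the rest.
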
The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge, 16 vCPU and 30 GB each).