- Stores schema information along with the data
- Columnar storage/file format
- "reference file format on Hadoop HDFS"
- "read-optimized view of data"
- excellent for storing data as files on HDFS (instead of in external databases)
- writing very large datasets to disk
- supports schemas and schema evolution
- faster to read than JSON (even gzip-compressed JSON)
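A minimal round-trip sketch with Spark in Scala (the path and the local master are illustrative assumptions), showing that the schema travels with the files so none has to be supplied on read:

```scala
import org.apache.spark.sql.SparkSession

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-roundtrip")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Write a small dataset; the schema is stored alongside the data.
    val people = Seq((1L, "alice", 30), (2L, "bob", 25))
      .toDF("id", "name", "age")
    people.write.mode("overwrite").parquet("/tmp/people.parquet")

    // No schema needs to be supplied on read -- it lives in the file footer.
    spark.read.parquet("/tmp/people.parquet").printSchema()

    spark.stop()
  }
}
```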
- How the data is laid out on disk: a row-oriented format writes it trivially row by row, while Parquet writes it column by column. E.g. for rows `(1, alice, 30)` and `(2, bob, 25)`, a row layout stores `1, alice, 30, 2, bob, 25`, whereas the columnar layout stores `1, 2, alice, bob, 30, 25`
- Row format is good for reading an entire row (using an index to access the first/leading column)
- But queries usually don't need entire rows, only a subset of columns (for calculations, distributions, aggregations, rankings)
- Columnar format places all the values of a single column first, followed by the values of the following column, and so on
- Index of where a column starts (and ends)
- Read only the columns (and their values) you need
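Column pruning in a short sketch (reusing the `spark` session and path from the sketch above; both are assumptions):

```scala
// Only the "name" column chunks are read from disk; "id" and "age" are
// skipped. The pruned schema shows up as ReadSchema in the physical plan.
val names = spark.read.parquet("/tmp/people.parquet").select("name")
names.explain()
```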
- Better compression of same "shape" data (described by a schema -- a type)
- repetitions
- similarities
- ranges
- type-specific encodings => helps general-purpose codecs like gzip and LZ-family compressors
- compact representation (e.g. dictionary encoding can replace strings with smaller, more optimizable numbers)
- reduce storage cost
- reduce IO for queries
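A sketch of choosing the column-wise compression codec at write time (reusing the `people` DataFrame from the first sketch; the gzip choice and path are assumptions):

```scala
// snappy is Spark's default for Parquet; gzip trades CPU for smaller files.
// Dictionary encoding (strings -> small integer ids) is on by default in
// the Parquet writer and composes with the codec chosen here.
people.write
  .mode("overwrite")
  .option("compression", "gzip")
  .parquet("/tmp/people_gzip.parquet")
```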
- Uses thrift to represent internal data structures
- Nested data structures (maps, lists, etc.) -- see the round-trip sketch below
- Standardized storage format available for any framework (via object model converters)
- e.g. Impala, Pig, Cascading, Hive, Spark, Protobuf, Avro, Scalding
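A small sketch of nested data (a map and a list per row) round-tripping through Parquet (the dataset and path are made up for illustration; reuses the `spark` session and implicits from the first sketch):

```scala
val nested = Seq(
  (1L, Map("lang" -> "scala"), Seq(10, 20)),
  (2L, Map("lang" -> "python"), Seq(30))
).toDF("id", "attrs", "scores")

nested.write.mode("overwrite").parquet("/tmp/nested.parquet")

// The nested schema (a map column and an array column) is read back intact.
spark.read.parquet("/tmp/nested.parquet").printSchema()
```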
- Object model converters
- Column Readers
- SAX-like conversion model
- No framework preference
- No language preference
- Built-in schema evolution
- Skips unnecessary deserialization
  - deserialization is expensive
  - comparisons (for filtering) can happen directly at the byte level
- Support for vectorized operations
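Sketches of these three features in Spark (the option and config names are real Spark APIs; the paths and the filter are assumptions, and the snippets reuse the session and implicits from the first sketch):

```scala
// Schema evolution: merge per-file footers written with different but
// compatible schemas into a single read-time schema.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/people.parquet")

// Byte-level filtering: the predicate is pushed into the Parquet reader,
// which uses per-chunk statistics to skip data without deserializing it.
merged.filter($"age" > 26).explain()   // look for PushedFilters in the plan

// Vectorized (batch-at-a-time) decoding is controlled by a Spark config:
println(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))
```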
- Joint effort of Twitter and Cloudera (Impala)
- Saves only the "leaves" of the model tree (the primitive values); the nesting structure is reconstructed from levels metadata
- Schema always available (and foundational for any optimizations like null removal)
- The number of columns to read impacts performance
  - the more columns read, the slower the scan
- Parquet was created when Avro and some other formats were already available
- 100- or 1000-node HDFS cluster with PBs of data
- How to lay out and compress data to save space (and store more) and to lower access time (and transfer less data)
- Less disk and network IOs
- How to access data efficiently
- Row group
- Column chunk
- Page
  - header
  - data (encoded and compressed)
- Metadata
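A sketch that walks this structure with the parquet-mr Java API from Scala (the file path is hypothetical and the method names are from `parquet-hadoop`; versions differ, so treat this as an outline):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.jdk.CollectionConverters._

// Open one data file and read its footer (the metadata at the end).
val input = HadoopInputFile.fromPath(
  new Path("/tmp/people.parquet/part-00000.parquet"),  // hypothetical file
  new Configuration())
val reader = ParquetFileReader.open(input)
try {
  // Blocks are row groups; each holds one chunk per column.
  reader.getFooter.getBlocks.asScala.zipWithIndex.foreach { case (rg, i) =>
    println(s"row group $i: ${rg.getRowCount} rows, ${rg.getTotalByteSize} bytes")
    rg.getColumns.asScala.foreach { col =>
      println(s"  ${col.getPath}: codec=${col.getCodec}, encodings=${col.getEncodings}")
    }
  }
} finally reader.close()
```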
- ORC is a columnar file format used mainly in Hive (Stinger)
- "Lots of tentacles of Hive in ORC format" so it's not as independent and standalone as Parquet
- parquet.io or parquet.apache.org
- @ApacheParquet on twitter
- #parquetformat on twitter
- (video) Parquet: Open-source columnar format for Hadoop (1 of 6)
- (video) Parquet: Open-source columnar format for Hadoop (2 of 6)
- (video) Parquet: Open-source columnar format for Hadoop (3 of 6)
- The remaining videos in the series are not really worth your time. Skip them as there's simply too much blahblahblah.