Parquet

Introduction

  • Stores schema information along with the data
  • Columnar storage/file format
    • "reference file format on Hadoop HDFS"
    • "read-optimized view of data"
  • Excellent for file storage on HDFS (instead of external databases)
  • Suited to writing very large datasets to disk
  • Supports schemas and schema evolution
  • Faster than JSON/gzip
  • FIXME Show how the data is laid out on disk -- a row format stores data trivially row by row, Parquet stores it column by column
    • A row format is good for reading an entire row (using an index to access the first/leading column)
    • But queries usually don't need entire rows, only a subset of columns (for calculations, distributions, aggregations, rankings) -- see the column-pruning sketch after this list
  • A columnar format places all the values of a single column first, followed by the values of the next column, and so on
    • Index of where a column starts (and ends)
    • Read only columns (and their values) you need
    • Better compression of same "shape" data (described by a schema -- a type)
      • repetitions
      • similarities
      • ranges
      • type-specific encodings => help downstream codecs like gzip and LZ-based compressors
      • compact representation (strings can be replaced with numbers, which compress and compare better)
      • reduce storage cost
      • reduce IO for queries
  • Uses Thrift to represent internal data structures
  • Nested data structures (e.g. maps, lists, etc.)
  • Standardized storage format available for any framework (via object model converters)
    • e.g. Impala, Pig, Cascading, Hive, Spark, Protobuf, Avro, Scalding
    • Object model converters
    • Column Readers
    • SAX-like conversion model
    • No framework preference
    • No language preference
  • Built-in schema evolution (see the mergeSchema sketch after this list)
  • Skips unnecessary deserialization
    • deserialization is expensive
    • comparisons (for filtering) happen at the byte level
  • Support for vectorized operations
  • Joint effort of Twitter and Cloudera (Impala)
  • Saves only the "leaves" of the model tree (nested structure is reconstructed from repetition/definition levels)
  • Schema always available (and foundational for any optimizations like null removal)
  • The number of columns to read impacts performance
    • the more columns, the worse
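
A minimal Spark/Scala sketch of the column pruning described above (the path, schema, and values are made up for illustration): write a three-column dataset to Parquet, then read back a single column. Spark pushes the selection down to the Parquet reader, so only that column's chunks are scanned.

```scala
import org.apache.spark.sql.SparkSession

object ParquetPruningDemo extends App {
  val spark = SparkSession.builder()
    .appName("parquet-pruning-demo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // A tiny dataset with three columns (hypothetical schema and path).
  val people = Seq((1L, "Agata", 40), (2L, "Jacek", 41)).toDF("id", "name", "age")
  people.write.mode("overwrite").parquet("/tmp/people.parquet")

  // Select a single column: the columnar layout means only the
  // 'name' column chunks are read from disk, not whole rows.
  // (Spark's vectorized Parquet reader handles the scan by default.)
  val names = spark.read.parquet("/tmp/people.parquet").select("name")
  names.explain() // ReadSchema in the plan lists only 'name'
  names.show()

  spark.stop()
}
```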
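
And a hedged sketch of the schema evolution mentioned in this list, using Spark's mergeSchema read option (paths and columns are illustrative): two files written with different but compatible schemas are reconciled into one superset schema on read.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSchemaEvolutionDemo extends App {
  val spark = SparkSession.builder()
    .appName("parquet-schema-evolution")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Two batches with different (but compatible) schemas -- hypothetical paths.
  Seq((1L, "Agata")).toDF("id", "name")
    .write.mode("overwrite").parquet("/tmp/evolving/batch=1")
  Seq((2L, "Jacek", 41)).toDF("id", "name", "age")
    .write.mode("overwrite").parquet("/tmp/evolving/batch=2")

  // mergeSchema reconciles the file footers into a superset schema;
  // rows written without 'age' come back as null.
  val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolving")
  merged.printSchema() // id, name, age (plus the discovered 'batch' partition column)
  merged.show()

  spark.stop()
}
```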

Motivation (to have another storage format)

...when Avro and some others were already available

  • 100- or 1000-node HDFS cluster with PBs of data
  • How to lay out and compress data to save space (and store more) and to lower access time (and transfer less data)
  • Fewer disk and network IOs
  • How to access data efficiently

Tools

Design

  • Row group -- a horizontal slice of the rows
  • Column chunk -- the data of one column within a row group
  • Page -- the unit of encoding and compression inside a column chunk
    • header
    • data, encoded and compressed
  • Metadata -- the file footer with the schema and the row group/column chunk locations
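
A hedged sketch of how these layout units surface when writing from Spark: in parquet-mr, parquet.block.size controls the row group size and parquet.page.size the page size, and one way to pass them through is the Hadoop configuration (the sizes below are arbitrary; the defaults are 128 MB and 1 MB).

```scala
import org.apache.spark.sql.SparkSession

object ParquetLayoutKnobsDemo extends App {
  val spark = SparkSession.builder()
    .appName("parquet-layout-knobs")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // parquet-mr settings: row group ("block") and page sizes, in bytes.
  spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
  spark.sparkContext.hadoopConfiguration.setInt("parquet.page.size", 512 * 1024)

  // Write enough rows that the sizes matter (hypothetical path).
  (1 to 1000000).toDF("n").write.mode("overwrite").parquet("/tmp/layout-demo.parquet")

  spark.stop()
}
```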

Vs ORC

  1. ORC is a columnar file format used mainly in Hive (the Stinger initiative)
  2. "Lots of tentacles of Hive in the ORC format", so it is not as independent and standalone as Parquet

Trivia

Resources

  1. (video) Parquet: Open-source columnar format for Hadoop (1 of 6)
  2. (video) Parquet: Open-source columnar format for Hadoop (2 of 6)
  3. (video) Parquet: Open-source columnar format for Hadoop (3 of 6)
  4. The remaining videos in the series are not really worth your time. Skip them as there's simply too much blahblahblah.