@HariSekhon
Created September 23, 2024 22:44
data-formats.md from HariSekhon/Knowledge-Base repo: https://github.com/HariSekhon/Knowledge-Base

Data Formats

Avro

See the Avro doc page.

Parquet

See the Parquet doc page.

Arrow

ORC

  • columnar
  • optimized for reads, at the cost of write overhead
  • lightweight indexing for skipping blocks of rows
  • basic stats embedded (min, max, sum, count)
  • no schema evolution yet

hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>

-d                    dumps data rather than metadata (Hive 0.15 / 1.1 onwards)
--rowindex <col_ids>  prints the row index for the given comma-separated column ids
-t                    timezone of the writer (Hive 1.2 onwards)
-j                    prints metadata as json (Hive 1.3 onwards)
-p                    pretty print json
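
The block skipping that the embedded stats make possible can be sketched roughly as follows. This is a toy model in plain Python, not ORC's actual file layout or reader: the `RowGroup` class, `make_group` and `scan_equal` are hypothetical names invented for the illustration, and only the min / max stats are modelled.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical block of rows stored together with its basic stats."""
    values: list
    minimum: int
    maximum: int


def make_group(values):
    # Stats are computed once at write time, which is part of the write overhead.
    return RowGroup(values=values, minimum=min(values), maximum=max(values))


def scan_equal(groups, target):
    """Return the matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        # Skip the whole block when the stats prove the target cannot be in it.
        if target < group.minimum or target > group.maximum:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


groups = [make_group([1, 5, 9]), make_group([20, 25, 30]), make_group([41, 45, 50])]
matches, groups_read = scan_equal(groups, 25)
print(matches, groups_read)  # [25] 1
```

Only one of the three blocks is read for this lookup, which is the read-side benefit the stats buy.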

CSV

chkcsv - checks the format of data in CSV files: https://pythonhosted.org/chkcsv/

csvkit - CLI tools for working with CSV files

csvgroup - SQL-like selects against CSV files

json2csv - CLI to convert from JSON to CSV
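
For a sense of what a JSON to CSV conversion involves, here is a minimal sketch using only the Python standard library. It is not how json2csv itself is implemented: the `json_to_csv` function is a hypothetical helper, and it assumes the input is a JSON array of flat objects.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects to CSV text (illustrative helper)."""
    records = json.loads(json_text)
    # Collect the union of keys in first-seen order to use as the header row.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty fields.
    writer.writerows(records)
    return out.getvalue()


csv_text = json_to_csv('[{"name": "a", "size": 1}, {"name": "b", "size": 2, "tag": "x"}]')
print(csv_text)
```

Nested objects would need flattening first, which this sketch does not attempt.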

JSON

See JSON doc page.

BSON

Binary JSON

CSON

CSON can refer to two different formats:

CoffeeScript Object Notation - used by Atom editor config files

Cursive Script Object Notation - schema-compressed JSON, can omit some syntax which is inferred:

  • strict superset of JSON
  • allows # line comments
  • allows trailing commas, and commas between elements can be omitted when they are separated by newlines

looks like there is a default cson module in Python 2.7

https://github.com/gt3389b/python-cson

pip install python-cson
python-cson $pytools/tests/data/test.json -f /dev/stdout

TOML

Tom's Obvious Minimal Language

Similar to ini format.

Has "table expansion" (nesting):

[a.b.c]
d = 'Hello'
e = 'World'
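
To see what the dotted table header expands to, the snippet above can be parsed with Python's `tomllib`. This is a small sketch assuming Python 3.11 or later, with a fallback to the third-party `tomli` backport on older versions.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the tomli backport is installed

doc = """
[a.b.c]
d = 'Hello'
e = 'World'
"""

# The dotted header [a.b.c] becomes three levels of nested tables (dicts).
data = tomllib.loads(doc)
print(data)  # {'a': {'b': {'c': {'d': 'Hello', 'e': 'World'}}}}
print(data["a"]["b"]["c"]["d"])  # Hello
```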

XML

XML Lint

Available from packages:

  • libxml2 (RPM)
  • libxml2-utils (Deb / Apk)
  • preinstalled on macOS

xmllint --format "$file.xml"

Or pipe XML in via standard input to validate it, and use --format to pretty print it:

xmllint --format - < "$file.xml"
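
Where xmllint is not installed, a similar well-formedness check and pretty print can be approximated with the Python standard library. This is only a sketch: `format_xml` is a hypothetical helper, it checks well-formedness only, and the output from `xml.dom.minidom` will not match xmllint's formatting exactly.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def format_xml(text):
    """Parse XML text and return an indented version (illustrative helper)."""
    try:
        dom = xml.dom.minidom.parseString(text)
    except ExpatError as exc:
        # Parsing fails on malformed input, which doubles as a basic validity check.
        raise ValueError(f"not well-formed XML: {exc}") from exc
    return dom.toprettyxml(indent="  ")


print(format_xml("<root><item>1</item></root>"))
```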

YAML

See YAML doc.

HBase vs Parquet vs Avro

  • HBase - suited to mutable data, but not to scans, e.g. count(*)
  • Avro (row-based) - suited to full scans that read all fields
  • Parquet (columnar) - suited to queries restricted to a subset of columns
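
The row-based versus columnar trade-off above can be illustrated with a toy in-memory model. This is plain Python and not the real Avro or Parquet encodings: `row_store` and `column_store` are hypothetical stand-ins for the two layouts, and "values touched" is a rough proxy for data read.

```python
records = [
    {"id": 1, "name": "a", "size": 10},
    {"id": 2, "name": "b", "size": 20},
    {"id": 3, "name": "c", "size": 30},
]

# Row-based layout (Avro-like): all fields of a record are stored together.
row_store = [tuple(r.values()) for r in records]

# Columnar layout (Parquet-like): all values of a field are stored together.
column_store = {key: [r[key] for r in records] for key in records[0]}

# Query: sum of "size". The columnar layout reads just that one column.
total_from_columns = sum(column_store["size"])
values_touched_columnar = len(column_store["size"])

# The row layout has to pass over every field of every record to get there.
total_from_rows = sum(row[2] for row in row_store)
values_touched_rows = sum(len(row) for row in row_store)

print(total_from_columns, values_touched_columnar)  # 60 3
print(total_from_rows, values_touched_rows)  # 60 9
```

For a query that needs every field of every record, both layouts touch all nine values, so the columnar layout's advantage disappears.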

Ported from private Knowledge Base page 2014+
