See the Avro doc page.
See the Parquet doc page.
- in-memory data format
- columnar
- vectorized operations - SIMD (Single Instruction, Multiple Data)
- zero-copy reads, no serialization
- language support:
- de-facto in-memory analytics format, supported by:
- columnar
- optimized for reads, at the cost of write overhead
- lightweight indexing for skipping blocks of rows
- basic stats embedded (min, max, sum, count)
- no schema evolution yet
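The embedded stats are what make skipping blocks of rows possible. A minimal pure-Python sketch of the idea follows; it is not ORC's actual on-disk layout or API, and the block size and the names `Block`, `build_blocks` and `scan_greater_than` are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A block of rows for one column, with stats computed at write time."""
    values: list
    minimum: int
    maximum: int
    total: int
    count: int


def build_blocks(values, block_size=4):
    """Split a column into blocks and embed basic stats (the write overhead)."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk), sum(chunk), len(chunk)))
    return blocks


def scan_greater_than(blocks, threshold):
    """Return matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.maximum <= threshold:
            continue  # stats prove no row can match, so skip the whole block
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


blocks = build_blocks([1, 2, 3, 4, 10, 11, 12, 13, 5, 6, 7, 8])
matches, blocks_read = scan_greater_than(blocks, 9)
print(matches, blocks_read)  # [10, 11, 12, 13] 1
```

Only one of the three blocks is read here, because the max stat rules out the other two without touching their rows.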
-d                 dumps data rather than metadata (Hive 0.15 / 1.1 onwards)
--rowindex <cols>
-t                 timezone of the writer (Hive 1.2 onwards)
-j                 prints metadata as json (Hive 1.3 onwards)
-p                 pretty print json
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
https://pythonhosted.org/chkcsv/
csvkit - CLI tools for working with CSV files
csvgroup - SQL-like selects against CSV files
json2csv - CLI to convert from JSON to CSV:
json2csv
See JSON doc page.
Binary JSON
CoffeeScript Object Notation
Cursive Script Object Notation - used by Atom editor config files
schema-compressed JSON - can omit some syntax which is inferred
- strict superset of JSON
- allows # line comments
- trailing commas or missing commas between elements if separated by newlines
looks like there is a default cson module in Python 2.7
https://github.com/gt3389b/python-cson
pip install python-cson
python-cson $pytools/tests/data/test.json -f /dev/stdout
Tom's Obvious Minimal Language
Similar to ini format.
Has "table expansion" (nesting):
[a.b.c]
d = 'Hello'
e = 'World'
Available from packages:
xmllint --format "$file.xml"
Or pipe XML in via standard input to validate it and use --format
to pretty print it:
xmllint --format - < "$file.xml"
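Where xmllint isn't installed, a rough Python standard-library stand-in is sketched below. It is not what xmllint does internally: it only checks well-formedness (no DTD or schema validation), and the helper name `format_xml` is made up for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def format_xml(text):
    """Return pretty-printed XML, or raise ValueError if it is not well-formed."""
    try:
        dom = xml.dom.minidom.parseString(text)
    except ExpatError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    return dom.toprettyxml(indent="  ")


sample = "<a><b>1</b></a>"
pretty = format_xml(sample)
print(pretty)
```

To mimic the stdin form, pass `sys.stdin.read()` to `format_xml` instead of the sample string.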
See YAML doc.
- HBase - mutable data, but not suited to scans, e.g. count(*)
- Avro (row-based) - full scans of all fields
- Parquet (columnar) - queries restricted to a subset of columns
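The Avro vs Parquet trade-off comes down to data layout. A toy pure-Python sketch follows; it is not either format's real encoding, and the records and field names are invented for illustration:

```python
# the same three records in two layouts (sample data is made up)
rows = [
    {"id": 1, "name": "alice", "amount": 10.0},
    {"id": 2, "name": "bob", "amount": 20.0},
    {"id": 3, "name": "carol", "amount": 30.0},
]

# columnar layout: one contiguous list per field
columns = {key: [row[key] for row in rows] for key in rows[0]}

# row-based: a full scan naturally yields every field of each record
full_scan = [tuple(row.values()) for row in rows]

# columnar: a query on one column touches only that column's values
total_amount = sum(columns["amount"])

print(full_scan)
print(total_amount)  # 60.0
```

In the row layout every record is read whole even if only `amount` is needed, while the columnar layout never touches `id` or `name` for that query.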
Ported from private Knowledge Base page 2014+