(This document focuses mostly on data storage in scientific applications, not on web protocols.)
- XML:
- slow to parse
- schemas (.xsd) are human-readable but hard to edit without special software
- tooling for generating reading/writing code is limited (mostly to Java)
- not suited for binary data
- more on XML: http://c2.com/cgi/wiki?XmlSucks
- JSON:
- much simpler than XML, but also more limited
- shares most of XML's disadvantages
- web-friendly
- HDF5:
- designed for storing groups of huge N-dimensional arrays
- insanely complex format specification
- "The reference implementation of the HDF5 File Format and I/O Library (http://hdf.ncsa.uiuc.edu/HDF5/) consists of approximately 2073 files or about 917,000 lines of the source code."
- there's no full implementation beyond the reference one
- includes chunking and compression of arrays
- needs parameter tweaking to get good performance (see the h5py sketch after this list)
- NASA uses it to store Earth observation data
- no built-in indexing support
- PyTables added indices on top of HDF5 (http://www.blosc.org/docs/OPSI-indexes.pdf), but using them limits file access to Python only
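To make the chunking/compression point concrete, here is a minimal h5py sketch (the file and dataset names are invented for illustration; assumes numpy and h5py are installed). The chunk shape and compression level are exactly the kind of parameters that need tweaking:

```python
import numpy as np
import h5py

data = np.random.rand(1000, 1000)

with h5py.File("arrays.h5", "w") as f:
    # chunks= and compression= are per-dataset creation parameters;
    # picking them well is the "parameter tweaking" mentioned above
    f.create_dataset("grid", data=data,
                     chunks=(100, 100), compression="gzip",
                     compression_opts=4)

with h5py.File("arrays.h5", "r") as f:
    block = f["grid"][:100, :100]  # decompresses only the chunks it overlaps
```

Reading a slice touches only the chunks that overlap it, which is what makes chunking pay off for large arrays.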
- SQLite:
- flexible single-file SQL database, perfect for storing tables
- very widely used
- built-in indexing (including multi-dimensional)
- the authors suggest using it as an application file format (see the sketch after this list): https://www.sqlite.org/appfileformat.html
- no compression, although it's available as a commercial add-on: http://www.hwaci.com/sw/sqlite/zipvfs.html
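To illustrate the application-file-format idea: the whole "document" is one .db file holding tables and indices. A minimal sketch with invented table and file names (the sqlite3 module ships with Python):

```python
import sqlite3

con = sqlite3.connect("measurements.db")  # the application file itself
con.execute("CREATE TABLE IF NOT EXISTS samples (t REAL, channel INTEGER, value REAL)")
con.execute("CREATE INDEX IF NOT EXISTS samples_t ON samples(t)")
con.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                [(0.0, 1, 0.5), (0.1, 1, 0.7), (0.1, 2, 1.2)])
con.commit()

# the index keeps range queries over t cheap even when the file grows large
for row in con.execute("SELECT * FROM samples WHERE t BETWEEN 0.05 AND 0.2"):
    print(row)
con.close()
```

The multi-dimensional indexing mentioned above comes from SQLite's R*Tree extension, which this sketch does not show.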
Serialization frameworks:
- generate reading/writing code for multiple programming languages
- allow flexible evolution of schemas (adding/removing/deprecating fields)
Mainly used nowadays:
- Google Protocol Buffers:
- advertised as a successor to XML
- heavily used internally at Google (open-sourced in 2008), and thus very sustainable
- the OpenStreetMap project started to use Protocol Buffers (the PBF format):
- http://planet.openstreetmap.org/ offers it for downloading alongside XML
- out-of-the-box encoding/decoding performance is not great
- several OSM editors use a specialized implementation as a workaround: https://github.com/mapbox/protozero
- an example of scientific use is vg, a toolset for analyzing variation graphs:
- its developers are likewise not entirely happy with encoding/decoding times: vgteam/vg#97
- Apache Thrift / Apache Avro:
- similar to Protobuf in approach; I didn't have time to dig deeper
- documentation is horrible compared to the alternatives
Very recent tools (zero-copy decoding, i.e. a file can be just mmap-ed and accessed with minimal overhead):
- Google Flatbuffers:
- Cap'n Proto:
- its author worked at Google, wrote Protocol Buffers v2, and learned all its downsides
- supports only trees of objects, not DAGs; the suggested workaround is to use integer ids
- pycapnp is easier to use and has 10x more downloads on PyPI than flatbuffers (a minimal round trip is sketched after this list)
- intended for interprocess communication (=> support for multiple languages)
- dogfooded by the awesome Sandstorm project
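To give a taste of the workflow, here is a minimal pycapnp round trip. This is a sketch assuming pycapnp is installed; the schema id, struct name, and file names are all invented for illustration:

```python
import capnp  # pip install pycapnp

schema_src = """
@0xbf5147cbbecf40c1;  # every Cap'n Proto schema needs a unique 64-bit id
struct Point {
  x @0 :Float64;
  y @1 :Float64;
}
"""
with open("point.capnp", "w") as f:
    f.write(schema_src)

point_capnp = capnp.load("point.capnp")            # compiles the schema on the fly
msg = point_capnp.Point.new_message(x=1.0, y=2.0)
data = msg.to_bytes()                              # the on-disk/wire representation
with point_capnp.Point.from_bytes(data) as restored:  # zero-copy read
    print(restored.x, restored.y)
```

from_bytes does not parse the buffer up front; fields are read on access, which is the zero-copy property mentioned above.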
Summary:
- creating new formats is easy with Protocol Buffers and the like:
- an evolving, human-readable schema replaces a hand-written specification
- the resulting file format is immediately usable from many programming languages
- storing multidimensional arrays needs a bit of extra work (a sketch follows this list):
- chunking (trivial, especially with matrix libraries like numpy)
- compression (stable libraries: zlib/bzip2 for long-term storage, LZ4 for speed)
- in principle, the two points above can be solved in one shot with Blosc
- the Blosc 1.x format is stable but lacks a formal specification
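For concreteness, a rough sketch of the chunking-plus-compression recipe using numpy and the stdlib zlib (the chunk size and compression level here are arbitrary choices); Blosc would collapse the two steps into a single call:

```python
import zlib
import numpy as np

array = np.arange(10**6, dtype=np.float64).reshape(1000, 1000)

# compress 100-row chunks independently, so later reads can
# decompress only the chunks they actually need
chunks = [zlib.compress(array[i:i + 100].tobytes(), 6)
          for i in range(0, array.shape[0], 100)]

# round trip: decompress each chunk and reassemble the array
rows = [np.frombuffer(zlib.decompress(c), dtype=np.float64).reshape(-1, 1000)
        for c in chunks]
assert (np.vstack(rows) == array).all()
```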
- SQLite/HDF5 are flexible and often good enough, but:
- for large datasets, a careful choice of parameters and settings is required
- complicated data structures have to be fitted into the supported data model (relational and hierarchical, respectively), which takes extra effort