(This document focuses mostly on data storage in scientific applications, not on web protocols.)
- XML:
- slow to parse
- schemas (.xsd) are human-readable but hard to edit without special software
- tooling for generating reading/writing code is limited (mostly to Java)
- not suited for binary data
- more on XML: http://c2.com/cgi/wiki?XmlSucks
- JSON:
- much simpler than XML, but also more limited
- shares most of XML's disadvantages
- web-friendly
- HDF5:
- designed for storing groups of huge N-dimensional arrays
- insanely complex format specification
- "The reference implementation of the HDF5 File Format and I/O Library (http://hdf.ncsa.uiuc.edu/HDF5/) consists of approximately 2073 files or about 917,000 lines of the source code."
- there's no full implementation beyond the reference one
- includes chunking and compression of arrays
- needs parameter tweaking to get good performance (see the h5py sketch after this list)
- NASA uses it to store Earth observation data
- no built-in indexing support
- PyTables added indices on top of HDF5 (http://www.blosc.org/docs/OPSI-indexes.pdf), but using them limits file access to Python only
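To make the chunking/compression point concrete, here is a minimal h5py sketch (the file and dataset names are invented for illustration; assumes numpy and h5py are installed). The chunk shape and compression level are exactly the kind of parameters that need tweaking:

```python
import numpy as np
import h5py

data = np.random.rand(1000, 1000)

with h5py.File("arrays.h5", "w") as f:
    # chunks= and compression= are per-dataset creation parameters;
    # picking them well is the "parameter tweaking" mentioned above
    f.create_dataset("grid", data=data,
                     chunks=(100, 100), compression="gzip",
                     compression_opts=4)

with h5py.File("arrays.h5", "r") as f:
    block = f["grid"][:100, :100]  # decompresses only the chunks it overlaps
```

Reading a slice touches only the chunks that overlap it, which is what makes chunking pay off for large arrays.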
- SQLite:
- flexible single-file SQL database, perfect for storing tables
- very widely used
- built-in indexing (including multi-dimensional)
- the authors suggest using it as an application file format (see the sketch after this list): https://www.sqlite.org/appfileformat.html
- no compression, although it's available as a commercial add-on: http://www.hwaci.com/sw/sqlite/zipvfs.html
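To illustrate the application-file-format idea: the whole "document" is one .db file holding tables and indices. A minimal sketch with invented table and file names (the sqlite3 module ships with Python):

```python
import sqlite3

con = sqlite3.connect("measurements.db")  # the application file itself
con.execute("CREATE TABLE IF NOT EXISTS samples (t REAL, channel INTEGER, value REAL)")
con.execute("CREATE INDEX IF NOT EXISTS samples_t ON samples(t)")
con.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                [(0.0, 1, 0.5), (0.1, 1, 0.7), (0.1, 2, 1.2)])
con.commit()

# the index keeps range queries over t cheap even when the file grows large
for row in con.execute("SELECT * FROM samples WHERE t BETWEEN 0.05 AND 0.2"):
    print(row)
con.close()
```

The multi-dimensional indexing mentioned above comes from SQLite's R*Tree extension, which this sketch does not show.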
Serialization frameworks:
- generate reading/writing code for multiple programming languages
- allow flexible evolution of schemas (adding/removing/deprecating fields)
Mainly used nowadays:
- Google Protocol Buffers:
- advertised as a successor to XML
- heavily used internally at Google (open-sourced in 2008), and thus very sustainable
- the OpenStreetMap project started to use Protocol Buffers (the PBF format):
- http://planet.openstreetmap.org/ offers it for downloading alongside XML
- out-of-the-box encoding/decoding performance is not great
- several OSM editors use a specialized implementation as a workaround: https://github.com/mapbox/protozero
- an example of scientific use is vg, a toolset for analyzing variation graphs:
- its developers are likewise not entirely happy with encoding/decoding times: vgteam/vg#97
- Apache Thrift / Apache Avro:
- similar to Protobuf in approach; I didn't have time to dig deeper
- documentation is horrible compared to the alternatives
Very recent tools (zero-copy decoding, i.e. a file can be just mmap-ed and accessed with minimal overhead):
- Google Flatbuffers:
- Cap'n Proto:
- its author worked at Google, wrote Protocol Buffers v2, and learned all its downsides
- supports only trees of objects, not DAGs; the suggested workaround is to use integer ids
- pycapnp is easier to use and has 10x more downloads on PyPI than flatbuffers (a minimal round trip is sketched after this list)
- intended for interprocess communication (=> support for multiple languages)
- dogfooded by the awesome Sandstorm project
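To give a taste of the workflow, here is a minimal pycapnp round trip. This is a sketch assuming pycapnp is installed; the schema id, struct name, and file names are all invented for illustration:

```python
import capnp  # pip install pycapnp

schema_src = """
@0xbf5147cbbecf40c1;  # every Cap'n Proto schema needs a unique 64-bit id
struct Point {
  x @0 :Float64;
  y @1 :Float64;
}
"""
with open("point.capnp", "w") as f:
    f.write(schema_src)

point_capnp = capnp.load("point.capnp")            # compiles the schema on the fly
msg = point_capnp.Point.new_message(x=1.0, y=2.0)
data = msg.to_bytes()                              # the on-disk/wire representation
with point_capnp.Point.from_bytes(data) as restored:  # zero-copy read
    print(restored.x, restored.y)
```

from_bytes does not parse the buffer up front; fields are read on access, which is the zero-copy property mentioned above.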
Summary:
- creating new formats is easy with Protocol Buffers and the like:
- an evolving, human-readable schema replaces a hand-written specification
- the resulting file format is immediately usable from many programming languages
- storing multidimensional arrays needs a bit of extra work (a sketch follows this list):
- chunking (trivial, especially with matrix libraries like numpy)
- compression (stable libraries: zlib/bzip2 for long-term storage, LZ4 for speed)
- in principle, the two points above can be solved in one shot with Blosc
- the Blosc 1.x format is stable but lacks a formal specification
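For concreteness, a rough sketch of the chunking-plus-compression recipe using numpy and the stdlib zlib (the chunk size and compression level here are arbitrary choices); Blosc would collapse the two steps into a single call:

```python
import zlib
import numpy as np

array = np.arange(10**6, dtype=np.float64).reshape(1000, 1000)

# compress 100-row chunks independently, so later reads can
# decompress only the chunks they actually need
chunks = [zlib.compress(array[i:i + 100].tobytes(), 6)
          for i in range(0, array.shape[0], 100)]

# round trip: decompress each chunk and reassemble the array
rows = [np.frombuffer(zlib.decompress(c), dtype=np.float64).reshape(-1, 1000)
        for c in chunks]
assert (np.vstack(rows) == array).all()
```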
- SQLite/HDF5 are flexible and often good enough, but:
- for large datasets, a careful choice of parameters and settings is required
- complicated data structures have to be fitted into the supported data model (relational and hierarchical, respectively), which takes extra effort