Summary of the problem from the mz5 paper (written about .mzML, but equally true for .imzML):
Although based on excellent ontologies, relying on the extended markup language (XML) for the straightforward implementation of mzData, mzXML, and mzML makes for a major efficiency bottleneck. XML was designed to be a human readable, textual data format with considerable inherent verbosity and redundancy. XML was not designed for efficient bulk data storage, and the general modus operandi requires reading complete files to construct the XML parse tree. The mzXML and mzML formats partly circumvent these limitations by using base-64 encoding and (optional) compression of the raw MS scan data in combination with an application-specific indexing system. Despite the improvements gained from these efforts, vendor formats in general outperform mzXML and mzML in terms of space requirements, as well as in read and write efficiency.
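The base-64 encoding and optional compression mentioned in the quote can be illustrated with a minimal sketch. It assumes little-endian 64-bit floats and zlib as the compressor; the function names and the sample m/z values are made up for illustration and are not taken from any particular mzML library.

```python
import base64
import struct
import zlib


def encode_binary_array(values, compress=True):
    """Pack floats as little-endian 64-bit, optionally zlib-compress, then base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary_array(text, compressed=True):
    """Invert encode_binary_array: base64-decode, optionally decompress, unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical m/z axis of 1000 points (8000 bytes as raw doubles).
mz_values = [100.0 + 0.01 * i for i in range(1000)]
encoded = encode_binary_array(mz_values)
decoded = decode_binary_array(encoded)
```

Without compression, base-64 alone turns the 8000 raw bytes into 10668 characters, so the text encoding costs roughly a third in size before any XML overhead is counted.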
- currently limited to collections of (m/z, abundance) and (time, abundance)
- designed for LC-MS data, but extension to imaging MS data should be easy
- usage of JSON for metadata was considered but rejected
- instead, metadata can be stored as XML, although dedicated metadata tables also exist
- not possible to store both centroided and profile data in the same file
- data is organized into chunks
- range queries are implemented with the R*-tree structure that is built into SQLite
- SQLite does all the indexing, although setting up the chunking and multiple indices is not trivial
- compression (MS-Numpress) is planned for the next version
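
The combination of chunking and R*-tree range queries described above can be sketched with Python's `sqlite3` module. This is only an illustration: the table name `chunk_index`, its columns, and the idea of indexing one bounding box per chunk in (m/z, retention time) are assumptions, not the format's actual schema. It also assumes the SQLite build includes the R*Tree module, which is a compile-time option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical index: one row per data chunk, holding the chunk's bounding box
# in m/z and retention time. rtree tables take an id plus min/max pairs.
conn.execute(
    "CREATE VIRTUAL TABLE chunk_index "
    "USING rtree(id, min_mz, max_mz, min_rt, max_rt)"
)
chunks = [
    (1, 100.0, 105.0, 0.0, 60.0),
    (2, 105.0, 110.0, 0.0, 60.0),
    (3, 100.0, 105.0, 60.0, 120.0),
    (4, 105.0, 110.0, 60.0, 120.0),
]
conn.executemany("INSERT INTO chunk_index VALUES (?, ?, ?, ?, ?)", chunks)


def chunks_overlapping(conn, mz_lo, mz_hi, rt_lo, rt_hi):
    """Return ids of chunks whose bounding box overlaps the query window."""
    rows = conn.execute(
        "SELECT id FROM chunk_index "
        "WHERE max_mz >= ? AND min_mz <= ? AND max_rt >= ? AND min_rt <= ? "
        "ORDER BY id",
        (mz_lo, mz_hi, rt_lo, rt_hi),
    )
    return [row[0] for row in rows]


# Only chunk 3 covers m/z 101-102 at retention times 70-80.
hits = chunks_overlapping(conn, 101.0, 102.0, 70.0, 80.0)
```

A reader would then load only the chunks listed in `hits` rather than scanning the whole file, which is the point of pairing chunked storage with a spatial index.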
OpenMSI data format (HDF5-based)
- Designed only for imaging MS data, not for LC-MS
- Supports only profile-mode data (binning is performed on centroided data)
- By default, stores two copies of the data for fast access to both spectra and images
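
Why two copies help can be shown with a small sketch. It uses flat in-memory arrays instead of HDF5 datasets, so it is not how OpenMSI is implemented; it only illustrates that a spectrum is contiguous in one layout and an ion image is contiguous in the other. All names and the toy cube are hypothetical.

```python
from array import array


def build_two_layouts(cube, n_pixels, n_bins):
    """Flatten cube[pixel][bin] into spectrum-major and image-major arrays."""
    spectrum_major = array(
        "f", (cube[p][b] for p in range(n_pixels) for b in range(n_bins))
    )
    image_major = array(
        "f", (cube[p][b] for b in range(n_bins) for p in range(n_pixels))
    )
    return spectrum_major, image_major


def read_spectrum(spectrum_major, pixel, n_bins):
    """One pixel's full spectrum is a single contiguous slice."""
    start = pixel * n_bins
    return list(spectrum_major[start:start + n_bins])


def read_ion_image(image_major, mz_bin, n_pixels):
    """One m/z bin across all pixels is a single contiguous slice."""
    start = mz_bin * n_pixels
    return list(image_major[start:start + n_pixels])


# Toy data: 4 pixels, 3 m/z bins, intensity = 10 * pixel + bin.
n_pixels, n_bins = 4, 3
cube = [[float(10 * p + b) for b in range(n_bins)] for p in range(n_pixels)]
spectrum_major, image_major = build_two_layouts(cube, n_pixels, n_bins)
```

The cost is doubled storage; the benefit is that neither access pattern needs a strided read across the whole dataset.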
Some closed formats also store two copies of the data in profile mode, for example msiQuant and the SCiLS Lab .sl format. In msiQuant, centroided data is not binned but converted to profile mode via resolution estimation and Gaussian smoothing.
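
The general idea of turning centroids into a profile by Gaussian smoothing can be sketched as follows. This is not msiQuant's algorithm, which is closed. The sketch assumes a known constant resolving power R with FWHM = m/z / R, whereas a real tool would estimate the resolution from the data; the function name and sample values are hypothetical.

```python
import math


def centroid_to_profile(peaks, mz_axis, resolving_power):
    """Sum one Gaussian per centroid peak, evaluated on a fixed m/z axis.

    Assumption: peak width follows FWHM = mz / resolving_power.
    """
    fwhm_to_sigma = 2.0 * math.sqrt(2.0 * math.log(2.0))
    profile = [0.0] * len(mz_axis)
    for mz, intensity in peaks:
        sigma = mz / resolving_power / fwhm_to_sigma
        for i, x in enumerate(mz_axis):
            profile[i] += intensity * math.exp(-0.5 * ((x - mz) / sigma) ** 2)
    return profile


# Hypothetical example: one centroid at m/z 500 on an axis with 0.001 spacing.
mz_axis = [499.9 + 0.001 * i for i in range(201)]
peaks = [(500.0, 1000.0)]
profile = centroid_to_profile(peaks, mz_axis, resolving_power=20000)
```

With R = 20000 the peak at m/z 500 has an FWHM of 0.025, so the profile drops to half its apex height about 0.0125 on either side of the centroid.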