Pekka Väänänen, Aug 19 2021.
This proposal is a response to "It's Time to Retire the CSV" by Alex Rasmussen and the discussion on lobste.rs. Don't take it too seriously.
CSV files (comma-separated values) are great but sometimes difficult to parse because everybody seems to have a slightly different idea of what CSV means. The obvious solution is to transmit some metadata that tells the reader what to expect, but where do you put it? Well, how about a ZIP archive?
An archive with two files. The first file, say format.txt, has the metadata inside and the second one is the original CSV file unchanged. This is still readable by non-technical users because ZIP files are natively supported by both Windows and macOS. People can double click on them like a directory and then double click again on the CSV to open it up in Excel.
I know it sounds simplistic but if there's a lesson to be learned from the history of computing, it's that stupid ideas often win. By making this extended CSV format at least somewhat backwards compatible, it's possible (in theory) to switch to it without enraging your customers.
Let's try to sketch something just for the sake of discussion. Let there be two formats.
The File Format. A ZIP archive, either uncompressed or compressed with the DEFLATE algorithm. The archive contains at least two files:
- format.txt, the metadata file
- *.csv, a CSV file
There can be multiple CSV files but they must all respect format.txt.
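As a rough illustration of the layout above, here is a minimal sketch that writes such an archive with Python's standard zipfile module. The function name write_csv_archive and its parameters are made up for this example, and encoding format.txt as UTF-8 is an assumption (the proposal only requires the first line to be ASCII).

```python
import io
import zipfile


def write_csv_archive(target, metadata_text, csv_files, compress=True):
    """Write format.txt plus one or more unchanged CSV files into a ZIP.

    target        -- a path or a writable binary file object
    metadata_text -- full contents of format.txt as a str
    csv_files     -- dict mapping member names ("*.csv") to raw bytes
    (All names here are hypothetical, chosen for this sketch.)
    """
    # The proposal allows two storage modes: uncompressed or DEFLATE.
    method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(target, "w", compression=method) as zf:
        # Assumption: UTF-8, which is a superset of the required ASCII.
        zf.writestr("format.txt", metadata_text.encode("utf-8"))
        for name, data in csv_files.items():
            if not name.lower().endswith(".csv"):
                raise ValueError(f"not a CSV member name: {name!r}")
            zf.writestr(name, data)  # the CSV bytes are not modified


buffer = io.BytesIO()
write_csv_archive(
    buffer,
    'CSV Dialect v1.2\n{"dialect": {"delimiter": ";"}}\n',
    {"data.csv": b"name;age\r\nAda;36\r\n"},
)
```

Passing a file name instead of the in-memory buffer would produce an archive that can be opened by double clicking, as described earlier.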
The Metadata Format. Very loose. The first line of format.txt must contain an ASCII encoded metadata type name, terminated by a linefeed. The rest of the file is then interpreted according to that name.
For example, if we'd like to use the CSV Dialect, then format.txt could say this:
CSV Dialect v1.2
{
  "dialect": {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "commentChar": "#"
  }
}
This way different metadata formats could evolve without breaking the overall scheme.
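To make the dispatch on the first line concrete, here is a hedged sketch of a reader that understands only the "CSV Dialect v1.2" type name from the example. The function names are hypothetical, the UTF-8 decoding is an assumption, and the mapping from dialect keys to Python's csv.reader options is one possible interpretation rather than a definitive one. The csv module has no comment support, so comment lines are filtered naively.

```python
import csv
import io
import json
import zipfile


def parse_metadata(text):
    """Split format.txt into its type name (first line) and parsed body."""
    type_name, _, body = text.partition("\n")
    type_name = type_name.rstrip("\r")
    if type_name != "CSV Dialect v1.2":
        raise ValueError(f"unsupported metadata type: {type_name!r}")
    return json.loads(body)["dialect"]


def read_rows(csv_text, dialect):
    """Parse CSV text according to the dialect dict; returns a list of rows."""
    comment = dialect.get("commentChar")
    # newline="" keeps the original line endings for the csv module.
    lines = io.StringIO(csv_text, newline="")
    if comment:
        # Naive filter: it would also drop a continuation line of a quoted
        # multi-line field that happens to start with the comment character.
        lines = (line for line in lines if not line.startswith(comment))
    reader = csv.reader(
        lines,
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
        skipinitialspace=dialect.get("skipInitialSpace", False),
    )
    return list(reader)


def read_csv_archive(source):
    """Return {member name: {"header": list or None, "rows": list}}."""
    tables = {}
    with zipfile.ZipFile(source) as zf:
        # Assumption: both format.txt and the CSV files are UTF-8.
        dialect = parse_metadata(zf.read("format.txt").decode("utf-8"))
        for name in zf.namelist():
            if not name.lower().endswith(".csv"):
                continue
            rows = read_rows(zf.read(name).decode("utf-8"), dialect)
            header = None
            if dialect.get("header") and rows:
                header, rows = rows[0], rows[1:]
            tables[name] = {"header": header, "rows": rows}
    return tables
```

A reader that meets an unknown type name fails loudly here instead of guessing, which is what lets other metadata formats be added later without silently misreading old files.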
Maybe, but possibly not enough. A mismatch between the metadata and the CSV can still happen, and there's nothing we can do about it as long as the CSV is editable by anyone with a text editor. Also, the maximum file size limit of the ZIP format is unfortunate.
- Q: Why not use a tarball?
- They are incomprehensible to Windows users.
- Q: How do you store CSV files larger than the ZIP's maximum file size of 2^32-1 bytes?
- Save the archive as ZIP64. It's supported by Windows Explorer since Vista, but macOS seems to ship too old a version of unzip. Not a great solution.
- Q: How do you do random access?
- Save the ZIP file uncompressed and put some kind of index in the metadata.
- Q: Have you seen that XKCD comic about standards?
- Have you heard of thought-terminating clichés?
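The random access answer above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the metadata type name "Row Offsets v0", the "rowOffsets" key, and the function names are all hypothetical, and the index assumes no quoted field contains a line break. The archive is written uncompressed so that a byte offset into the CSV maps directly to a position in the stored member.

```python
import io
import json
import zipfile


def build_row_offsets(csv_bytes):
    """Return the byte offset at which each line of the CSV starts."""
    offsets = []
    position = 0
    for line in io.BytesIO(csv_bytes):  # splits on b"\n", keeps terminators
        offsets.append(position)
        position += len(line)
    return offsets


def write_indexed_archive(target, csv_name, csv_bytes):
    """Store the CSV uncompressed and put a row index into format.txt."""
    index = {"rowOffsets": {csv_name: build_row_offsets(csv_bytes)}}
    metadata = "Row Offsets v0\n" + json.dumps(index)  # hypothetical type
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("format.txt", metadata)
        zf.writestr(csv_name, csv_bytes)


def read_row(source, csv_name, row_number):
    """Fetch one raw line by seeking inside the stored member."""
    with zipfile.ZipFile(source) as zf:
        _, _, body = zf.read("format.txt").decode("utf-8").partition("\n")
        offsets = json.loads(body)["rowOffsets"][csv_name]
        with zf.open(csv_name) as member:
            member.seek(offsets[row_number])
            return member.readline()
```

On the size question, Python's zipfile enables the ZIP64 extensions by default (allowZip64=True), so a writer like this one would switch to ZIP64 only when a member or the archive actually needs it.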
Python's csv module has a Sniffer class that does something like this. Presumably pandas does this as well and could expose the inferred dialect without actually reading the rest of the file.
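For reference, a small sketch of what csv.Sniffer offers. The helper name infer_dialect and the choice of candidate delimiters are assumptions made for this example; the returned keys simply mirror the CSV Dialect names used earlier. Sniffing is a heuristic, so the result is a guess and not a guarantee.

```python
import csv


def infer_dialect(sample):
    """Guess dialect settings from a text sample of a CSV file."""
    # Restricting the candidates keeps the heuristic from picking
    # an unlikely character as the delimiter.
    dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
    return {
        "delimiter": dialect.delimiter,
        "quoteChar": dialect.quotechar,
        "doubleQuote": dialect.doublequote,
        "skipInitialSpace": dialect.skipinitialspace,
    }


guess = infer_dialect("name;age\r\nAda;36\r\nGrace;45\r\n")
print(guess["delimiter"])
```

A tool that produces the archive could run something like this once and write the result into format.txt, so that readers no longer have to guess.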