pekkavaa/extending csv.md

Last active November 13, 2022 01:07

Star (11) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/pekkavaa/97d02d6518db8b9e4c5b81fc686553e2.js"></script>
Save pekkavaa/97d02d6518db8b9e4c5b81fc686553e2 to your computer and use it in GitHub Desktop.

Download ZIP

A Straightforward Way To Extend CSV With Metadata

Raw

extending csv.md

A Straightforward Way To Extend CSV With Metadata

Pekka Väänänen, Aug 19 2021.

This proposal is a response to It's Time to Retire the CSV by Alex Rasmussen and the discussion on lobste.rs. Don't take it too seriously.

CSV files (comma-separated values) are great but sometimes difficult to parse because everybody seems to have a slightly different idea what CSV means. The obvious solution is to transmit some metadata that tells what to expect but where do you put it? Well, how about a ZIP archive?

An archive with two files. The first file, say format.txt, has the metadata inside and the second one is the original CSV file unchanged. This is still readable by non-technical users because ZIP files are natively supported by both Windows and macOS. People can double click on them like a directory and then double click again on the CSV to open it up in Excel.

I know it sounds simplistic but if there's a lesson to be learned from the history of computing, it's that stupid ideas often win. By making this extended CSV format at least somewhat backwards compatible, it's possible (in theory) to switch to it without enraging your customers.

The Spec

Let's try to sketch something just for the sake of discussion. Let there be two formats.

The File Format. A ZIP archive, either uncompressed or compressed with the DEFLATE algorithm. The archive contains at least two files:

format.txt, the metadata file
*.csv, a CSV file

There can be multiple CSV files but they must all respect format.txt.

The Metadata Format. Very loose. The first line of format.txt must contain an ASCII encoded metadata type name, terminated by a linefeed. The rest of the file is then interpreted according to that name.

For example if we'd like to use the CSV Dialect then format.txt could say this:

CSV Dialect v1.2
{
  "dialect": {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "commentChar": "#"
  }
}

This way different metadata formats could evolve without breaking the overall scheme.

Would this scheme really help me parse CSVs?

Maybe but possibly not enough. A mismatch between metadata and the CSV can still happen and there's nothing we can do about it as long as CSV is editable by anyone with a text editor. Also, the maximum file size limit of the ZIP format is unfortunate.

FAQ

Q: Why not use a tarball?
- They are incomprehensible to Windows users.
Q: How do you store CSV files larger than the ZIP's maximum file size of 2^32-1 bytes?
- Save the archive as ZIP64. It's supported by Windows Explorer since Vista but macOS seems to have too old of a version of unzip. Not a great solution.
Q: How do you do random access?
- Save the ZIP file uncompressed and put some kind of index in the metadata.
Q: Have you seen that XKCD comic about standards?
- Have you heard of thought-terminating clichés?

cjohnson318 commented Aug 19, 2021

I really like that link to Frictionless Data. It'd be nice to have a pandas function to read a CSV dialect in a zip archive like that and just know what to do with it instead of opening the CSV in vim or something, figuring out what delimiters there are, adding that argument to pandas.read_csv(...), trying to open it, failing, opening the CSV with vim again, etc.

markstos commented Aug 19, 2021

A precedent, the GTFS spec is a zip file of CSVS, some of which could be considered metadata:

https://gtfs.org/

This is an industry standard for transit feeds, championed by Google.

In practice, it's OK to work with, as there are plenty of tools that work with zip files and CSV.

yakovsh commented Aug 19, 2021

(I am the author of the CSV RFC / RFC 4180)

The W3C tried something similar as well:
https://www.w3.org/2013/csvw/wiki/Main_Page

waffletower commented Aug 19, 2021 •

edited

Loading

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be. I think a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Such a tool would not be restricted to returning pedestrian JSON with camelCase identifiers either 👃

djedr commented Aug 19, 2021 •

edited

Loading

Here is a minimalist alternative as a parser in JavaScript: https://gist.github.com/djedr/8349415de904eeade9bb1cb5e860e3ec

You can define custom separator characters for a simple format for tabular data which doesn't have the escaping issues of CSV.

drtbiswas commented Aug 20, 2021

This is common in the 3rd party data vendor industry as well where there would be a schema text file along with the data files in the folder overall that is zipped up transmitted.

sterlinm commented Aug 20, 2021

a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Python has CSV Sniffer that does something like this. Presumably Pandas is also doing this and could expose the inferred dialect without actually reading the rest of the file.

Author

pekkavaa commented Aug 20, 2021

Have you seen the bagit file format Its used in computational biology platforms but is generic enough to apply to any computational workload. It works with the same idea of using zip and seperate files for metadata.

I haven't, it seems pretty much the same idea but with more complexity.

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be.

One can dream, right?

This is common in the 3rd party data vendor industry as well where there would be a schema text file along with the data files in the folder overall that is zipped up transmitted.

Yeah I expected something like that :)

denis-bz commented Oct 17, 2022

One file is better than 2: the metadata stays with the data. How about

# header e.g. metadata
# header ...
# ...
topline  e.g. column names in a .csv
... the rest

A .csv like this can be read with NO changes, ignoring the header, in python pandas:

pd.read_csv( filename, comment="#", sep= ... )

There are simple hacks read_metacsv( csvin ) --> df, headerlines
and write_metacsv( csvout: str, df: pd.DataFrame, header: list[str] )
which I could put up on gist.github for anyone who'd want to test them.

pekkavaa/extending csv.md

A Straightforward Way To Extend CSV With Metadata

The Spec

Would this scheme really help me parse CSVs?

FAQ

cjohnson318 commented Aug 19, 2021

Uh oh!

markstos commented Aug 19, 2021

Uh oh!

yakovsh commented Aug 19, 2021

Uh oh!

waffletower commented Aug 19, 2021 •

edited

Loading

Uh oh!

djedr commented Aug 19, 2021 •

edited

Loading

Uh oh!

drtbiswas commented Aug 20, 2021

Uh oh!

sterlinm commented Aug 20, 2021

Uh oh!

pekkavaa commented Aug 20, 2021

Uh oh!

denis-bz commented Oct 17, 2022

Uh oh!

pekkavaa/extending csv.md

A Straightforward Way To Extend CSV With Metadata

The Spec

Would this scheme really help me parse CSVs?

FAQ

cjohnson318 commented Aug 19, 2021

Uh oh!

markstos commented Aug 19, 2021

Uh oh!

yakovsh commented Aug 19, 2021

Uh oh!

waffletower commented Aug 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

djedr commented Aug 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

drtbiswas commented Aug 20, 2021

Uh oh!

sterlinm commented Aug 20, 2021

Uh oh!

pekkavaa commented Aug 20, 2021

Uh oh!

denis-bz commented Oct 17, 2022

Uh oh!

waffletower commented Aug 19, 2021 •

edited

Loading

djedr commented Aug 19, 2021 •

edited

Loading