Compare HDF5 and Feather performance (speed, file size) for storing / reading pandas dataframes

osjatest commented May 11, 2020

It would be great if the author could extend the benchmark to use compression for the HDF5 format. I'm looking for the best data format to store a huge amount of data split across files of ~3000 rows each. Since I need to store a huge number of such files, I have to trade off speed against size. In that sense, compression is an important parameter for me, and I'm interested in comparing compressed HDF5 with Feather.

abalhomaid commented Mar 4, 2021

Very clear and concise comparison. Many thanks for doing this!

Davidmenamm commented Nov 2, 2021

Good analysis, thanks!

fizban99 commented Jun 19, 2022

As of 2022, to_feather compresses data by default with lz4. Using hdf5 with blosc:lz4 at complevel 5 reaches a similar compression ratio. If you add strings into the mix, the superiority of feather is not that clear with big dataframes, especially in reading times. See the modified version at https://github.com/fizban99/hdf_vs_feather/blob/main/hdf_vs_feather.ipynb

Qoo0607 commented Aug 13, 2022

Great analysis, thanks for sharing.

gansanay/hdf_vs_feather.ipynb
