{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick HDF5 benchmarks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We compare the performance of reading a subset of a large array:\n", | |
"* in memory with NumPy\n", | |
"* with h5py\n", | |
"* with memmap using an HDF5 file\n", | |
"* with memmap using an NPY file\n", | |
"\n", | |
"This illustrates our performance issues with HDF5 in our very particular use case (accessing a small number of lines in a large \"vertical\" rectangular array)." | |
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.random.seed(2016)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use this function to bypass the slow h5py data access with a faster memory mapping (only works on uncompressed contiguous datasets):" | |
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def _mmap_h5(path, h5path):\n", | |
" with h5py.File(path) as f:\n", | |
" ds = f[h5path]\n", | |
" # We get the dataset address in the HDF5 fiel.\n", | |
" offset = ds.id.get_offset()\n", | |
" # We ensure we have a non-compressed contiguous array.\n", | |
" assert ds.chunks is None\n", | |
" assert ds.compression is None\n", | |
" assert offset > 0\n", | |
" dtype = ds.dtype\n", | |
" shape = ds.shape\n", | |
" arr = np.memmap(path, mode='r', shape=shape, offset=offset, dtype=dtype)\n", | |
" return arr" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of lines in our test array:" | |
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"shape = (100000, 1000)\n",
"n, ncols = shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We generate a random array:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"arr = np.random.rand(n, ncols).astype(np.float32)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We write it to a file:" | |
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with h5py.File('test.h5', 'w') as f:\n",
"    f['/test'] = arr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the file once in read mode." | |
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"f = h5py.File('test.h5', 'r')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"np.save('test.npy', arr)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 781272\r\n",
"-rw-rw-r-- 1 cyrille cyrille 400002144 janv. 8 09:32 test.h5\r\n",
"-rw-rw-r-- 1 cyrille cyrille 400000080 janv. 8 09:32 test.npy\r\n",
"-rw-rw-r-- 1 cyrille cyrille 7364 janv. 8 09:37 benchmark.ipynb\r\n"
]
}
],
"source": [
"%ls -lrt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Slices"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ind = slice(None, None, 100)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in memory\n",
"1000 loops, best of 3: 741 µs per loop\n",
"\n",
"h5py\n",
"100 loops, best of 3: 9.65 ms per loop\n",
"\n",
"memmap of HDF5 file\n",
"100 loops, best of 3: 3.95 ms per loop\n",
"\n",
"memmap of NPY file\n",
"100 loops, best of 3: 3.75 ms per loop\n"
]
}
],
"source": [
"print('in memory')\n",
"%timeit arr[ind, :] * 1\n",
"print()\n",
"print('h5py')\n",
"%timeit f['/test'][ind, :] * 1\n",
"print()\n",
"print('memmap of HDF5 file')\n",
"%timeit _mmap_h5('test.h5', '/test')[ind, :] * 1\n",
"print()\n",
"print('memmap of NPY file')\n",
"%timeit np.load('test.npy', mmap_mode='r')[ind, :] * 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fancy indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fancy indexing is what we have to use in our particular use-case." | |
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ind = np.unique(np.random.randint(0, n, n // 100))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"999"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ind)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in memory\n",
"100 loops, best of 3: 2.05 ms per loop\n",
"\n",
"h5py\n",
"10 loops, best of 3: 53.3 ms per loop\n",
"\n",
"memmap of HDF5 file\n",
"100 loops, best of 3: 5.62 ms per loop\n",
"\n",
"memmap of NPY file\n",
"100 loops, best of 3: 5.12 ms per loop\n"
]
}
],
"source": [
"print('in memory')\n",
"%timeit arr[ind, :] * 1\n",
"print()\n",
"print('h5py')\n",
"%timeit f['/test'][ind, :] * 1\n",
"print()\n",
"print('memmap of HDF5 file')\n",
"%timeit _mmap_h5('test.h5', '/test')[ind, :] * 1\n",
"print()\n",
"print('memmap of NPY file')\n",
"%timeit np.load('test.npy', mmap_mode='r')[ind, :] * 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that h5py uses a [slow algorithm for fancy indexing](https://gist.github.com/rossant/7b4704e8caeb8f173084#gistcomment-1665072), so HDF5 is not the only cause of the slowdown." | |
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
I messed up my first comment, sorry for that. I'm pasting it again for reference:
I'm not entirely sure you're aware of what happens during each call. I'd suggest trying this:
%%timeit -n1 -r1
f = h5py.File('test.h5', 'r')
print(f['/test'][:2, :3])
f.close()
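The point of the snippet above is that every timed call in the notebook re-opens the file, so the h5py and memmap numbers include the file-open overhead, not just the read itself. As a minimal sketch of how one might time the read alone (assuming the test.h5 file created in the notebook; time.perf_counter is used instead of %timeit to keep the example self-contained):
import time
import h5py
# Open the file once, outside the timed region, so that only the read
# itself is measured; the benchmark cells re-create the file handle
# (or the memory map) on every iteration.
f = h5py.File('test.h5', 'r')
ds = f['/test']
t0 = time.perf_counter()
x = ds[::100, :] * 1  # same access pattern as the notebook's slice benchmark
t1 = time.perf_counter()
print('read with the file already open: %.2f ms' % ((t1 - t0) * 1000))
f.close()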
Benchmark updated following the comment by Stuart Berg above.
@rossant, a big part of this is that the fancy-indexing code in h5py uses a naive algorithm based on repeated hyperslab selection, which is quadratic in the number of indices. It was designed/tested for small numbers of indices.
The particular example you have here (0 to 10000 in steps of 10) can be mapped to slices (although of course this is not generally true). In this case the results are:
%%timeit f = h5py.File('test.h5','r')
f['/test'][0:10000:10]
100 loops, best of 3: 12 ms per loop
This is a great argument to improve the implementation of fancy indexing in h5py, but I would hesitate to conclude "HDF5 is slow".
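To make the slice-mapping idea concrete, here is a minimal sketch (not from the original comment; the helper name indices_to_slice is hypothetical) of collapsing a regularly spaced index array into a single slice before handing it to h5py:
import numpy as np

def indices_to_slice(ind):
    # Collapse a sorted, evenly spaced index array into a slice.
    # Return None when the indices are not evenly spaced, in which
    # case fancy indexing (or another strategy) is still required.
    ind = np.asarray(ind)
    if len(ind) == 0:
        return slice(0, 0)
    if len(ind) == 1:
        return slice(int(ind[0]), int(ind[0]) + 1)
    steps = np.diff(ind)
    if steps[0] > 0 and (steps == steps[0]).all():
        return slice(int(ind[0]), int(ind[-1]) + 1, int(steps[0]))
    return None

# Example: 0 to 10000 in steps of 10 maps to slice(0, 9991, 10),
# which h5py can serve with a single hyperslab selection.
# s = indices_to_slice(np.arange(0, 10000, 10))
# with h5py.File('test.h5', 'r') as f:
#     data = f['/test'][s] if s is not None else f['/test'][ind]
When the indices are irregular, one common fallback is to read the smallest enclosing slice and subselect in NumPy, trading some extra I/O for a single hyperslab selection instead of the quadratic per-index path.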
@andrewcollette Thanks for your comment; I've updated the benchmarks and the post accordingly. And thanks for your work on h5py! Despite the problems we've had with HDF5, I actually like the h5py API and how it fits so naturally with NumPy.