@jfsantos
Last active May 11, 2020 06:57
from keras.models import Sequential
from keras.layers import Dense
from keras.utils.io_utils import HDF5Matrix
import numpy as np
def create_dataset():
    import h5py
    X = np.random.randn(200,10).astype('float32')
    y = np.random.randint(0, 2, size=(200,1))
    f = h5py.File('test.h5', 'w')
    # Creating dataset to store features
    X_dset = f.create_dataset('my_data', (200,10), dtype='f')
    X_dset[:] = X
    # Creating dataset to store labels
    y_dset = f.create_dataset('my_labels', (200,1), dtype='i')
    y_dset[:] = y
    f.close()

create_dataset()
# Instantiating HDF5Matrix for the training set, which is a slice of the first 150 elements
X_train = HDF5Matrix('test.h5', 'my_data', start=0, end=150)
y_train = HDF5Matrix('test.h5', 'my_labels', start=0, end=150)
# Likewise for the test set
X_test = HDF5Matrix('test.h5', 'my_data', start=150, end=200)
y_test = HDF5Matrix('test.h5', 'my_labels', start=150, end=200)
# HDF5Matrix objects behave more or less like NumPy arrays with regard to indexing
print(y_train[10])
# But they do not support negative indices, so don't try print(X_train[-1])
model = Sequential()
model.add(Dense(64, input_shape=(10,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')
# Note: you have to use shuffle='batch' or False with HDF5Matrix
model.fit(X_train, y_train, batch_size=32, shuffle='batch')
model.evaluate(X_test, y_test, batch_size=32)
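
A note on shuffle='batch' above: as far as I understand Keras's behavior, that mode shuffles the order of batch-sized chunks while each batch stays a contiguous range, which is what makes slice-based reads from an HDF5 file possible. The following is a minimal pure-Python sketch of that idea, not the Keras implementation; the name batch_shuffle_ranges and its parameters are made up for illustration.

```python
import random


def batch_shuffle_ranges(n_samples, batch_size, seed=None):
    """Return (start, stop) ranges in shuffled order; each range is contiguous.

    Hypothetical helper for illustration only, not part of Keras.
    """
    rng = random.Random(seed)
    n_full = n_samples // batch_size
    # Full batches are contiguous index ranges.
    ranges = [(i * batch_size, (i + 1) * batch_size) for i in range(n_full)]
    # Shuffle the order of the batches, not the samples inside them.
    rng.shuffle(ranges)
    # A trailing partial batch, if any, is appended at the end.
    if n_samples % batch_size:
        ranges.append((n_full * batch_size, n_samples))
    return ranges


for start, stop in batch_shuffle_ranges(150, 32, seed=0):
    # With an HDF5-backed dataset this would be a single slice read,
    # e.g. X_train[start:stop].
    print(start, stop)
```

A fully shuffled index list, by contrast, cannot be expressed as one start/stop slice per batch.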
@markjay4k

thank you. this is the kind of example I was looking for.

@yshean

yshean commented Apr 27, 2017

I'm still wondering if I could use HDF5Matrix for multiple input/output model in Keras...

@unnikrishnansivakumar

unnikrishnansivakumar commented May 1, 2017

I am using keras with theano backend.

model.fit(X_train, y_train, batch_size=32, shuffle='batch')

gives me

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\User\Anaconda2\lib\site-packages\keras\models.py", line 856, in fit
    initial_epoch=initial_epoch)
  File "C:\Users\User\Anaconda2\lib\site-packages\keras\engine\training.py", line 1498, in fit
    initial_epoch=initial_epoch)
  File "C:\Users\User\Anaconda2\lib\site-packages\keras\engine\training.py", line 1143, in _fit_loop
    ins_batch = _slice_arrays(ins, batch_ids)
  File "C:\Users\User\Anaconda2\lib\site-packages\keras\engine\training.py", line 394, in _slice_arrays
    return [x[start] for x in arrays]
  File "C:\Users\User\Anaconda2\lib\site-packages\keras\utils\io_utils.py", line 65, in __getitem__
    start, stop = key.start, key.stop
AttributeError: 'list' object has no attribute 'start'

Please help

@lamenramen

Hi, I just ran your code (example_hdf5matrix.py) and it does not work.

I get the following error trace:

AttributeError                            Traceback (most recent call last)
<ipython-input-1-bd1f3342a35e> in <module>()
     28 
     29 # HDF5Matrix behave more or less like Numpy matrices with regards to indexing
---> 30 print(y_train[10])
     31 # But they do not support negative indices, so don't try print(X_train[-1])
     32 

/home/dnn/.local/lib/python3.5/site-packages/keras/utils/io_utils.py in __getitem__(self, key)
     63 
     64     def __getitem__(self, key):
---> 65         start, stop = key.start, key.stop
     66         if isinstance(key, slice):
     67             if start is None:

AttributeError: 'int' object has no attribute 'start'
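
Both tracebacks above fail on the same line: key.start is read before the code checks whether key is a slice, so an int key or a list key raises AttributeError. Below is a hedged sketch of a check order that avoids this. OffsetView is a hypothetical stand-in written for illustration; it is not the Keras HDF5Matrix code, and it wraps any sliceable object (a list here, though an h5py dataset should work the same way for slices and ints).

```python
class OffsetView:
    """Hypothetical windowed view over a sliceable dataset (illustration only)."""

    def __init__(self, data, start=0, end=None):
        self.data = data
        self.start = start
        self.end = len(data) if end is None else end

    def __len__(self):
        return self.end - self.start

    def __getitem__(self, key):
        n = len(self)
        # Check the key type first, and only then touch slice attributes.
        if isinstance(key, slice):
            if key.step not in (None, 1):
                raise IndexError("only contiguous slices are supported")
            lo = 0 if key.start is None else key.start
            hi = n if key.stop is None else key.stop
            if lo < 0 or hi < 0:
                raise IndexError("negative indices are not supported")
            hi = min(hi, n)
            return self.data[self.start + lo:self.start + hi]
        if isinstance(key, int):
            if not 0 <= key < n:
                raise IndexError("index out of range or negative")
            return self.data[self.start + key]
        if isinstance(key, (list, tuple)):
            # Element-by-element lookup: simple, but slow on disk-backed data.
            return [self[k] for k in key]
        raise TypeError("unsupported key type: %r" % type(key))


view = OffsetView(list(range(200)), start=150, end=200)
print(view[10])      # an int key
print(view[0:3])     # a slice key
print(view[[1, 3]])  # a list key
```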

@thisisjl

Hi, I am using HDF5Matrix to load a dataset and train my model with it. Compared to training on a NumPy array with the same contents, training a Keras model with the HDF5Matrix results in much slower learning: after the first epoch I get 10% accuracy with the HDF5Matrix, but 40% accuracy with the NumPy array. I have posted in the Keras forum for help as well; see the post for more details. Thank you

@kennethells

@lamenramen I got the same error. Did you ever figure it out?

@szymonk92

Works well until you create HDF5 using Pandas

@Shawn-Shan

HDF5Matrix is much slower when I read data batch by batch or in a for loop. Here is a quick modification:

import h5py
from keras.utils import Sequence

file_name = "data.h5"
class DataGenerator(Sequence):
    def __init__(self, file_name, batch_size=1024, data_split=100):
        self.hf = h5py.File(file_name, 'r')
        y_all = self.hf['y_train'][:]
        self.total_len = len(y_all)
        self.batch_size = batch_size
        self.idx = 0
        self.len_segment = int(self.total_len / data_split)
        self.cur_seg_idx = 0
        self.x_cur = self.hf['x_train'][:self.len_segment]
        self.y_cur = self.hf['y_train'][:self.len_segment]

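    def next_seg(self):
        # Advance to the next segment; wrap to the start once the data is exhausted
        self.cur_seg_idx += self.len_segment
        if self.cur_seg_idx >= self.total_len:
            self.cur_seg_idx = 0
        self.x_cur = self.hf['x_train'][self.cur_seg_idx:self.cur_seg_idx+self.len_segment]
        self.y_cur = self.hf['y_train'][self.cur_seg_idx:self.cur_seg_idx+self.len_segment]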
        
    def generate(self):
        while 1:
            idx = self.idx
            if idx >= self.len_segment:
                self.next_seg()
                idx = 0
            
            if idx + self.batch_size >= self.len_segment:
                batch_x = self.x_cur[idx:]
                batch_y = self.y_cur[idx:]
            else:
                batch_x = self.x_cur[idx:(idx + self.batch_size)]
                batch_y = self.y_cur[idx:(idx + self.batch_size)]
            self.idx = idx + self.batch_size
            yield batch_x, batch_y

with h5py.File('data.h5', 'r') as hf:
    data = hf['y_train'][:]

train_len = len(data)
batch_size = 1024
x_len = int(train_len / batch_size)
training_generator = DataGenerator(file_name, batch_size=batch_size).generate()

model.fit_generator(generator=training_generator, 
                    epochs=1,
                    steps_per_epoch=x_len, workers=1, 
                    use_multiprocessing=False, 
                    verbose=1)

It uses a generator: the large dataset, which cannot fit into memory as a whole, is split into 100 segments. One segment is loaded into memory at a time, and batches are generated from the current segment.

@eatsleepraverepeat

@Shawn-Shan, thanks a lot!

@plumdeq

plumdeq commented Nov 15, 2018

@Shawn-Shan, can we use it with multiple workers?

@dszhengyu

@Shawn-Shan, can we use it with multiple workers?

I think it should not be used with multiple workers.

@Shawn-Shan
Thanks for your solution!
Reading from HDF5 is extremely slow.
Before adopting your solution, training took about 200 s per epoch.
With your caching solution, it takes about 17 s per epoch.

For my use case (I use the Sequence interface), I also need to set shuffle=False explicitly.
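
For anyone who wants the same caching idea behind an index-based interface, here is a rough sketch shaped like the Sequence protocol (__len__ and __getitem__). It is an assumption-laden illustration, not the code used above: CachedBatchLoader and batches_per_segment are hypothetical names, the class deliberately does not import Keras, and x and y can be any sliceable objects (plain lists in the demo; h5py datasets should behave the same way for contiguous slices).

```python
import math


class CachedBatchLoader:
    """Hypothetical batch loader that keeps one segment of the data in memory."""

    def __init__(self, x, y, batch_size=1024, batches_per_segment=8):
        self.x, self.y = x, y
        self.n = len(y)
        self.batch_size = batch_size
        # A segment holds a whole number of batches, so no batch spans two segments.
        self.seg_len = batch_size * batches_per_segment
        self._seg = None
        self.loads = 0  # number of segment reads, useful to see cache hits

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(self.n / self.batch_size)

    def _load(self, seg):
        lo = seg * self.seg_len
        hi = min(lo + self.seg_len, self.n)
        # One contiguous read per segment.
        self._x_cur = self.x[lo:hi]
        self._y_cur = self.y[lo:hi]
        self._seg = seg
        self.loads += 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.batch_size
        seg = start // self.seg_len
        if seg != self._seg:
            self._load(seg)
        off = start - seg * self.seg_len
        end = off + self.batch_size
        return self._x_cur[off:end], self._y_cur[off:end]


loader = CachedBatchLoader(list(range(10)), list(range(10, 20)),
                           batch_size=3, batches_per_segment=2)
for i in range(len(loader)):
    print(loader[i])
print("segment loads:", loader.loads)
```

Batches requested in order reuse the cached segment, while a shuffled batch order would reload segments on nearly every request, which is one plausible reason shuffle=False matters here.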

@askielboe

Thanks for the generator tip @Shawn-Shan. That meant I could actually fit my 200 GB data!

Note that I had to change y_all = self.hf['y_train'][:] and data = hf['y_train'][:], since both load all the data into memory. It's much more efficient to just use the shape of the data, like so: nrows = self.hf["y_train"].shape[0], and then set self.total_len = nrows and train_len = nrows.
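
A small sketch of the shape-based row count, plus a step count for it. The helper names num_rows and steps_per_epoch are made up for illustration, and the only assumption about the dataset is that it exposes a .shape tuple, as h5py datasets do. Note that int(train_len / batch_size) rounds down and so leaves out a final partial batch; the ceiling below counts it, which may or may not be what a given generator needs.

```python
import math


def num_rows(dataset):
    """Row count taken from metadata; does not read the data itself."""
    return dataset.shape[0]


def steps_per_epoch(n_rows, batch_size):
    """Number of batches per pass, counting a final partial batch."""
    return math.ceil(n_rows / batch_size)


# Intended use with h5py (not executed here):
#     with h5py.File('data.h5', 'r') as hf:
#         train_len = num_rows(hf['y_train'])
#     x_len = steps_per_epoch(train_len, 1024)
```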
