-
-
Save jfsantos/e2ef822c744357a4ed16ec0c885100a3 to your computer and use it in GitHub Desktop.
from keras.models import Sequential | |
from keras.layers import Dense | |
from keras.utils.io_utils import HDF5Matrix | |
import numpy as np | |
def create_dataset(): | |
import h5py | |
X = np.random.randn(200,10).astype('float32') | |
y = np.random.randint(0, 2, size=(200,1)) | |
f = h5py.File('test.h5', 'w') | |
# Creating dataset to store features | |
X_dset = f.create_dataset('my_data', (200,10), dtype='f') | |
X_dset[:] = X | |
# Creating dataset to store labels | |
y_dset = f.create_dataset('my_labels', (200,1), dtype='i') | |
y_dset[:] = y | |
f.close() | |
create_dataset() | |
# Instantiating HDF5Matrix for the training set, which is a slice of the first 150 elements | |
X_train = HDF5Matrix('test.h5', 'my_data', start=0, end=150) | |
y_train = HDF5Matrix('test.h5', 'my_labels', start=0, end=150) | |
# Likewise for the test set | |
X_test = HDF5Matrix('test.h5', 'my_data', start=150, end=200) | |
y_test = HDF5Matrix('test.h5', 'my_labels', start=150, end=200) | |
# HDF5Matrix behave more or less like Numpy matrices with regards to indexing | |
print(y_train[10]) | |
# But they do not support negative indices, so don't try print(X_train[-1]) | |
model = Sequential() | |
model.add(Dense(64, input_shape=(10,), activation='relu')) | |
model.add(Dense(1, activation='sigmoid')) | |
model.compile(loss='binary_crossentropy', optimizer='sgd') | |
# Note: you have to use shuffle='batch' or False with HDF5Matrix | |
model.fit(X_train, y_train, batch_size=32, shuffle='batch') | |
model.evaluate(X_test, y_test, batch_size=32) |
@Shawn-Shan, thanks a lot!
@Shawn-Shan, can we use it with multiple workers?
@Shawn-Shan, can we use it with multiple workers?
I think it should not be used with multiple workers.
@Shawn-Shan
Thx, for your solution!
Reading from HDF5 is extremely slow.
Before I adopt your solution, it is like 200s per epoch for my training.
After I use your cache solution, it is like 17s per epoch.
And for my use case (I use the Sequence interface), I need to set Shuffle=False explicitly.
Thanks for the generator tip @Shawn-Shan. That meant I could actually fit my 200 GB data!
Note that I had to change y_all = self.hf['y_train'][:]
and data = hf['y_train'][:]
since it loads all data into memory. It's much more efficient to just use the shape of the data like so: nrows = self.hf["y_train"].shape[0]
and then set self.total_len = nrows
and train_len = nrows
.
HDF5Matrix is much slower when I read data batches by batches, or use a for loop. Here is a quick modification:
It uses a generator, and basically split the large dataset that couldn't fit into memory as a whole, and split into 100 segments, and generate on each segment.