- Keras can use either `threading` or `multiprocessing` for concurrent and parallel processing, respectively, of the data generator.
- In the `threading` approach (`model.fit_generator(..., pickle_safe=False)`), the generator is run concurrently (but not in parallel) in multiple threads, with each thread pulling the next available batch based on the shared state of the generator and placing it in a shared queue. However, the generator must be thread-safe, i.e. it must use locks at synchronization points (see the lock-wrapper sketch after this list).
- Due to the Python global interpreter lock (GIL), the threading option generally does not benefit from >1 worker (i.e. `model.fit_generator(..., nb_worker=1)` is best). One possible use case in which >1 threads could be beneficial is the presence of exceptionally long IO times, during which the GIL is released to enable concurrency. Note also that TensorFlow's `session.run(...)` method releases the GIL, thus allowing another thread to run in parallel to a training iteration. To achieve the best performance with this approach, the wall-clock time for generating a single batch with the data generator should be less than that of a single training iteration; otherwise, data generation becomes a bottleneck.
- In the `multiprocessing` approach (`model.fit_generator(..., pickle_safe=True)`), the generator can be copied and run in parallel by multiple processes. The key thing to realize here is that the multiprocessing approach will (by default) fork the current process for each worker process, so each process effectively starts with a copy of the generator. Each process then runs its own "copy" of the generator in parallel, with no synchronization. Thus, while any generator will run without error, this approach can (if one is not careful) result in the processes generating the exact same batches at essentially the same time, because a deterministic generator will be evaluated in the same manner in each process. The issue is that with `n` processes, the model may see the same batch for `n` consecutive steps, and an "epoch" may actually consist of `total_batches/n` unique batches rather than `total_batches`. To fix this, the generator can be reformulated to rely on NumPy random numbers for generating batches, as the `GeneratorEnqueuer` class will set a random seed with `np.random.seed()` for each process (see the generator sketch after this list).
- Due to the overhead of (de)serialization, the multiprocessing option generally only benefits from >1 worker (e.g. `model.fit_generator(..., nb_worker=8)`), and will generally result in much better performance than the threading option.
- In light of the above information, the `ImageDataGenerator` is "threadsafe", so it can be used with the "threading" approach above. However, it is not completely appropriate for the "multiprocessing" approach due to the issue described above, despite not throwing any errors. If an `ImageDataGenerator` generator is used with the multiprocessing approach, the first epoch will suffer from the problem of the same deterministic generator in each process, and thus the same batches will be produced at essentially the same time by each process. If shuffling is used, then the generators will diverge in subsequent epochs because the `Iterator` superclass randomly permutes the indices with `index_array = np.random.permutation(n)` at the start of each epoch, making use of the random seed set for each process.
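As a reference for the thread-safety requirement mentioned in the threading bullet above, here is a minimal sketch of the usual lock-wrapper pattern (the names `ThreadSafeIterator` and `threadsafe_generator` are illustrative, not part of Keras):

```python
import threading

class ThreadSafeIterator(object):
    """Wraps a generator so that calls to next() are serialized with a lock."""

    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

    next = __next__  # Python 2 compatibility, matching the era of this API


def threadsafe_generator(gen_fn):
    """Decorator that makes a generator function safe to share across the
    worker threads started by fit_generator(..., pickle_safe=False)."""
    def wrapper(*args, **kwargs):
        return ThreadSafeIterator(gen_fn(*args, **kwargs))
    return wrapper
```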
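And for the multiprocessing bullet, a minimal sketch of a generator reformulated around NumPy random numbers (assuming in-memory arrays `X` and `y`; the name `random_batch_generator` is just illustrative):

```python
import numpy as np

def random_batch_generator(X, y, batch_size=32):
    """Draws each batch with np.random, so that forked worker processes
    (which GeneratorEnqueuer reseeds via np.random.seed()) sample different
    batches instead of replaying identical ones."""
    n = X.shape[0]
    while True:
        # np.random is what diverges across processes after reseeding
        idx = np.random.randint(0, n, size=batch_size)
        yield X[idx], y[idx]
```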
Hi @mushan09, generally the performance will improve in the "threading" approach only if the IO times are long (during which the GIL is released) and there is useful computation that could be performed during that time on the next batch. In that case, using `model.fit_generator(..., pickle_safe=False, nb_worker=M)` where `M > 1` could be beneficial. If this is not the case, then the threading approach offers little benefit. For the "multiprocessing" approach, the performance will generally improve only if the cost of computation & IO is higher than the (de)serialization costs. Overall, if your data generation is cheap, then parallelization via either method is unlikely to be beneficial. In my experience, when I had an expensive data generation setup (loading images and then applying lots of manipulations), the multiprocessing approach was beneficial.
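For concreteness, the two modes are selected roughly as follows (a sketch assuming the Keras 1.x-era `fit_generator` signature used above; `model`, `train_gen`, and the numbers are placeholders):

```python
# Threading: one shared (thread-safe) generator, concurrent but not parallel.
# More than one worker rarely helps unless batch generation is IO-bound.
model.fit_generator(train_gen, samples_per_epoch=50000, nb_epoch=10,
                    pickle_safe=False, nb_worker=1)

# Multiprocessing: the generator is forked into each worker process.
# Only pays off when batch generation costs more than (de)serialization.
model.fit_generator(train_gen, samples_per_epoch=50000, nb_epoch=10,
                    pickle_safe=True, nb_worker=8)
```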
That being said, I might recommend replacing the `ImageDataGenerator` with a `tf.data.Dataset` approach (assuming you are using Keras with TensorFlow via the tf.keras API). If you can move everything into the TF API, performance is generally better (if nothing else, then at least due to minimizing data copies from Python to C++ land during `sess.run`). In that setup, parallelization can be achieved in multiple places, such as `dataset.map(..., num_parallel_calls=M)`. This file contains some examples using it, and this guide is nice if you need to get fancier. It still may not help much, though, if the data generation is very cheap.
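As a rough illustration (TF 1.x-style API to match the `sess.run` discussion above; the file paths, image size, and parallelism level are placeholder assumptions, not a recommendation):

```python
import tensorflow as tf

def _parse(filename, label):
    # Decoding and resizing run inside the TF runtime rather than in Python,
    # so num_parallel_calls below provides real parallelism.
    image = tf.image.decode_jpeg(tf.read_file(filename), channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, label

filenames = tf.constant(["img_0.jpg", "img_1.jpg"])  # placeholder paths
labels = tf.constant([0, 1])                         # placeholder labels

dataset = (tf.data.Dataset.from_tensor_slices((filenames, labels))
           .shuffle(buffer_size=2)
           .map(_parse, num_parallel_calls=4)  # parallel preprocessing
           .batch(32)
           .repeat()
           .prefetch(1))

# Recent tf.keras versions accept the dataset directly, e.g.
# model.fit(dataset, steps_per_epoch=..., epochs=...).
```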