Hi guys. I am training a DenseNet121 on a GeForce GTX 1080 Ti, but it is incredibly slow - it took 4 days to reach just epoch 15. Can you suggest any way to speed up the training?
The time it takes to complete an epoch depends on the size of your dataset. If you are also using augmentation to enlarge the dataset, each epoch will take longer still. If your augmented images are generated on the fly via fit_generator in Keras, the ‘steps_per_epoch’ argument tells Keras how many batches to draw from the generator in each epoch. This usually equals the number of training examples divided by the size of your mini-batch. For a really small training set it can be set several times higher, so that each epoch covers several augmented passes over the data.
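To make the arithmetic concrete, here is a minimal sketch of how steps_per_epoch is typically computed. The function name and the augmentation_factor parameter are hypothetical, introduced only for illustration; they are not part of Keras.

```python
import math

def steps_per_epoch(num_examples, batch_size, augmentation_factor=1):
    """Number of batches to draw per epoch.

    augmentation_factor is a hypothetical knob: values above 1 make each
    epoch cover several augmented passes over a small training set.
    """
    if num_examples <= 0 or batch_size <= 0 or augmentation_factor <= 0:
        raise ValueError("all arguments must be positive")
    # Round up so the last, partially filled batch is still counted.
    return math.ceil(num_examples * augmentation_factor / batch_size)

print(steps_per_epoch(1000, 32))     # 32 batches for one pass over 1000 images
print(steps_per_epoch(1000, 32, 3))  # 94 batches for roughly three augmented passes
```

The larger this number, the longer each epoch takes, so it is worth checking that it was not set much higher than intended.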
One thing to keep in mind is that since images are generated on the fly, for example using the ‘flow_from_directory’ generator, the generation itself may be slow. If so, it will become a bottleneck for the training and your GPU will not be fully utilized. In Keras, fit_generator has a ‘workers’ argument, which is 1 by default, meaning the generation process will likely run on a single core of the CPU. If you set ‘workers = 8’ or ‘workers = 12’, depending on how many virtual cores your CPU has, it may speed up the generation process. See an example below:
GPU load with workers = 1
GPU load with workers = 12
As an experiment, GPU load with workers = 12 and placing the dataset on an SSD
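For reference, passing the workers argument looks roughly like the sketch below. It assumes a Keras 2.x style model that still has fit_generator (newer Keras releases changed this API, so check your version). The wrapper name fit_with_parallel_loading is hypothetical; steps_per_epoch, epochs, workers, use_multiprocessing and max_queue_size are existing fit_generator arguments.

```python
def fit_with_parallel_loading(model, train_generator, steps_per_epoch,
                              epochs, workers=8):
    """Call fit_generator with several workers preparing batches in parallel.

    model is assumed to be a compiled Keras 2.x model and train_generator
    something like the object returned by flow_from_directory.
    """
    return model.fit_generator(
        train_generator,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        workers=workers,            # Keras default is 1
        use_multiprocessing=False,  # Keras default: workers run as threads
        max_queue_size=10,          # Keras default: batches buffered ahead of the GPU
    )
```

If the GPU load stays low even with more workers, the disk may be the next bottleneck, which is what the SSD experiment above was checking.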
Keep in mind that increasing the number of workers will improve the generation speed, but may also overwhelm the system. So if you want to run the training in the background while doing other stuff on your computer, keeping your CPU at 80-90% load (e.g. by using 10 rather than 12 workers) may be a better idea.
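One way to pick such a number automatically is sketched below. The helper pick_workers and its 0.85 target load are my own illustration of the 80-90% rule of thumb, not anything built into Keras.

```python
import math
import os

def pick_workers(logical_cores=None, target_load=0.85):
    """Suggest a worker count that leaves some CPU headroom.

    logical_cores defaults to what the OS reports; target_load is the
    fraction of cores to keep busy (hypothetical default of 0.85).
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1  # cpu_count() can return None
    # Always use at least one worker, even on a single-core machine.
    return max(1, math.floor(logical_cores * target_load))

print(pick_workers(12))  # 10
```

This is only a starting point; watching the actual CPU and GPU load during training is the real test.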
Thank you for your detailed explanation! That is a good point. My training now runs 4 times faster, that’s great!