Issue
Suppose I have 8 GPUs on a server (numbered 0 to 7).
When I train a simple (and small) model on GPU #0 alone, an epoch takes about 20 minutes. However, when I run 5 or 6 or more models at once, for example 2 experiments per GPU on GPUs #0 to #2 (6 in total), the training time per epoch explodes to about 1 hour.
When I train 2 models per GPU on all 8 GPUs (16 experiments in total), it takes about 3 hours to complete an epoch.
CPU utilization looks fine, but GPU utilization drops.
What is the reason for the drop, and how can I solve the problem?
Solution
There are basically two ways of using multi-GPUs for deep learning:
- Use torch.nn.DataParallel(module) (DP)
This approach is discouraged by the official documentation because it replicates the entire module on every GPU at each forward pass, and the replicas are destroyed at the end of the pass. With big models, this repeated replication can become an important bottleneck and can even make training slower than on a single GPU, for instance when you freeze a large part of a big module for fine-tuning. That's why you may consider using:
- Use torch.nn.parallel.DistributedDataParallel(module, device_ids=) (DDP, see the documentation)
This approach usually requires refactoring your code a little more, but it is more efficient because it copies the model to each GPU only once, at the beginning of training. The replicas persist over time, and the gradients are synchronized after each backward pass via hooks. To go further, you can distribute the data and the optimizer as well to avoid data transfers. You can do this simply (along with the parallelized modules) using torch-ignite/distributed; a minimal DDP sketch is shown after this list.
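To make the DDP path concrete, here is a minimal single-node sketch that spawns one process per GPU. The tiny linear model, the random TensorDataset, and the master address/port values are hypothetical placeholders, not part of the original answer; treat this as an illustration of the pattern rather than a drop-in script.

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train(rank, world_size):
    # One process per GPU; on a single node the rank doubles as the GPU index.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"   # illustrative free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy model and data standing in for your own (placeholders).
    model = nn.Linear(10, 2).to(rank)
    model = DDP(model, device_ids=[rank])          # model is copied to each GPU once
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

    # DistributedSampler gives each process its own shard, so no batch is duplicated.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)                   # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()                        # gradients synchronized here via DDP hooks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)

Each process talks only to its own GPU and exchanges gradients through NCCL, which is why DDP avoids the per-forward-pass replication that makes DP a bottleneck for large models.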
I don't know which method you tried, but I encourage you to use DDP instead of DP if you are currently using the latter.
Answered By - Valentin Goldité