Issue
I have twenty-one ANN training processes to run.
I have four computers available. Three of them are equipped with GPUs (two GPUs each).
What would be the best way to distribute these processes so that all of the training finishes in the least amount of time?
Solution
It's impossible to answer this question definitively. Directionally, some of the things you need to think about are:
Have you tuned your training setup so that it is bottlenecked by the GPU rather than the CPU? This is a best practice for DL training (a tf.data sketch follows this list).
Do you have a well-tuned set of training hyperparameters? With proper tuning, you can reduce training time dramatically - like orders of magnitude. (It's possible to get SOTA performance on CIFAR10 in 34 seconds of training on a V100; if you scratch around on your own for a few days, it might instead take 30 minutes of training to reach SOTA accuracy.) A tuned learning-rate schedule, sketched after this list, is a typical ingredient.
Do you have the ability to scale the batch size? The "Training BERT in 76 minutes" paper speaks to a couple of techniques, from learning-rate warmup to a new optimizer, LAMB, that's basically LARS + Adam (a rough sketch follows this list).
Do you have experience scaling to multiple GPUs on a single machine? It's generally not that hard in TF2/Keras (see the MirroredStrategy sketch below). Do you have experience scaling to multiple machines? I haven't done it, but I assume it's a little tougher.
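On the first point, here is a minimal sketch of a tf.data input pipeline tuned to keep the GPU fed. The data, preprocessing, and batch size are placeholder assumptions, not part of the original answer.

    import tensorflow as tf

    def preprocess(image, label):
        # Stand-in for real CPU-side preprocessing (decoding, augmentation, ...).
        return tf.cast(image, tf.float32) / 255.0, label

    # Synthetic placeholder data in place of a real dataset.
    images = tf.random.uniform((1024, 32, 32, 3), maxval=256, dtype=tf.int32)
    labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

    ds = (
        tf.data.Dataset.from_tensor_slices((images, labels))
        .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU preprocessing
        .batch(256)
        .prefetch(tf.data.AUTOTUNE)  # overlap input preparation with GPU compute
    )
    # While training on `ds`, watch nvidia-smi: sustained high GPU utilization suggests
    # the GPU is the bottleneck; low utilization points back at the input pipeline.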
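On hyperparameter tuning, a fairly aggressive, well-shaped learning-rate schedule is a typical ingredient of fast-training recipes. A minimal sketch, with illustrative numbers that you would need to tune per model:

    import tensorflow as tf

    steps_per_epoch = 200   # assumption: depends on dataset size and batch size
    epochs = 15             # assumption: fast recipes often use few, intense epochs

    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=0.4,             # aggressive peak LR; tune per model
        decay_steps=steps_per_epoch * epochs,  # decay over the whole run
    )
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9, nesterov=True)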
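On large-batch training, here is a minimal sketch of linear warmup feeding a LAMB optimizer. It assumes the tensorflow_addons package is available; the peak learning rate and warmup length are illustrative, not taken from the paper.

    import tensorflow as tf
    import tensorflow_addons as tfa  # assumption: tensorflow_addons is installed

    class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
        """Ramp the learning rate linearly for `warmup_steps`, then hold it at `peak_lr`."""
        def __init__(self, peak_lr, warmup_steps):
            self.peak_lr = peak_lr
            self.warmup_steps = warmup_steps

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            warmup = tf.cast(self.warmup_steps, tf.float32)
            return self.peak_lr * tf.minimum(1.0, (step + 1.0) / warmup)

        def get_config(self):
            return {"peak_lr": self.peak_lr, "warmup_steps": self.warmup_steps}

    # Large global batches usually pair a scaled-up learning rate with warmup.
    optimizer = tfa.optimizers.LAMB(learning_rate=LinearWarmup(peak_lr=0.01, warmup_steps=1000))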
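On single-machine multi-GPU scaling, a minimal TF2/Keras sketch using tf.distribute.MirroredStrategy with a placeholder model. Multi-machine training would instead use tf.distribute.MultiWorkerMirroredStrategy plus a TF_CONFIG environment variable on each node.

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()   # uses all GPUs visible on this machine
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # Build and compile inside the scope so variables are mirrored across GPUs.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    # model.fit(...) then splits each global batch across the replicas.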
TL;DR: if you don't have a lot of confidence in all of the above, you are directionally better off running 7 models in parallel (a per-GPU launcher sketch appears below).
If you have a lot of confidence in the above, you can try to scale to all 7 GPUs.
If you are in between, you can try running 4 training jobs in parallel, one per machine, using both GPUs on each of the three dual-GPU machines.
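If you go with independent jobs in parallel, here is a minimal launcher sketch that pins each training process to a single GPU via CUDA_VISIBLE_DEVICES. The train.py script, its --model flag, and the model names are hypothetical placeholders.

    import os
    import subprocess

    # One entry per job on this machine: (hypothetical model name, GPU id).
    jobs = [
        ("model_a", "0"),
        ("model_b", "1"),
    ]

    procs = []
    for model_name, gpu_id in jobs:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)  # each process sees only one GPU
        procs.append(subprocess.Popen(["python", "train.py", "--model", model_name], env=env))

    for p in procs:
        p.wait()   # block until every job on this machine has finished

Run one copy of this launcher per machine, with the jobs list adjusted to the GPUs (or CPU) available there.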
Answered By - Yaoshiang