Issue
I have the following tensors and network parameters defined:
import torch
from torch.nn import Parameter

N = 1024  # example size; any value works here
X = torch.ones([N, N], dtype=torch.float32).to('cuda:0')
A = Parameter(torch.ones([N, N], dtype=torch.float32)).to('cuda:0')
B = Parameter(torch.ones([N, N], dtype=torch.float32)).to('cuda:1')
Can I have a computation like this:
C = (X @ A).to('cuda:1')
Y = C @ B
In this example, I want to do the computation X @ A on GPU 0 and C @ B on GPU 1. Will there be any problem?
It runs with no error. However, I would like to ask whether this is a common practice for distributing a model across multiple devices.
Solution
The most popular way of parallelizing computation across multiple GPUs is data parallelism (DP), where the model is copied onto each device and the batch is split so that each part runs on a different device. The main class to use for this is DistributedDataParallel.
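As a minimal sketch of that approach (the linear model, sizes, and the assumption that the script is launched with torchrun --nproc_per_node=2 so one process runs per GPU are illustrative, not part of the question):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend='nccl')  # torchrun provides RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)  # full model copy on each GPU
    model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced automatically

    x = torch.ones(32, 1024, device=local_rank)          # each rank processes its own slice of the batch
    loss = model(x).sum()
    loss.backward()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()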
The way you described is called "model sharding" and consists of dividing the architecture (more precisely, the computational graph) across multiple devices while keeping the full batch. This method is interesting when training big models, especially because the optimizer states, which represent a huge part of the required memory (two or three values per parameter), are also divided across GPUs, unlike in the DP method. You can have a look at the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models for an example of its use in Large Language Models.
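A minimal sketch of the manual two-device split from the question, wrapped in an nn.Module (the class name and sizes are illustrative; creating each parameter directly on its target device avoids an extra copy and keeps it a leaf tensor, which is what optimizers expect):

import torch
from torch import nn

class TwoDeviceModel(nn.Module):
    def __init__(self, n):
        super().__init__()
        # Each parameter lives on its own GPU from the start.
        self.A = nn.Parameter(torch.ones(n, n, device='cuda:0'))
        self.B = nn.Parameter(torch.ones(n, n, device='cuda:1'))

    def forward(self, x):
        c = (x @ self.A).to('cuda:1')  # first matmul on GPU 0, then move the activation
        return c @ self.B              # second matmul on GPU 1

model = TwoDeviceModel(1024)
x = torch.ones(1024, 1024, device='cuda:0')
y = model(x)  # autograd also handles the cross-device move in the backward pass

This manual split works for a couple of devices; the FSDP wrapper mentioned below automates the sharding for larger models.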
In PyTorch, the class that implements this kind of sharding is FullyShardedDataParallel. I recommend reading the dedicated PyTorch blog post, https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api, and the documentation for the (numerous) edge cases.
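A minimal FSDP sketch, again assuming a torchrun --nproc_per_node=2 launch; the Sequential model and sizes are placeholders, not a prescribed setup:

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    ).to(local_rank)

    # Parameters, gradients and optimizer states are sharded across ranks;
    # full parameters are gathered on the fly for each forward/backward pass.
    model = FSDP(model)

    # The optimizer must be created after wrapping, so it sees the sharded parameters.
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.ones(32, 1024, device=local_rank)
    loss = model(x).sum()
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()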
Answered By - Valentin Goldité