Issue
My server has two GPUs. How can I use both GPUs for training at the same time to maximize their computing power? Is my code below correct? Does it allow my model to be properly trained?
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.bert = pretrained_model  # a preloaded transformer encoder, defined elsewhere
        # for param in self.bert.parameters():
        #     param.requires_grad = True
        self.linear = nn.Linear(2048, 4)

    # def forward(self, input_ids, token_type_ids, attention_mask):
    def forward(self, input_ids, attention_mask):
        batch = input_ids.size(0)
        # output = self.bert(input_ids, token_type_ids, attention_mask).pooler_output
        output = self.bert(input_ids, attention_mask).last_hidden_state
        print('last_hidden_state', output.shape)  # torch.Size([1, 768])
        # output = output.view(batch, -1)
        output = output[:, -1, :]  # (batch_size, hidden_size)
        output = self.linear(output)
        return output

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Use", torch.cuda.device_count(), 'gpus')
model = MyModel()
model = nn.DataParallel(model)
model = model.to(device)
Solution
There are two different ways to train on multiple GPUs:
- Data Parallelism = splitting a large batch that can't fit into a single GPU's memory across multiple GPUs, so every GPU processes a small batch that fits in its memory (see the sketch after this list).
- Model Parallelism = splitting the layers of the model across different devices; this is a bit trickier to manage and deal with.
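For reference, here is a minimal sketch of the Data Parallelism route the question is attempting, using nn.DataParallel; the toy model and tensor sizes are placeholders standing in for the question's BERT classifier:

import torch
import torch.nn as nn

# Toy model standing in for the question's BERT-based classifier
model = nn.Linear(768, 4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the model on every visible GPU,
    # splits each input batch along dim 0, and gathers the outputs
    # back onto the default device.
    model = nn.DataParallel(model)
model = model.to(device)

x = torch.randn(32, 768).to(device)  # the batch of 32 is split across the GPUs
out = model(x)                       # shape: (32, 4)

Note that nn.DataParallel runs in a single process and is generally slower than DistributedDataParallel, which is what the pure-PyTorch route below uses.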
Please refer to this post for more information
To do Data Parallelism in pure PyTorch, please refer to this example that I created a while back and updated to the latest changes of PyTorch (as of today, 1.12).
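The linked example isn't reproduced here, but the core pattern looks roughly like the following sketch using DistributedDataParallel, with one process per GPU; the model, optimizer, and data below are placeholders:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Each process drives one GPU; NCCL handles the gradient all-reduce
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(768, 4).to(rank)   # placeholder model
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(16, 768).to(rank)    # each process trains on its own data shard
    optimizer.zero_grad()
    loss = model(x).sum()                # placeholder loss
    loss.backward()                      # gradients are averaged across GPUs here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)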
To do multi-GPU training with another library, without engineering everything yourself, I would suggest using PyTorch Lightning, as it has a straightforward API and good documentation on multi-GPU training with Data Parallelism.
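As a rough illustration (not taken from the Lightning docs), a two-GPU setup can be as small as this; the module body is a placeholder:

import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(768, 4)  # placeholder for the BERT classifier

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Lightning spawns the processes, places the model, and syncs gradients
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(LitClassifier(), train_dataloaders=...)  # supply your DataLoader here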
Update: 2022/10/25
Here is a video explaining in much detail the different types of distributed training: https://youtu.be/BPYOsDCZbno?t=1011
Answered By - Mazen