Issue
I am trying to train a BertPunc model on the train2012 data used in the repository https://github.com/nkrnrnk/BertPunc. When running on the server with 4 GPUs enabled, I get the error below:
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/notebooks/model_BertPunc.py", line 16, in forward
x = self.bert(x)
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py", line 861, in forward
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask,
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py", line 727, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
From https://github.com/huggingface/transformers/issues/8145, this appears to happen when data gets moved back and forth between multiple GPUs.
As per https://github.com/interpretml/interpret-text/issues/117, the suggestion is to downgrade PyTorch from 1.7 (which I currently use) to 1.4. Downgrading isn't an option for me, as I have other scripts that rely on Torch 1.7. What should I do to overcome this error?
I can't put the whole code here as there are too many lines, but here is the snippet that gives me the error:
bert_punc, optimizer, best_val_loss = train(bert_punc, optimizer, criterion, epochs_top,
data_loader_train, data_loader_valid, save_path, punctuation_enc, iterations_top, best_val_loss=1e9)
Here is my DataParallel code:
bert_punc = nn.DataParallel(BertPunc(segment_size, output_size, dropout)).cuda()
I tried changing the DataParallel line to route training to only 1 of the 4 GPUs, but that gave me a memory (space) issue, so I had to revert the code back to the default.
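For reference, the single-GPU attempt looked roughly like this (device_ids=[0] is only an example index, not the exact code I ran):

# Hypothetical sketch of restricting DataParallel to a single GPU
bert_punc = nn.DataParallel(BertPunc(segment_size, output_size, dropout), device_ids=[0]).cuda()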
Here is the link to all the scripts I am using: https://github.com/nkrnrnk/BertPunc. Please advise.
Solution
In pytorch_pretrained_bert/modeling.py (line 727 in the traceback above), change
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
to
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility
For more details, see https://github.com/vid-koci/bert-commonsense/issues/6
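Why this change works: under nn.DataParallel on PyTorch 1.5 and later, the module replicas created for GPUs other than the first no longer expose their weights through .parameters(), so next(self.parameters()) hits an empty iterator and raises StopIteration inside the replica's forward. The following minimal sketch (my own illustration, not code from the BertPunc repo; it needs at least two GPUs) reproduces the failure:

import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # On the replicas DataParallel creates for devices other than the first,
        # parameters() yields nothing on PyTorch >= 1.5, so next() raises
        # StopIteration -- the same failure as in modeling.py line 727.
        dtype = next(self.parameters()).dtype
        return self.linear(x).to(dtype)

model = nn.DataParallel(Probe()).cuda()
out = model(torch.randn(8, 4).cuda())  # StopIteration raised in a non-first replica

Hard-coding torch.float32 sidesteps the parameter lookup entirely. Note that the original line existed for fp16 compatibility, so if you later train in half precision you may need a different workaround, such as passing the intended dtype in explicitly.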
Answered By - xiaoou wang