Issue
I'm trying to fine-tune ReformerModelWithLMHead (google/reformer-enwik8) for NER. I padded each batch to the length of its longest string, as in the encode method from the model card (max_length = max([len(string) for string in list_of_strings])), and passed the corresponding attention_masks. With that I got this error:
ValueError: If training, make sure that config.axial_pos_shape factors: (128, 512) multiply to sequence length. Got prod((128, 512)) != sequence_length: 2248. You might want to consider padding your sequence length to 65536 or changing config.axial_pos_shape.
- When I instead padded the sequences to length 65536, my Colab session crashed because every input became 65536 tokens long.
- As for the second option (changing config.axial_pos_shape), I have not been able to change it.
I would like to know: is there any way to change config.axial_pos_shape while fine-tuning the model? Or am I missing something in how the input strings should be encoded for reformer-enwik8?
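For reference, the character-level encode helper I am using is paraphrased below from the model card, so treat it as a sketch rather than the exact code:

import torch

def encode(list_of_strings, pad_token_id=0):
    # pad every string in the batch to the length of the longest string
    max_length = max([len(string) for string in list_of_strings])

    # create padded input_ids and attention_masks
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # the model works on raw bytes; each byte (offset by 2 for the special tokens) is one token
        if not isinstance(string, bytes):
            string = str.encode(string)
        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks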
Thanks!
Question Update: I have tried the following methods:
- Passing parameters at the time of model instantiation:
model = transformers.ReformerModelWithLMHead.from_pretrained(
    "google/reformer-enwik8",
    num_labels=9,
    max_position_embeddings=1024,
    axial_pos_shape=[16, 64],
    axial_pos_embds_dim=[32, 96],
    hidden_size=128,
)
It gives me the following error:
RuntimeError: Error(s) in loading state_dict for ReformerModelWithLMHead:
    size mismatch for reformer.embeddings.word_embeddings.weight: copying a param with shape torch.Size([258, 1024]) from checkpoint, the shape in current model is torch.Size([258, 128]).
    size mismatch for reformer.embeddings.position_embeddings.weights.0: copying a param with shape torch.Size([128, 1, 256]) from checkpoint, the shape in current model is torch.Size([16, 1, 32]).
(The full error is quite long; these are only the first of the reported size mismatches.)
- Then I tried this code to update the config:
model1 = transformers.ReformerModelWithLMHead.from_pretrained('google/reformer-enwik8', num_labels=9)

# Reshape the axial position embeddings layer to match the desired max seq length
model1.reformer.embeddings.position_embeddings.weights[1] = torch.nn.Parameter(
    model1.reformer.embeddings.position_embeddings.weights[1][0][:128]
)

# Update the config to match the custom max seq length
model1.config.axial_pos_shape = 16, 128
model1.config.max_position_embeddings = 16 * 128  # 2048
model1.config.axial_pos_embds_dim = 32, 96
model1.config.hidden_size = 128

output_model_path = "model"
model1.save_pretrained(output_model_path)
With this implementation, I get this error:
RuntimeError: The expanded size of the tensor (512) must match the existing size (128) at non-singleton dimension 2. Target sizes: [1, 128, 512, 768]. Tensor sizes: [128, 768]
This is because the updated sizes/shapes do not match the original config parameters of the pretrained model, which are:
axial_pos_shape = 128, 512
max_position_embeddings = 128 * 512  # 65536
axial_pos_embds_dim = 256, 768
hidden_size = 1024
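As far as I understand, the constraints are that prod(axial_pos_shape) must equal max_position_embeddings (and the padded training sequence length) and that sum(axial_pos_embds_dim) must equal hidden_size. A quick sanity check of the pretrained values against the values I tried (my own reasoning, so please correct me if this is wrong):

# pretrained google/reformer-enwik8 values
assert 128 * 512 == 65536   # prod(axial_pos_shape) == max_position_embeddings
assert 256 + 768 == 1024    # sum(axial_pos_embds_dim) == hidden_size

# the values I tried
assert 16 * 128 == 2048     # new prod(axial_pos_shape) == new max_position_embeddings
assert 32 + 96 == 128       # sums to the new hidden_size of 128, but the checkpoint
                            # was trained with hidden_size 1024, hence the size mismatches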
Is this the right way to change the config parameters, or do I have to do something else?
Is there any example where the ReformerModelWithLMHead ('google/reformer-enwik8') model has been fine-tuned?
My main code implementation is as follows:
class REFORMER(torch.nn.Module):
    def __init__(self):
        super(REFORMER, self).__init__()
        self.l1 = transformers.ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8", num_labels=9)

    def forward(self, input_ids, attention_masks, labels):
        output_1 = self.l1(input_ids, attention_masks, labels=labels)
        return output_1

model = REFORMER()

def train(epoch):
    model.train()
    for _, data in enumerate(training_loader, 0):
        ids = data['input_ids'][0]  # input_ids from the encode method on the model card: https://huggingface.co/google/reformer-enwik8
        input_shape = ids.size()
        targets = data['tags']
        print("tags: ", targets, targets.size())

        least_common_mult_chunk_length = 65536
        padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length

        # pad input
        input_ids, inputs_embeds, attention_mask, position_ids, input_shape = _pad_to_mult_of_chunk_length(
            self=model.l1,
            input_ids=ids,
            inputs_embeds=None,
            attention_mask=None,
            position_ids=None,
            input_shape=input_shape,
            padding_length=padding_length,
            padded_seq_length=None,
            device=None,
        )

        outputs = model(input_ids, attention_mask, labels=targets)  # sending inputs to the forward method
        print(outputs)
        loss = outputs.loss
        logits = outputs.logits
        if _ % 500 == 0:
            print(f'Epoch: {epoch}, Loss: {loss}')

for epoch in range(1):
    train(epoch)
Solution
First of all, you should note that google/reformer-enwik8 is not a properly trained language model, so you will probably not get decent results from fine-tuning it. enwik8 is a compression challenge, and the Reformer authors used this dataset for exactly that purpose:
To verify that the Reformer can indeed fit large models on a single core and train fast on long sequences, we train up to 20-layer big Reformers on enwik8 and imagenet64...
This is also the reason why they did not train a sub-word tokenizer and operate at the character level.
You should also note that the LMHead is normally used for predicting the next token of a sequence (causal language modeling). For NER you probably want a token classification head instead, i.e. use the encoder ReformerModel and add a linear layer with 9 classes on top (plus perhaps a dropout layer).
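A minimal sketch of such a token classification setup is shown below. This is an illustration, not an existing transformers class; the class name, the label count of 9 and the dropout probability are assumptions:

import torch
from transformers import ReformerConfig, ReformerModel

class ReformerForNER(torch.nn.Module):
    # encoder ReformerModel + dropout + linear head over every character position
    def __init__(self, num_labels=9):
        super().__init__()
        conf = ReformerConfig.from_pretrained("google/reformer-enwik8")
        conf.axial_pos_embds = False  # see the position embedding tip below
        conf.num_hashes = 2           # see the num_hashes tip below
        self.encoder = ReformerModel.from_pretrained("google/reformer-enwik8", config=conf)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(conf.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, labels=None):
        # input_ids and labels must be padded to the same length
        # (a multiple of the model's chunk length when training)
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)
        loss = None
        if labels is not None:
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
            )
        return loss, logits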
Anyway, in case you still want to try it, you can do the following to reduce the memory footprint of the google/reformer-enwik8 reformer:
- Reduce the number of hashes during training:
from transformers import ReformerConfig, ReformerModel

conf = ReformerConfig.from_pretrained('google/reformer-enwik8')
conf.num_hashes = 2  # or maybe even 1
model = ReformerModel.from_pretrained("google/reformer-enwik8", config=conf)
After you have fine-tuned your model, you can increase the number of hashes again to improve performance (compare Table 2 of the Reformer paper); see the combined sketch after this list.
- Replace axial-position embeddings:
from transformers import ReformerConfig, ReformerModel

conf = ReformerConfig.from_pretrained('google/reformer-enwik8')
conf.axial_pos_embds = False
model = ReformerModel.from_pretrained("google/reformer-enwik8", config=conf)
This replaces the learned axial position embeddings with BERT-style learnable position embeddings, which do not require the full sequence length of 65536. Note that they are untrained and randomly initialized (i.e. plan for a longer training).
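Putting both tips together, here is a rough sketch of loading the reduced-memory configuration for fine-tuning and later reloading the fine-tuned weights with more hashes for evaluation. The directory name and the evaluation hash count are illustrative choices, not fixed values:

from transformers import ReformerConfig, ReformerModel

# fine-tuning configuration: fewer hashing rounds and BERT-style position embeddings
conf = ReformerConfig.from_pretrained("google/reformer-enwik8")
conf.num_hashes = 2
conf.axial_pos_embds = False
model = ReformerModel.from_pretrained("google/reformer-enwik8", config=conf)

# ... fine-tune the encoder (plus your classification head), then save it ...
model.save_pretrained("reformer-enwik8-ner")

# evaluation configuration: raise the number of hashing rounds again
eval_conf = ReformerConfig.from_pretrained("reformer-enwik8-ner")
eval_conf.num_hashes = 8
eval_model = ReformerModel.from_pretrained("reformer-enwik8-ner", config=eval_conf)
eval_model.eval()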
Answered By - cronoik