Issue
I am confused about how to use Hugging Face BERT models so that they yield an output of a fixed shape, regardless of input size (i.e., input string length).
I tried calling the tokenizer with the parameters padding=True, truncation=True, max_length=15, but the output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I missing here?
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*20000]
for input in inputs:
    inputs = tokenizer(input, padding=True, truncation=True, max_length=15, return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, input, len(input))
Output:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 3, 768]) a 1
torch.Size([1, 12, 768]) aaaaaaaaaaaaaaaaaaaa 20
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
torch.Size([1, 3, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcededeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeab....deabbcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 120000
Solution
When you call the tokenizer on a single sentence with padding=True, truncation=True, max_length=15, it pads the output sequence to the longest sequence in the batch and truncates when required. Since you are providing only one sentence at a time, there is nothing to pad against: each sentence is already the longest sequence of its own batch. That means you can achieve what you want in two ways:
- Provide a batch:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*200]
inputs = tokenizer(inputs, padding=True, truncation=True, max_length = 15, return_tensors="pt")
print(inputs["input_ids"])
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Output:
tensor([[ 101, 1037, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0],
[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
2050, 102, 0, 0, 0],
[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
11057, 11057, 11057, 11057, 102],
[ 101, 100, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]])
torch.Size([4, 15, 768])
- Set padding="max_length":
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*200]
for i in inputs:
    inputs = tokenizer(i, padding='max_length', truncation=True, max_length=15, return_tensors="pt")
    print(inputs["input_ids"])
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, i, len(i))
Output:
tensor([[ 101, 1037, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0]])
torch.Size([1, 15, 768]) a 1
tensor([[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
2050, 102, 0, 0, 0]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaa 20
tensor([[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
11057, 11057, 11057, 11057, 102]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
tensor([[101, 100, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0]])
torch.Size([1, 15, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 1200
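Note that in both cases the model still returns one 768-dimensional vector per token position (shape [batch, 15, 768]). If what you are ultimately after is a single fixed-size vector per input string, you can additionally pool the padded output. The sketch below uses mean pooling over the attention mask; this pooling choice is just one common option and is not part of the original answer:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
batch = tokenizer(inputs, padding="max_length", truncation=True, max_length=15, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# Average only over real tokens: the attention mask is 0 at padding positions.
mask = batch["attention_mask"].unsqueeze(-1)              # [4, 15, 1]
summed = (outputs.last_hidden_state * mask).sum(dim=1)    # [4, 768]
mean_pooled = summed / mask.sum(dim=1)                    # [4, 768]
print(mean_pooled.shape)  # torch.Size([4, 768]), fixed regardless of input length
Alternatively, outputs.last_hidden_state[:, 0, :] (the [CLS] token embedding) also gives one fixed-size vector per input.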
Answered By - cronoik