Issue
I am trying to do named entity recognition using a sequence-to-sequence model. My output consists of simple IOB tags, so I only want to predict probabilities for 3 labels per token (I, O, B).
I am trying an EncoderDecoderModel using the HuggingFace implementation, with a DistilBert as my encoder and a BertForTokenClassification as my decoder.
First, I import my encoder and decoder:
from transformers import AutoModelForSequenceClassification, BertForTokenClassification

encoder = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
encoder.save_pretrained("Encoder")

decoder = BertForTokenClassification.from_pretrained("bert-base-uncased",
                                                     num_labels=3,
                                                     output_hidden_states=False,
                                                     output_attentions=False)
decoder.save_pretrained("Decoder")
decoder
When I check my decoder model as shown, I can clearly see the linear classification layer that has out_features=3:
## sample of output:
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=3, bias=True)
)
However, when I combine the two models in an EncoderDecoderModel, it seems that the decoder is converted into a different kind of classifier, now with out_features equal to the size of my vocabulary:
from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("./Encoder", "./Decoder")
bert2bert
## sample of output:
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=768, out_features=30522, bias=True)
)
)
Why is that? And how can I keep out_features = 3 in my model?
Solution
Huggingface uses different heads for its models, depending on the network and the task. While part of these models is the same (such as the contextualized encoder modules), they differ in the last layer, which is the head itself.
For example, for classification problems they use the XForSequenceClassification heads, where X is the name of the language model, such as Bert, Bart, and so forth.
That being said, the EncoderDecoderModel uses a language-modeling head, whereas the decoder you stored uses a classification head. Because of this discrepancy, EncoderDecoderModel uses its own LM head, which is a linear layer mapping the 768-dimensional hidden states to 30522 outputs, the size of the vocabulary.
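A quick sanity check (just a sketch, assuming the "Encoder" and "Decoder" directories saved above) makes this visible: from_encoder_decoder_pretrained reloads the saved decoder through the causal-language-model auto class, so the token-classification head is not kept:

from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("./Encoder", "./Decoder")
# The decoder comes back as an LM-head model (e.g. BertLMHeadModel),
# not as the BertForTokenClassification that was saved
print(type(bert2bert.decoder).__name__)
print(bert2bert.decoder.config.vocab_size)  # 30522, the out_features of the LM head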
To circumvent this issue, you can use the vanilla BertModel class to output the hidden representations and then add your own linear layer for the classification, which takes BERT's 768-dimensional embeddings and maps them to an output vector of size 3, the number of your labels.
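For illustration, here is a minimal sketch of that approach (the model name, dropout value, and example sentence are placeholders; and since the goal here is per-token IOB tagging, the linear layer is applied to every token's hidden state rather than only to the [CLS] embedding):

import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertIOBTagger(nn.Module):
    def __init__(self, num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        # 768-dimensional hidden states -> 3 IOB labels
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(outputs.last_hidden_state)  # (batch, seq_len, 768)
        return self.classifier(hidden)                    # (batch, seq_len, 3)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertIOBTagger()
enc = tokenizer("John lives in Berlin", return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])
print(logits.shape)  # torch.Size([1, seq_len, 3])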
Answered By - inverted_index