Issue
I am trying to do named entity recognition using a sequence-to-sequence model. My output consists of simple IOB tags, so I only want to predict probabilities for 3 labels per token (I, O, B).
I am trying an EncoderDecoderModel using the HuggingFace implementation, with a DistilBert as my encoder and a BertForTokenClassification as my decoder.
First, I import my encoder and decoder:
from transformers import AutoModelForSequenceClassification, BertForTokenClassification

encoder = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
encoder.save_pretrained("Encoder")

decoder = BertForTokenClassification.from_pretrained("bert-base-uncased",
                                                     num_labels=3,
                                                     output_hidden_states=False,
                                                     output_attentions=False)
decoder.save_pretrained("Decoder")
decoder
When I check my decoder model as shown, I can clearly see the linear classification layer that has out_features=3:
## sample of output:
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=3, bias=True)
)
However, when I combine the two models in an EncoderDecoderModel, it seems that the decoder is converted into a different kind of classifier, now with out_features equal to the size of my vocabulary:
from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("./Encoder", "./Decoder")
bert2bert
## sample of output:
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=768, out_features=30522, bias=True)
)
)
Why is that? And how can I keep out_features = 3 in my model?
Solution
Huggingface uses different heads for its models, depending on the network and the task. While part of these models is the same (such as the contextualized encoder modules), they differ in the last layer, which is the head itself.
For example, for classification problems they use the XForSequenceClassification heads, where X is the name of the language model, such as Bert, Bart, and so forth.
That being said, the EncoderDecoderModel uses a language-modeling head, whereas the decoder you stored uses a classification head. Because of this discrepancy, EncoderDecoderModel uses its own LM head, which is a linear layer mapping the 768-dimensional hidden states to 30522 outputs, the size of the vocabulary.
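A quick sanity check (just a sketch, assuming the "Encoder" and "Decoder" directories saved above) makes this visible: from_encoder_decoder_pretrained reloads the saved decoder through the causal-language-model auto class, so the token-classification head is not kept:

from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("./Encoder", "./Decoder")
# The decoder comes back as an LM-head model (e.g. BertLMHeadModel),
# not as the BertForTokenClassification that was saved
print(type(bert2bert.decoder).__name__)
print(bert2bert.decoder.config.vocab_size)  # 30522, the out_features of the LM head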
To circumvent this issue, you can use the vanilla BertModel class to output the hidden representations and then add your own linear layer for the classification, which takes BERT's 768-dimensional embeddings and maps them to an output vector of size 3, the number of your labels.
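For illustration, here is a minimal sketch of that approach (the model name, dropout value, and example sentence are placeholders; and since the goal here is per-token IOB tagging, the linear layer is applied to every token's hidden state rather than only to the [CLS] embedding):

import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertIOBTagger(nn.Module):
    def __init__(self, num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        # 768-dimensional hidden states -> 3 IOB labels
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(outputs.last_hidden_state)  # (batch, seq_len, 768)
        return self.classifier(hidden)                    # (batch, seq_len, 3)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertIOBTagger()
enc = tokenizer("John lives in Berlin", return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])
print(logits.shape)  # torch.Size([1, seq_len, 3])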
Answered By - inverted_index