Issue
I'm using the Hugging Face Transformers package and BERT with PyTorch.
I'm trying to do text classification with CamembertForSequenceClassification. It works, but now I want to attempt a more challenging task.
I'm referring to this paper. Section 4.1 states:
After training, we drop the softmax activation layer and use BERT's hidden state as the feature vector, which we then use as input for different classification algorithms.
So I checked modeling_bert.py and found this line: attention_probs = nn.Softmax(dim=-1)(attention_scores)
Reading it against the paper, does this mean I should use the attention_scores before they pass through the Softmax function? If so, how can I extract the attention_scores and feed them to a classification algorithm?
In short, what I want to do is take BERT's hidden state and feed it to Logistic Regression and similar classifiers.
Thanks for any help.
Solution
They did not mean that softmax layer; that one is inside BertAttention. They meant the pooler layer on top of BERT.
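To make the distinction concrete, here is a minimal sketch (assuming transformers 4.x and the bert-base-uncased checkpoint; the input sentence is just an example) of where the per-token hidden states and the pooler output live:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token hidden states, shape (batch, seq_len, hidden_size)
sequence_output = outputs.last_hidden_state
# Pooler output: a tanh-activated linear layer over the [CLS] hidden
# state, shape (batch, hidden_size)
pooled_output = outputs.pooler_output
```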
I found the repository referenced in the paper: https://github.com/axenov/politik-news
It seems that when they train, they use the plain BertForSequenceClassification (which goes hidden_states -> pooler activation -> linear classifier -> loss).
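That training path looks roughly like the following sketch; the checkpoint name, label, and num_labels are placeholders I chose for illustration, not values from their repository:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

enc = tokenizer("An example sentence.", return_tensors="pt")
out = model(**enc, labels=torch.tensor([1]))
# out.loss is the cross-entropy of the linear head applied to the
# pooled output; out.logits are the raw class scores
print(out.loss, out.logits)
```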
When they predict, they only use the hidden_states (called sequence_output in modeling_bert.py) and pass them to a different classifier, loaded in BiasPredictor.py:L26.
So if you want to try a different classifier, that is where to plug it in.
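To answer the original question directly, here is a minimal sketch of feeding BERT's hidden state to scikit-learn's LogisticRegression. I use CamembertModel since the question uses CamemBERT; train_texts, train_labels, and test_texts are placeholders for your own data:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
model.eval()

def embed(texts):
    """One feature vector per text: the first-token hidden state."""
    feats = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True)
            out = model(**enc)
            # First-token ([CLS]-position) hidden state; mean pooling
            # over out.last_hidden_state is a common alternative.
            feats.append(out.last_hidden_state[0, 0].numpy())
    return np.stack(feats)

# train_texts, train_labels, test_texts are your own data (placeholders)
X_train = embed(train_texts)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
preds = clf.predict(embed(test_texts))
```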
Answered By - Tareq