Issue
I would like to use the state-of-the-art LM T5 to get sentence embedding vectors. I found this repository: https://github.com/UKPLab/sentence-transformers. As far as I know, with BERT I should take the first token, the [CLS] token, as the sentence embedding. In this repository I see the same behaviour for the T5 model:
cls_tokens = output_tokens[:, 0, :] # CLS token is first token
Is this behaviour correct? I took the encoder from T5 and encoded two phrases with it:
"I live in the kindergarden"
"Yes, I live in the kindergarden"
The cosine similarity between them was only "0.2420".
I just need to understand how sentence embeddings work: do I need to train the network on a similarity task to get correct results, or is the base pretrained language model enough?
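For reference, the cosine similarity figure above can be reproduced for any pair of pooled embedding vectors with torch. This is a minimal sketch using made-up stand-in vectors, not actual T5 outputs:

```python
import torch
import torch.nn.functional as F

# Hypothetical sentence-embedding vectors (stand-ins for pooled T5 outputs).
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.0, 2.0, 2.5])

# cosine_similarity expects batched inputs; the result lies in [-1, 1],
# where 1.0 means the two vectors point in the same direction.
sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0), dim=1).item()
```

A low value such as 0.24 for near-identical sentences suggests the raw pooled vectors are not yet well suited to similarity comparison.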
Solution
To obtain a sentence embedding from T5, take the last_hidden_state from the T5 encoder output:
output = model.encoder(input_ids=s, attention_mask=attn, return_dict=True)
# pooled_sentence holds the embedding of each token in the sentence
pooled_sentence = output.last_hidden_state  # shape is [batch_size, seq_len, hidden_size]
# average (or sum) over the seq_len dimension to get one vector per sentence
pooled_sentence = torch.mean(pooled_sentence, dim=1)
You now have sentence embeddings from T5.
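One caveat with plain torch.mean: in a padded batch it averages over padding positions too. A mask-aware variant weights tokens by the attention mask. Below is a sketch using dummy tensors in place of the real encoder output; in practice last_hidden_state would come from model.encoder(...) as shown above:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where attention_mask == 0."""
    mask = attention_mask.unsqueeze(-1).float()     # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)  # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)        # [batch, 1], avoid div by 0
    return summed / counts

# Dummy stand-in for encoder output: batch of 2, seq_len 4, hidden size 3.
hidden = torch.ones(2, 4, 3)
hidden[0, 2:] = 5.0  # positions 2-3 of sample 0 simulate padding noise
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)  # shape [2, 3]; padding positions ignored
```

Here pooled[0] averages only the first two (unmasked) positions, so the 5.0 padding values do not leak into the sentence embedding.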
Answered By - Mihai Ilie