Issue
I was reading the code for the paper "Attention Is All You Need" (code linked here).
I came across the term temperature. How is it related to the Q, K, V formula for attention?
My understanding of Self Attention is
Attention = Softmax(matmul(Q,K.T),dim=-1)
Output = matmul(Attention, V)
I'm looking for something to correct or augment my understanding.
Solution
In the paper, "Scaled Dot-Product Attention" is defined as:
Attention(Q, K, V) = matmul(softmax(matmul(Q, K.T) / sqrt(dk)), V)
where dk is the dimension of the queries (Q) and keys (K).
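For reference, here is a minimal PyTorch sketch of that formula (the function name and assumed shapes are mine, not from the repo):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, d_k); d_k is the size of the last dimension of q and k
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5  # scale by sqrt(d_k)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)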
In the implementation, temperature is the square root of dk; it is set where ScaledDotProductAttention is instantiated in the __init__ of the MultiHeadAttention class:
self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)
and it is used in the ScaledDotProductAttention class, which implements the formula above:
attn = torch.matmul(q / self.temperature, k.transpose(2, 3))
ScaledDotProductAttention class: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/fec78a687210851f055f792d45300d27cc60ae41/transformer/Modules.py#L7
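Note that dividing q by the temperature before the matmul is equivalent to dividing the resulting scores by sqrt(dk) afterwards, since matmul is linear in q. A quick sanity check (the shapes here are just an illustration, not taken from the repo):

import torch

d_k = 64
q = torch.randn(2, 8, 10, d_k)  # (batch, heads, seq_len, d_k)
k = torch.randn(2, 8, 10, d_k)
temperature = d_k ** 0.5

scores_repo  = torch.matmul(q / temperature, k.transpose(2, 3))   # as in the repo
scores_paper = torch.matmul(q, k.transpose(2, 3)) / temperature   # as in the paper's formula
print(torch.allclose(scores_repo, scores_paper, atol=1e-5))       # True (up to float rounding)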
Answered By - Olga Bogdanova