Tuesday, September 27, 2022

[FIXED] Understanding key_dim and num_heads in tf.keras.layers.MultiHeadAttention

Labels: keras, machine-learning, python, pytorch, tensorflow

Issue

For example, I have an input with shape (1, 1000, 10) (so src.shape will be (1, 1000, 10)). Then:

This works:

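import tensorflow as tf

class Model(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # 20 heads, each projecting queries and keys to 9 dimensions
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=20, key_dim=9)
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, src):
        # Self-attention: the output keeps the query shape (1, 1000, 10)
        output = self.attention1(src, src)
        # Flatten the 1000 * 10 features into a single row
        output = tf.reshape(output, [1, 10000])
        output = self.dense(output)
        return output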

And this:

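import tensorflow as tf

class Model(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # 123 heads, each projecting queries and keys to 17 dimensions
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=123, key_dim=17)
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, src):
        # Self-attention: the output keeps the query shape (1, 1000, 10)
        output = self.attention1(src, src)
        # Flatten the 1000 * 10 features into a single row
        output = tf.reshape(output, [1, 10000])
        output = self.dense(output)
        return output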

So, this layer works with whatever num_heads and key_dim, but the sequence length (i.e. 1000) should be divisible by num_heads. Why? Is it a bug? For example, the same code in PyTorch doesn't work. Also, what is key_dim then? Thanks in advance.

Solution

There are two dimensions d_k and d_v in the original paper.

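key_dim corresponds to d_k, the size of the query and key projections for each head. It can be smaller or larger than d_v.
d_v is the size of the value projection for each head. In the paper (and in PyTorch), d_v = embed_dim/num_heads; in Keras it is set by the value_dim argument, which defaults to key_dim.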
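
To make the bookkeeping concrete, here is a minimal sketch of the projection kernel shapes. It assumes the Keras layer projects each input vector of size embed_dim into num_heads separate heads of size key_dim (or value_dim) and then projects back to embed_dim, which matches my reading of the layer's weights. The helper names keras_mha_kernel_shapes and pytorch_style_head_dim are hypothetical and not part of either library. The second helper mirrors the PyTorch convention, where the per-head size is derived from embed_dim, so embed_dim must be divisible by num_heads.

```python
def keras_mha_kernel_shapes(embed_dim, num_heads, key_dim, value_dim=None):
    """Kernel shapes for a Keras-style multi-head attention layer (biases omitted)."""
    # Assumption: value_dim falls back to key_dim when it is not given.
    value_dim = key_dim if value_dim is None else value_dim
    return {
        "query": (embed_dim, num_heads, key_dim),
        "key": (embed_dim, num_heads, key_dim),
        "value": (embed_dim, num_heads, value_dim),
        # The output projection maps all heads back to embed_dim.
        "output": (num_heads, value_dim, embed_dim),
    }


def pytorch_style_head_dim(embed_dim, num_heads):
    """Per-head size when it is derived from embed_dim instead of chosen freely."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads


shapes = keras_mha_kernel_shapes(embed_dim=10, num_heads=20, key_dim=9)
print(shapes["query"])   # (10, 20, 9)
print(shapes["output"])  # (20, 9, 10)

try:
    pytorch_style_head_dim(embed_dim=10, num_heads=20)
except ValueError as err:
    print(err)  # embed_dim must be divisible by num_heads
```

Note that the sequence length (1000 in the question) appears in none of these shapes, so under these assumptions it places no constraint on num_heads.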

In their paper, Vaswani et al. set d_k = d_v. This, however, is not required. Conceptually, you can have d_k << d_v or even d_k >> d_v. In the former, you will have dimensionality reduction for each key/query in each head and in the latter, you will have dimensionality expansion for each key/query in each attention head.
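
The independence of d_k and d_v can be checked with a small NumPy sketch of multi-head self-attention. This is an illustration with random weights and no biases, not the Keras implementation; the function name multi_head_attention and its parameters are assumptions made for this example. A sequence length of 50 is used instead of 1000 only to keep the attention matrices small.

```python
import numpy as np


def multi_head_attention(x, num_heads, d_k, d_v, seed=0):
    """Self-attention on x of shape (batch, seq_len, embed_dim) with random weights."""
    rng = np.random.default_rng(seed)
    embed_dim = x.shape[-1]

    # Per-head projection kernels; d_k and d_v are chosen independently.
    w_q = rng.normal(size=(embed_dim, num_heads, d_k))
    w_k = rng.normal(size=(embed_dim, num_heads, d_k))
    w_v = rng.normal(size=(embed_dim, num_heads, d_v))
    w_o = rng.normal(size=(num_heads, d_v, embed_dim))

    q = np.einsum("bse,ehd->bhsd", x, w_q)  # (batch, heads, seq, d_k)
    k = np.einsum("bse,ehd->bhsd", x, w_k)  # (batch, heads, seq, d_k)
    v = np.einsum("bse,ehd->bhsd", x, w_v)  # (batch, heads, seq, d_v)

    # Scaled dot-product scores, then a numerically stable softmax over keys.
    scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    heads = np.einsum("bhqk,bhkd->bhqd", weights, v)  # (batch, heads, seq, d_v)
    # Merge the heads and project back to embed_dim.
    return np.einsum("bhsd,hde->bse", heads, w_o)


x = np.ones((1, 50, 10))
for heads, d_k, d_v in [(20, 9, 9), (123, 17, 17), (4, 2, 32)]:
    print(multi_head_attention(x, heads, d_k, d_v).shape)  # (1, 50, 10) each time
```

Because the output projection maps num_heads * d_v features back to embed_dim, the output shape matches the input regardless of how num_heads, d_k and d_v are chosen.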

Answered By - Anirban Mukherjee

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
