Issue
I'm using the following code to extract descriptors from images with a Vision Transformer (vit_b_16), but I get the error: RuntimeError: shape '[128, 3, 5, 4, 5, 4]' is invalid for input of size 185856. Does anyone know what I'm doing wrong and how I can fix it?
import torch
import torch.nn as nn
import pytorch_lightning as pl

def img_to_patch(x, patch_size, flatten_channels=True):
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // patch_size, patch_size, W // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)  # [B, H', W', C, p_H, p_W]
    x = x.flatten(1, 2)              # [B, H'*W', C, p_H, p_W]
    if flatten_channels:
        x = x.flatten(2, 4)          # [B, H'*W', C*p_H*p_W]
    return x
class AttentionBlock(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.layer_norm_2 = nn.LayerNorm(embed_dim)
        self.linear = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        inp_x = self.layer_norm_1(x)
        x = x + self.attn(inp_x, inp_x, inp_x)[0]
        x = x + self.linear(self.layer_norm_2(x))
        return x
class VisionTransformer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_channels, num_heads,
                 num_layers, num_classes, patch_size, num_patches, dropout=0.0):
        super().__init__()
        self.patch_size = patch_size
        # Layers/Networks
        self.input_layer = nn.Linear(num_channels * (patch_size ** 2), embed_dim)
        self.transformer = nn.Sequential(
            *[AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout)
              for _ in range(num_layers)]
        )
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes)
        )
        self.dropout = nn.Dropout(dropout)
        # Parameters/Embeddings
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, 1 + num_patches, embed_dim))

    def forward(self, x):
        # Preprocess input
        print(x.shape)
        x = img_to_patch(x, self.patch_size)
        print(x.shape)
        B, T, _ = x.shape
        x = self.input_layer(x)
        # Add CLS token and positional encoding
        cls_token = self.cls_token.repeat(B, 1, 1)
        x = torch.cat([cls_token, x], dim=1)
        x = x + self.pos_embedding[:, :T + 1]
        # Apply Transformer
        x = self.dropout(x)
        x = x.transpose(0, 1)
        x = self.transformer(x)
        # Perform classification prediction
        cls = x[0]
        out = self.mlp_head(cls)
        return out
class ViT(pl.LightningModule):
    def __init__(self, model_kwargs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.model = VisionTransformer(**model_kwargs)
        # self.example_input_array = next(iter(train_loader))[0]

    def forward(self, x):
        return self.model(x)
and I'm initialising the Transformer like this:
if network_variant == 'vb16':
    net = ViT(model_kwargs={
                  'embed_dim': 256,
                  'hidden_dim': 512,
                  'num_heads': 8,
                  'num_layers': 6,
                  'patch_size': 4,
                  'num_channels': 3,
                  'num_patches': 64,
                  'num_classes': num_classes,
                  'dropout': 0.2
              },
              lr=3e-4)
This is my first time using PyTorch and Vision Transformers, so I'm really not sure what I'm doing wrong.
Solution
The error is saying that PyTorch is trying to reshape the data into a tensor with dimensions 128*3*5*4*5*4, which requires 153,600 elements. However, the data before the reshape has 185,856 elements. Most likely you're miscounting some indices. The difference is 32,256 = 128*4*3*21, which provides some hints about which indices you're probably miscounting... The only place you're calling reshape() is the second line of img_to_patch(), so I assume that's where the error comes from. (Including the traceback in your question would help confirm this.)
Check the sizes of the variables leading up to that line and confirm they're what you expect.
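To see this concretely, you can back out the actual image size from the numbers in the error message. A quick sanity check, assuming the layout [B, C, H, W] with B=128 and C=3 (which matches the first two entries of the error shape):

import math

total = 185856              # elements in the tensor being reshaped
B, C = 128, 3               # from the error shape [128, 3, 5, 4, 5, 4]
pixels = total // (B * C)   # 484 pixels per image per channel
side = math.isqrt(pixels)   # 22, assuming square images
print(side, side % 4)       # 22 2 -> 22 is not a multiple of patch_size=4

So the images are apparently 22x22, while the reshape assumes 20x20 (5 patches of size 4 per side), which accounts exactly for the missing 32,256 elements.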
Looking at your code, the most likely problem is that you're rounding off when dividing H and W by patch_size. If H and W aren't multiples of patch_size, you'll need something to decide which pixels to drop: reshape() won't make that decision on its own.
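One way to handle this is to pad (or crop) the batch so that H and W are exact multiples of patch_size before patchifying. Here's a minimal sketch; pad_to_multiple is just an illustrative helper name, and zero-padding the bottom/right edges is one arbitrary choice among several (center-cropping would work just as well):

import torch
import torch.nn.functional as F

def pad_to_multiple(x, patch_size):
    # Zero-pad the bottom/right of a [B, C, H, W] batch so that
    # H and W become multiples of patch_size.
    B, C, H, W = x.shape
    pad_h = (-H) % patch_size   # e.g. 22 -> pad by 2 to reach 24
    pad_w = (-W) % patch_size
    # F.pad's 4-tuple is (left, right, top, bottom) over the last two dims
    return F.pad(x, (0, pad_w, 0, pad_h))

x = torch.randn(128, 3, 22, 22)
x = pad_to_multiple(x, patch_size=4)     # [128, 3, 24, 24]
patches = img_to_patch(x, patch_size=4)  # [128, 36, 48] -- no error

With 22x22 inputs padded to 24x24 you get 36 patches of length 3*4*4 = 48, which still fits under the num_patches=64 positional embedding in your config, since the forward pass only slices pos_embedding[:, :T+1].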
Answered By - Sarah Messer