Issue
I'm using the following code to extract descriptors from images with a Vision Transformer (vit_b_16), but I get the error: RuntimeError: shape '[128, 3, 5, 4, 5, 4]' is invalid for input of size 185856. Does anyone know what I'm doing wrong and how I can fix it?
import torch
import torch.nn as nn
import pytorch_lightning as pl

def img_to_patch(x, patch_size, flatten_channels=True):
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // patch_size, patch_size, W // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)  # [B, H', W', C, p_H, p_W]
    x = x.flatten(1, 2)              # [B, H'*W', C, p_H, p_W]
    if flatten_channels:
        x = x.flatten(2, 4)          # [B, H'*W', C*p_H*p_W]
    return x
class AttentionBlock(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.layer_norm_2 = nn.LayerNorm(embed_dim)
        self.linear = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        inp_x = self.layer_norm_1(x)
        x = x + self.attn(inp_x, inp_x, inp_x)[0]
        x = x + self.linear(self.layer_norm_2(x))
        return x
class VisionTransformer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_channels, num_heads,
                 num_layers, num_classes, patch_size, num_patches, dropout=0.0):
        super().__init__()
        self.patch_size = patch_size
        # Layers/Networks
        self.input_layer = nn.Linear(num_channels * (patch_size ** 2), embed_dim)
        self.transformer = nn.Sequential(
            *[AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout)
              for _ in range(num_layers)]
        )
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes)
        )
        self.dropout = nn.Dropout(dropout)
        # Parameters/Embeddings
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, 1 + num_patches, embed_dim))

    def forward(self, x):
        # Preprocess input
        print(x.shape)
        x = img_to_patch(x, self.patch_size)
        print(x.shape)
        B, T, _ = x.shape
        x = self.input_layer(x)
        # Add CLS token and positional encoding
        cls_token = self.cls_token.repeat(B, 1, 1)
        x = torch.cat([cls_token, x], dim=1)
        x = x + self.pos_embedding[:, :T + 1]
        # Apply Transformer
        x = self.dropout(x)
        x = x.transpose(0, 1)
        x = self.transformer(x)
        # Perform classification prediction
        cls = x[0]
        out = self.mlp_head(cls)
        return out
class ViT(pl.LightningModule):
    def __init__(self, model_kwargs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.model = VisionTransformer(**model_kwargs)
        # self.example_input_array = next(iter(train_loader))[0]

    def forward(self, x):
        return self.model(x)
and I'm initialising the Transformer like this:
if network_variant == 'vb16':
    net = ViT(model_kwargs={
                  'embed_dim': 256,
                  'hidden_dim': 512,
                  'num_heads': 8,
                  'num_layers': 6,
                  'patch_size': 4,
                  'num_channels': 3,
                  'num_patches': 64,
                  'num_classes': num_classes,
                  'dropout': 0.2
              },
              lr=3e-4)
This is my first time using PyTorch and Vision Transformers, so I'm really not sure what I'm doing wrong.
Solution
The error is saying that PyTorch is trying to reshape the data into a tensor with dimensions 128*3*5*4*5*4, which requires 153,600 elements. However, the data before the reshape has 185,856 elements. Most likely you're miscounting some indices. The difference is 32,256 = 128*4*3*21, which provides some hints about which indices you're probably miscounting... The only place you're calling reshape() is the second line of img_to_patch(), so I assume that's where the error comes from. (Including the traceback in your question would help confirm this.)
Check the sizes of the variables leading up to that line and confirm they're what you expect.
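To see this concretely, you can back out the actual image size from the numbers in the error message. A quick sanity check, assuming the layout [B, C, H, W] with B=128 and C=3 (which matches the first two entries of the error shape):

import math

total = 185856              # elements in the tensor being reshaped
B, C = 128, 3               # from the error shape [128, 3, 5, 4, 5, 4]
pixels = total // (B * C)   # 484 pixels per image per channel
side = math.isqrt(pixels)   # 22, assuming square images
print(side, side % 4)       # 22 2 -> 22 is not a multiple of patch_size=4

So the images are apparently 22x22, while the reshape assumes 20x20 (5 patches of size 4 per side), which accounts exactly for the missing 32,256 elements.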
Looking at your code, the most likely problem is that you're rounding off when dividing H and W by patch_size. If H and W aren't multiples of patch_size, you'll need something to decide which pixels to drop: reshape() won't make that decision on its own.
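One way to handle this is to pad (or crop) the batch so that H and W are exact multiples of patch_size before patchifying. Here's a minimal sketch; pad_to_multiple is just an illustrative helper name, and zero-padding the bottom/right edges is one arbitrary choice among several (center-cropping would work just as well):

import torch
import torch.nn.functional as F

def pad_to_multiple(x, patch_size):
    # Zero-pad the bottom/right of a [B, C, H, W] batch so that
    # H and W become multiples of patch_size.
    B, C, H, W = x.shape
    pad_h = (-H) % patch_size   # e.g. 22 -> pad by 2 to reach 24
    pad_w = (-W) % patch_size
    # F.pad's 4-tuple is (left, right, top, bottom) over the last two dims
    return F.pad(x, (0, pad_w, 0, pad_h))

x = torch.randn(128, 3, 22, 22)
x = pad_to_multiple(x, patch_size=4)     # [128, 3, 24, 24]
patches = img_to_patch(x, patch_size=4)  # [128, 36, 48] -- no error

With 22x22 inputs padded to 24x24 you get 36 patches of length 3*4*4 = 48, which still fits under the num_patches=64 positional embedding in your config, since the forward pass only slices pos_embedding[:, :T+1].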
Answered By - Sarah Messer