Issue
The kernel size of a 3D convolution is defined by depth, height and width in PyTorch or TensorFlow. For example, for CT/MRI image data with 300 slices, the input tensor can be (1,1,300,128,128), corresponding to (N,C,D,H,W). The kernel size can then be (3,3,3) for depth, height and width. During a 3D convolution, the kernel slides along these 3 directions.
However, I am confused about what happens when we move from CT/MRI to a colour video. If the video has 300 frames, the input tensor will be (1,3,300,128,128) because of the 3 channels of RGB images. I know that for a single RGB image, the kernel size can be 3x3x3 for channels, height and width. But for a video, it seems both PyTorch and TensorFlow still use depth, height and width to set the kernel size. My question is: if we still use a kernel of (3,3,3), is there an implicit fourth dimension for the colour channels?
Solution
Yes.
Actually, the convolution operation occurring in a CNN is one dimension higher than its namesake. The kernel always spans the entire channel dimension though, so there is no sliding along the channel dimension. For example, a 2D convolution layer with kernel size set to 5x5 applied to a 3-channel input is actually using a kernel of shape 3x5x5 (assuming channels-first notation). Each output channel is the result of convolving the input with a different 3x5x5 kernel, so there is one of these 3x5x5 kernels for each output channel.
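To make the "no sliding along the channel dimension" point concrete, here is a minimal pure-Python sketch of what a single output channel computes. The function name conv2d_single_output_channel and the toy 7x7 input are hypothetical, chosen only for illustration; like CNN layers, it computes a cross-correlation (no kernel flip), with no padding and stride 1.

```python
def conv2d_single_output_channel(image, kernel):
    """Naive 'valid' cross-correlation of a C x H x W input with a C x kH x kW kernel.

    The kernel slides over height and width only; at every position it
    sums over ALL input channels, so the result is a single 2D map.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k_h, k_w = len(kernel[0]), len(kernel[0][0])
    assert len(kernel) == channels, "kernel must span every input channel"

    out = []
    for i in range(height - k_h + 1):          # slide along height
        row = []
        for j in range(width - k_w + 1):       # slide along width
            total = 0.0
            for c in range(channels):          # no sliding here: full channel span
                for di in range(k_h):
                    for dj in range(k_w):
                        total += image[c][i + di][j + dj] * kernel[c][di][dj]
            row.append(total)
        out.append(row)
    return out


# Toy data: a 3-channel 7x7 input of ones and a 3x5x5 kernel of ones.
image = [[[1.0] * 7 for _ in range(7)] for _ in range(3)]
kernel = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]

feature_map = conv2d_single_output_channel(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 3 3
print(feature_map[0][0])                      # 75.0 (3 channels * 5 * 5 ones)
```

A layer with several output channels would simply repeat this with a different 3x5x5 kernel per output channel and stack the resulting maps.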
This is the same for videos. A 3D convolution layer is actually performing a 4D convolution in the same way. So an input of shape 1x3x300x128x128 with kernel size set to 3x3x3 will actually be performing 4D convolutions with kernels of shape 3x3x3x3 (channels x depth x height x width), one such kernel per output channel.
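One way to check this in PyTorch is to inspect the weight tensor of a Conv3d layer. The sketch below assumes PyTorch is installed; the choice of 8 output channels and the shortened 16-frame, 32x32 clip are arbitrary illustrative values (the clip is smaller than the one in the question only to keep the example fast).

```python
import torch
import torch.nn as nn

# 3 input channels (RGB), 8 output channels (arbitrary), kernel_size=3 -> (3, 3, 3)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3)

# Weight layout is (out_channels, in_channels, kD, kH, kW):
# each of the 8 output channels owns one 3x3x3x3 kernel.
print(tuple(conv3d.weight.shape))  # (8, 3, 3, 3, 3)

# A small video-like input in (N, C, D, H, W) layout.
video = torch.randn(1, 3, 16, 32, 32)
with torch.no_grad():
    output = conv3d(video)

# The channel dimension is consumed entirely; only D, H and W shrink
# (by kernel_size - 1 each, since there is no padding).
print(tuple(output.shape))  # (1, 8, 14, 30, 30)
```

The leading 3 in each per-output kernel is the implicit colour-channel dimension the question asks about; it is taken from in_channels rather than from kernel_size.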
Answered By - jodag