Issue
I'm currently writing a CUDA kernel for a custom operation (an activation) for PyTorch, but I'm quite unfamiliar with any form of GPU programming. For reference, I was following the Custom C++ & CUDA extension tutorial.
A simplified example of the sort of operation I want to do:
Let's say I have an input tensor, X_in, which can be of any shape and dims (e.g. something like (16, 3, 50, 100)). Let's say I also have a 1D tensor, T (for example, T can be a tensor of shape (100,)).
For each value in X_in, I want to calculate an "index" value that should be < len(T). The output would then basically be the value at that index in T, multiplied by or added to some constant. This is something like a "lookup table" operation.
An example kernel:
template <typename scalar_t>
__global__ void inplace_lookup_kernel(
    const scalar_t* __restrict__ T,
    scalar_t* __restrict__ X_in,
    const int N,        // length of T
    const int numel) {  // total number of elements in X_in
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numel) return;  // guard the surplus threads of a rounded-up launch
  const int idx = static_cast<int>(X_in[i]) % N;
  X_in[i] = 5 * T[idx] - 3;
}
I also wish to do the operation in-place, which is why the output is being computed into X_in.
My question is: for an operation like this, which is applied pointwise to each value of X_in, how do I determine a good number of threads/blocks to launch? In the Custom C++ & CUDA extension tutorial, they do so as follows:
const int threads = 1024;
const dim3 blocks((state_size + threads - 1) / threads, batch_size);
For their use case, the operation (an LSTM variant) has a specific input format, and thus a fixed number of dimensions, from which the blocks can be calculated. But the operation I'm writing should accept inputs of any shape and number of dimensions. What is the right way to calculate the number of blocks in this situation?
I'm currently doing the following:
const int threads = 1024;
const int nelems = int(X_in.numel());                 // total element count, independent of shape
const dim3 blocks((nelems + threads - 1) / threads);  // 1-D grid, rounded up
However, I'm doing this by intuition rather than with any certainty. Is there a better or correct way to do this? And is there any computational advantage to defining the blocks in the format blocks(otherdim_size, batch_size), like in the tutorial?
Solution
I'm speculating here, but since your operation seems to be completely elementwise (w.r.t. X_in), and you don't seem to be using interesting SM-specific resources like shared memory, nor a lot of registers per thread, I don't think the grid partition matters all that much. Just treat X_in as a 1-D array according to its layout in memory, and use a 1-D grid with a block size of, oh, 256, or 512, or 1024.
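For illustration, a host-side launch along those lines might look roughly like the following. This is only a sketch: the wrapper name inplace_lookup_cuda and the choice of 256 threads are mine, the AT_DISPATCH_FLOATING_TYPES pattern is the one from the tutorial you mention, and it assumes X_in is contiguous so that flattening it by memory layout is valid.

#include <torch/extension.h>

// Launch the kernel over X_in flattened to 1-D, regardless of its shape.
torch::Tensor inplace_lookup_cuda(torch::Tensor X_in, torch::Tensor T) {
  TORCH_CHECK(X_in.is_contiguous(), "X_in must be contiguous to treat it as a flat array");
  const int64_t nelems = X_in.numel();
  const int threads = 256;  // 512 or 1024 are also reasonable choices
  const int blocks = static_cast<int>((nelems + threads - 1) / threads);  // round up

  AT_DISPATCH_FLOATING_TYPES(X_in.scalar_type(), "inplace_lookup_cuda", ([&] {
    inplace_lookup_kernel<scalar_t><<<blocks, threads>>>(
        T.data_ptr<scalar_t>(),
        X_in.data_ptr<scalar_t>(),
        static_cast<int>(T.size(0)),
        static_cast<int>(nelems));
  }));
  return X_in;
}

If X_in isn't guaranteed to be contiguous, you'd have to either make it contiguous first (which copies, defeating the in-place intent) or index through its strides instead of flattening.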
Of course - always try out your choices to make sure you don't get bitten by unexpected behavior.
Answered By - einpoklum