Issue
I want to implement a function that produces a "reverse" map of the inverse values returned by torch.unique.
For example, torch.unique can return the unique values uniques of a long-typed input x and an inverse tensor that is a mapping from uniques back to x:
x = torch.LongTensor([9, 10, 9, 9, 10, 9])
uniques, inverse, counts = torch.unique(x, return_inverse=True, return_counts=True)
# uniques = [9, 10]
# inverse = [0, 1, 0, 0, 1, 0]
# counts = [4, 2]
print((uniques[inverse] == x).all()) # True
My question: is there an efficient way to get a "reverse" inverse, back_map, that maps from x to uniques?
def reverse_unique(x): ...
uniques, inverse, counts, back_map = reverse_unique(x)
# uniques = [9, 10]
# inverse = [0, 1, 0, 0, 1, 0]
# counts = [4, 2]
# back_map = [0, 2, 3, 5, 1, 4]
print((x[back_map] == uniques.repeat_interleave(counts)).all()) # True
In the code above, back_map maps the values of inverse to the corresponding positions in the input x.
I know it is not difficult to implement this function with a Python loop, but in my case the size of the input x can reach 1e8, so the time overhead is intolerable.
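Just to pin down the intended semantics, here is a minimal loop-based sketch of what I mean (the helper name reverse_unique_loop is only for illustration; this is exactly the kind of Python loop that is too slow at my input sizes):
import torch

def reverse_unique_loop(x):
    # Naive reference: group the positions of x by unique value with a Python loop.
    uniques, inverse, counts = torch.unique(x, return_inverse=True, return_counts=True)
    buckets = [[] for _ in range(uniques.numel())]
    for pos, u in enumerate(inverse.tolist()):
        buckets[u].append(pos)
    back_map = torch.tensor([p for bucket in buckets for p in bucket], dtype=torch.long)
    return uniques, inverse, counts, back_map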
So, is there a high-level implementation using the PyTorch API, or a CUDA kernel? (I tried to parallelize it with a CUDA extension, but my CUDA kernel is extremely slow. :sob:)
__global__ void unique_back_map_kernel(
    int32_t num_uni,
    int32_t num_x,
    int64_t* __restrict__ uniques,
    int64_t* __restrict__ cumsum_counts,
    int64_t* __restrict__ x,
    int64_t* __restrict__ out) {
  int32_t n = blockIdx.x * blockDim.x + threadIdx.x;
  if (n >= num_uni) {
    return;
  }
  // One thread per unique value: scan the whole input and write the positions
  // holding this value after the exclusive prefix sum of the counts.
  // Total work is O(num_uni * num_x), which is presumably why this is so slow.
  size_t counts = 0;
  auto idx = __ldg(&cumsum_counts[n]);
  #pragma unroll
  for (int64_t i = 0; i < num_x; ++i) {
    if (x[i] == uniques[n]) {
      out[idx + counts] = i;
      counts++;
    }
  }
}
std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> reverse_unique(at::Tensor x) {
  TORCH_CHECK(x.dtype() == at::kLong && x.device().is_cuda());
  auto [uniques, inverse, counts] = at::_unique2(x, false, true, true);
  counts.cumsum_(0);
  // python: cumsum_counts = torch.cat([torch.tensor([0]), cumsum_counts[:-1]])
  auto cumsum_counts = at::cat({at::zeros({1}, counts.options()), counts.slice(0, 0, -1)});
  auto back_map = at::empty_like(x);
  int32_t threads = (uniques.numel() > 256) ? 256 : 32;
  int32_t blocks = (uniques.numel() + threads - 1) / threads;
  unique_back_map_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
      uniques.numel(),
      x.numel(),
      uniques.data_ptr<int64_t>(),
      cumsum_counts.data_ptr<int64_t>(),
      x.data_ptr<int64_t>(),
      back_map.data_ptr<int64_t>());
  return std::make_tuple(uniques, inverse, cumsum_counts, back_map);
}
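For completeness, this is roughly how such an extension could be built and called from Python with torch.utils.cpp_extension.load; the source file name is a placeholder, and it assumes the usual PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) binding that exposes reverse_unique:
import torch
from torch.utils.cpp_extension import load

# Placeholder source path; the .cu file is assumed to contain the kernel,
# the reverse_unique() host function, and a pybind11 binding for it.
ext = load(name="reverse_unique_ext", sources=["reverse_unique.cu"], verbose=True)

x = torch.randint(0, 100, (10_000,), dtype=torch.long, device="cuda")
uniques, inverse, cumsum_counts, back_map = ext.reverse_unique(x)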
Solution
@alexey-birukov probably means this:
If you read the documentation for torch.unique, you will learn that it basically just sorts the values and then does a torch.unique_consecutive:
Currently in the CUDA implementation and the CPU implementation when dim is specified, torch.unique always sort [sic] the tensor at the beginning regardless of the sort argument. Sorting could be slow, so if your input tensor is already sorted, it is recommended to use torch.unique_consecutive() which avoids the sorting.
Another look at the documentation of torch.sort shows that it returns not only the sorted values but also the indices of those values in the original tensor, i.e. the permutation such that x[indices] gives the sorted values. This permutation seems to be exactly what you are after.
x = torch.LongTensor([9, 10, 9, 9, 10, 9])
sorted, back_map = torch.sort(x)
uniques, inverse, counts = torch.unique_consecutive(sorted, return_inverse=True, return_counts=True)
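To connect this back to the check in the question, something along these lines should hold; note that the inverse returned here indexes the sorted tensor, so the scatter back to the original order (inverse_x) is an extra step not shown above:
# Same property the question verifies for back_map:
print((x[back_map] == uniques.repeat_interleave(counts)).all())  # True

# `inverse` indexes into `sorted`, not `x`; scatter it through the sort
# permutation if the inverse for the original order is needed:
inverse_x = torch.empty_like(inverse)
inverse_x[back_map] = inverse
print((uniques[inverse_x] == x).all())  # True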
Answered By - paleonix