Monday, January 3, 2022

[FIXED] Grouping indices of unique elements in numpy

January 03, 2022 numpy, python, python-2.7 No comments

Issue

I have many large (>100,000,000) lists of integers that contain many duplicates. I want to get the indices where each of the element occur. Currently I am doing something like this:

import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d=defaultdict(list)
for i,e in enumerate(a):
    d[e].append(i)

d
defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})

This method of iterating through each element is time consuming. Is there a efficient or vectorized way to do this?

Edit1 I tried the methods of Acorbe and Jaime on the following

a = np.random.randint(2000, size=10000000)

The results are

original: 5.01767015457 secs
Acorbe: 6.11163902283 secs
Jaime: 3.79637312889 secs

Solution

This is very similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality, see here:

>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])

and now:

>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1, 1], dtype=int64)

To get the positional indices for each values, we simply do:

>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
 array([3], dtype=int64), array([2], dtype=int64)]

And you can now construct your dictionary zipping unq_items and unq_idx.

Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you wanted to have all the values you could do:

>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))

Answered By - Jaime

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 3, 2022

[FIXED] Grouping indices of unique elements in numpy

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels