Monday, November 20, 2023

[FIXED] Compute the order of non-unique array elements

November 20, 2023 arrays, numpy, python No comments

Issue

I'm looking for an efficient method to compute the "order" of each item in a numpy array, with "order" defined as the number of preceding elements equal to the element. Example:

order([4, 2, 3, 2, 6, 4, 4, 6, 2, 4])
[0 0 0 1 0 1 2 1 2 3]

Current solution loops in pure Python and is not fast enough:

def order(A):
    cnt = defaultdict(int)
    O = np.zeros_like(A)
    for i, r in enumerate(A):
        O[i] = cnt[r]
        cnt[r] += 1
    return O

I'm using order to implement scatter:

def scatter(A, c):
    R = A % c
    I = c * order(R) + R
    B = np.full(np.max(I) + 1, -1)
    B[I] = A
    return B

This is useful for multi-threading. For example, if the scattered array contains addresses to write to then no two threads processing the array in parallel will see the same address.

Question is are there numpy built-ins that I'm missing that I can use to make order faster and to remove the explicit looping?

Solution

Since what you're doing is essentially a Pandas cumcount, and Pandas uses NumPy internally, one idea would be to look at how they implemented cumcount, and do the same thing.

If you read the Pandas code for cumcount, it is internally implemented in this way:

Sort the array, keeping track of where each element came from.
Compare each element of the sorted array to the next element. If it is different, it is the start of a new run. (run)
Work out the length of each group. (rep)
Do a cumulative sum, incrementing by 1 for each element which is not part of a new run. (out)
Track how much each group is affected by groups before it which should not count. (out[run])
Repeat the value to subtract by rep.
Undo the initial sort to put elements back in their original place.

Here's how to do the same thing without relying on any of Pandas.

def order(array):
    # https://github.com/pandas-dev/pandas/blob/v1.3.5/pandas/core/groupby/groupby.py#L1493
    if len(array) == 0:
        return np.array([])
    count = len(array)
    # Can remove 'stable' here to increase speed if you
    # don't care what order the order is assigned in
    ind = np.argsort(array, kind='stable')
    array = array[ind]
    run = np.r_[True, array[:-1] != array[1:]]
    rep = np.diff(np.r_[np.nonzero(run)[0], count])
    out = (~run).cumsum()
    out -= np.repeat(out[run], rep)
    rev = np.empty(count, dtype=np.intp)
    rev[ind] = np.arange(count, dtype=np.intp)
    out = out[rev]
    return out

I find that this is approx 10x faster for arrays 1000 elements and larger.

Answered By - Nick ODell

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, November 20, 2023

[FIXED] Compute the order of non-unique array elements

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels