Issue
My input dataframe (or 2D numpy array) has shape 11k x 12k.
For each element, I count how many values in its row and column are higher.
For example:
array([[1, 3, 1, 5, 3],
       [3, 7, 3, 2, 1],
       [2, 3, 1, 8, 9]])
row-wise:
3 1 3 0 1
1 0 1 3 4
3 2 4 1 0
column-wise:
2 1 1 1 1
0 0 0 2 2
1 1 1 0 0
total higher values for that element:
5 2 4 1 2
1 0 1 5 6
4 3 5 1 0
This code works, but for a matrix of this shape it takes ~1 hour:
dfT = df.T
rows = df.apply(lambda x: np.sum(df > x), axis=1)
cols = dfT.apply(lambda x: np.sum(dfT > x), axis=1).T
output = rows + cols
Is there a more efficient way to do this?
I also tried numpy, splitting the 2D array into 12k x 100 or 12k x 200 chunks and merging the arrays again at the end, but the runtime came out about the same, so I made no progress:
np.sum(matrix > matrix[:,None], axis=1)
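For reference, here is a minimal sketch of how that one-liner could be applied one row at a time (the function name and loop are illustrative, not from the original post). Note that it still performs all the pairwise comparisons within each row, which is consistent with chunking not changing the runtime much:

import numpy as np

def higher_in_rows(matrix):
    # Apply the broadcasted comparison row by row, so the temporary
    # (n x n) boolean mask only covers one row at a time.
    out = np.empty(matrix.shape, dtype=np.int64)
    for i, row in enumerate(matrix):
        # (row > row[:, None])[k, j] is True when row[j] > row[k];
        # summing over axis=1 counts the higher values for each element.
        out[i] = np.sum(row > row[:, None], axis=1)
    return out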
Solution
I came up with the following:
import numpy as np
from scipy.stats import rankdata

def element_wise_bigger_than(x, axis):
    # Rank along the given axis, then invert: the length along that axis
    # minus the rank gives the count of strictly larger elements.
    return x.shape[axis] - rankdata(x, method='max', axis=axis)
What you want is similar to a ranking of the elements: a rank basically tells you how many items are smaller than a given element (possibly minus 1, if your ranks start at 1, 2, 3, ...). rankdata does this ranking, and can do it row-wise or column-wise. We can convert "smaller than" to "bigger than" by inverting the ranks relative to the number of items being counted in each row/column, i.e., by subtracting the ranks from the number of rows/columns.
For rows/columns without ties (e.g., the third row in your example), this works as expected. There is a slight complication with ties and how they are ranked. Through trial and error, I found that method='max' gave the desired behavior, and on reflection it makes sense: with method='max', every element (tied or not) receives a rank equal to the number of elements less than or equal to it, so subtracting that rank from the row/column length leaves exactly the number of strictly greater elements.
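A small worked example on the first row of the input (the variable names are just for illustration):

row = np.array([1, 3, 1, 5, 3])
ranks = rankdata(row, method='max')  # [2, 4, 2, 5, 4]: ties share the highest rank in their group
row.size - ranks                     # [3, 1, 3, 0, 1]: the row-wise counts from the question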
Still, it gives the desired output on the input data:
>>> x = np.array([[1, 3, 1, 5, 3], [3, 7, 3, 2, 1], [2, 3, 1, 8, 9]])
>>> element_wise_bigger_than(x, 1)
array([[3, 1, 3, 0, 1],
       [1, 0, 1, 3, 4],
       [3, 2, 4, 1, 0]])
>>> element_wise_bigger_than(x, 0)
array([[2, 1, 1, 1, 1],
       [0, 0, 0, 2, 2],
       [1, 1, 1, 0, 0]])
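Summing the two calls reproduces the "total higher values" table from the question:

>>> element_wise_bigger_than(x, 1) + element_wise_bigger_than(x, 0)
array([[5, 2, 4, 1, 2],
       [1, 0, 1, 5, 6],
       [4, 3, 5, 1, 0]])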
And it can handle your size of data in far less than an hour:
def experiment():
    x = np.random.randint(0, 10, size=(11000, 12000))
    rows = element_wise_bigger_than(x, 1)
    cols = element_wise_bigger_than(x, 0)
    output = rows + cols
    return output

%%time
experiment()
# CPU times: user 8.68 s, sys: 805 ms, total: 9.49 s
# Wall time: 9.47 s
It takes about 10 seconds. My understanding is that ranking is essentially a sorting operation, so each row/column costs O(n log n) rather than the O(n^2) of all pairwise comparisons. The answer also relies on scipy/numpy's compiled implementation of the rank/sort rather than a user-written loop or "apply" (which tends to be much slower).
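To make the sorting connection concrete, here is a sketch of a pure-numpy equivalent for the row-wise count (the function name is mine, not from the answer), using np.sort plus np.searchsorted in place of rankdata:

import numpy as np

def bigger_than_rows(x):
    # searchsorted(..., side='right') against the sorted row gives, for
    # each value v, the number of elements <= v; subtracting that from
    # the row length leaves the number of strictly greater elements.
    out = np.empty(x.shape, dtype=np.int64)
    for i, row in enumerate(x):
        s = np.sort(row)
        out[i] = row.size - np.searchsorted(s, row, side='right')
    return out

On the example input this matches element_wise_bigger_than(x, 1); the explicit Python loop over rows makes it slower than the fully vectorized rankdata call, but it shows the same sort-and-count idea.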
Answered By - Tom