Issue
My input dataframe (or 2D numpy array) has shape 11k x 12k.
For each element, I count how many values in its row and column are higher.
For example:
array([[1, 3, 1, 5, 3],
       [3, 7, 3, 2, 1],
       [2, 3, 1, 8, 9]])
row-wise:
3 1 3 0 1
1 0 1 3 4
3 2 4 1 0
column-wise:
2 1 1 1 1
0 0 0 2 2
1 1 1 0 0
total higher values for that element:
5 2 4 1 2
1 0 1 5 6
4 3 5 1 0
This code works, but for a matrix of this shape it takes ~1 hour:
dfT = df.T
rows = df.apply(lambda x: np.sum(df > x), axis=1)
cols = dfT.apply(lambda x: np.sum(dfT > x), axis=1).T
output = rows + cols
Is there a more efficient way to do this?
I also tried numpy, splitting the 2D array into 12k x 100 or 12k x 200 chunks and merging the arrays again at the end, but the runtime came out about the same, so I made no progress:
np.sum(matrix > matrix[:,None], axis=1)
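For reference, here is a minimal sketch of how that one-liner could be applied one row at a time (the function name and loop are illustrative, not from the original post). Note that it still performs all the pairwise comparisons within each row, which is consistent with chunking not changing the runtime much:

import numpy as np

def higher_in_rows(matrix):
    # Apply the broadcasted comparison row by row, so the temporary
    # (n x n) boolean mask only covers one row at a time.
    out = np.empty(matrix.shape, dtype=np.int64)
    for i, row in enumerate(matrix):
        # (row > row[:, None])[k, j] is True when row[j] > row[k];
        # summing over axis=1 counts the higher values for each element.
        out[i] = np.sum(row > row[:, None], axis=1)
    return out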
Solution
I came up with the following:
import numpy as np
from scipy.stats import rankdata

def element_wise_bigger_than(x, axis):
    # Rank along the given axis, then invert: the length along that axis
    # minus the rank gives the count of strictly larger elements.
    return x.shape[axis] - rankdata(x, method='max', axis=axis)
What you want is similar to a ranking of the elements: a rank basically tells you how many items are smaller than a given element (possibly minus 1, if your ranks start at 1, 2, 3, ...). rankdata does this ranking, and can do it row-wise or column-wise. We can convert "smaller than" to "bigger than" by inverting the ranks relative to the number of items being counted in each row/column, i.e., by subtracting the ranks from the number of rows/columns.
For rows/columns without ties (e.g., the third row in your example), this works as expected. There is a slight complication with ties and how they are ranked. Through trial and error, I found that method='max' gave the desired behavior, and on reflection it makes sense: with method='max', every element (tied or not) receives a rank equal to the number of elements less than or equal to it, so subtracting that rank from the row/column length leaves exactly the number of strictly greater elements.
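A small worked example on the first row of the input (the variable names are just for illustration):

row = np.array([1, 3, 1, 5, 3])
ranks = rankdata(row, method='max')  # [2, 4, 2, 5, 4]: ties share the highest rank in their group
row.size - ranks                     # [3, 1, 3, 0, 1]: the row-wise counts from the question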
Still, it gives the desired output on the input data:
>>> x = np.array([[1, 3, 1, 5, 3], [3, 7, 3, 2, 1], [2, 3, 1, 8, 9]])
>>> element_wise_bigger_than(x, 1)
array([[3, 1, 3, 0, 1],
       [1, 0, 1, 3, 4],
       [3, 2, 4, 1, 0]])
>>> element_wise_bigger_than(x, 0)
array([[2, 1, 1, 1, 1],
       [0, 0, 0, 2, 2],
       [1, 1, 1, 0, 0]])
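Summing the two calls reproduces the "total higher values" table from the question:

>>> element_wise_bigger_than(x, 1) + element_wise_bigger_than(x, 0)
array([[5, 2, 4, 1, 2],
       [1, 0, 1, 5, 6],
       [4, 3, 5, 1, 0]])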
And it can handle your size of data in far less than an hour:
def experiment():
    x = np.random.randint(0, 10, size=(11000, 12000))
    rows = element_wise_bigger_than(x, 1)
    cols = element_wise_bigger_than(x, 0)
    output = rows + cols
    return output

%%time
experiment()
# CPU times: user 8.68 s, sys: 805 ms, total: 9.49 s
# Wall time: 9.47 s
It takes about 10 seconds. My understanding is that ranking is essentially a sorting operation, so each row/column costs O(n log n) rather than the O(n^2) of all pairwise comparisons. The answer also relies on scipy/numpy's compiled implementation of the rank/sort rather than a user-written loop or "apply" (which tends to be much slower).
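To make the sorting connection concrete, here is a sketch of a pure-numpy equivalent for the row-wise count (the function name is mine, not from the answer), using np.sort plus np.searchsorted in place of rankdata:

import numpy as np

def bigger_than_rows(x):
    # searchsorted(..., side='right') against the sorted row gives, for
    # each value v, the number of elements <= v; subtracting that from
    # the row length leaves the number of strictly greater elements.
    out = np.empty(x.shape, dtype=np.int64)
    for i, row in enumerate(x):
        s = np.sort(row)
        out[i] = row.size - np.searchsorted(s, row, side='right')
    return out

On the example input this matches element_wise_bigger_than(x, 1); the explicit Python loop over rows makes it slower than the fully vectorized rankdata call, but it shows the same sort-and-count idea.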
Answered By - Tom