Issue
I have a "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v<x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
- I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
- I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Solution
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
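To make the broadcasting step easier to follow, here is a minimal sketch that spells out the intermediate shapes. The seeded generator and the names mask and ranks are illustrative assumptions, not part of the original session, so the numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, assumed here only for reproducibility
v = rng.random(100)              # reference population, shape (100,)
x = np.array([0.3, 0.4, 0.7])    # values to rank, shape (3,)

# x[:, None] has shape (3, 1); comparing it with v of shape (100,)
# broadcasts to a (3, 100) boolean array, one row per query value.
mask = v < x[:, None]

# The mean of each boolean row is the fraction of v strictly below that query.
ranks = mask.mean(axis=1)        # shape (3,)
print(mask.shape, ranks.shape)
```

Each row of mask answers "which reference values are below this query?", so the row mean is the same quantity that percentile_rank computes for a single value.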
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
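The broadcasting approach builds a boolean array of size len(x) by len(v), which can get large when both inputs are big. As a possible alternative (not part of the answer above), the reference population can be sorted once and np.searchsorted used to count how many values fall strictly below each query. The function name percentile_ranks and its parameter names are hypothetical, chosen only for this sketch.

```python
import numpy as np

def percentile_ranks(reference, queries):
    """Fraction of `reference` strictly below each value in `queries`."""
    ref_sorted = np.sort(reference)
    # With side="left", searchsorted returns, for each query, the number of
    # sorted elements that are strictly less than it.
    counts = np.searchsorted(ref_sorted, queries, side="left")
    return counts / ref_sorted.size

rng = np.random.default_rng(0)   # arbitrary seed, assumed for reproducibility
v = rng.random(100)
x = np.array([0.3, 0.4, 0.7])

ranks_sorted = percentile_ranks(v, x)
ranks_broadcast = (v < x[:, None]).mean(axis=1)
print(np.allclose(ranks_sorted, ranks_broadcast))
```

Both versions use a strict "less than" comparison, so they should agree on the same data; the sorted version trades the large temporary array for a one-time sort.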
Answered By - MaxU - stop genocide of UA