Issue
I have a DataFrame, dfAB:
import numpy as np  # needed for np.nan further down
import pandas as pd
import random
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
dfAB = pd.DataFrame({ 'A': A, 'B': B })
dfAB
I want to know the 75th percentile of each column, so I call the quantile function:
dfAB.quantile(0.75)
But if I now put some NaNs into dfAB and call the function again, the result is obviously different:
dfAB.loc[5:8]=np.nan
dfAB.quantile(0.75)
When I calculated the mean of dfAB, I passed skipna to ignore the NaNs because I didn't want them affecting my stats (I have quite a few in my data on purpose, and replacing them with zero obviously doesn't help):
dfAB.mean(skipna=True)
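As a small illustration of why zero-filling is not a substitute for skipping, here is a minimal sketch on a made-up Series (the values and variable names are purely illustrative and not taken from dfAB):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two real values and two missing ones.
s = pd.Series([10.0, 20.0, np.nan, np.nan])

# NaNs are skipped, so only 10 and 20 contribute: (10 + 20) / 2
mean_skip = s.mean(skipna=True)

# NaNs replaced by zero drag the mean down: (10 + 20 + 0 + 0) / 4
mean_zero = s.fillna(0).mean()

# With skipna=False a single NaN makes the whole result NaN.
mean_noskip = s.mean(skipna=False)

print(mean_skip, mean_zero, mean_noskip)
```

Note that skipna=True is already the default for mean, so passing it only makes the intent explicit.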
So what I'm getting at is whether, and how, the quantile function handles NaNs.
Solution
Yes, DataFrame.quantile appears to ignore NaN values when computing its result. To illustrate, you can compare its output to np.nanpercentile, which explicitly computes "the qth percentile of the data along the specified axis, while ignoring nan values" (quoted from the docs):
>>> dfAB
      A     B
0   5.0  10.0
1  43.0  67.0
2  86.0   2.0
3  61.0  83.0
4   2.0  27.0
5   NaN   NaN
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN
9  27.0  70.0
>>> dfAB.quantile(0.75)
A    56.50
B    69.25
Name: 0.75, dtype: float64
>>> np.nanpercentile(dfAB, 75, axis=0)
array([56.5 , 69.25])
The two results are identical, so quantile is skipping the NaN rows rather than counting them.
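The comparison can also be written as a self-contained script. This is only a sketch: the values are hard-coded from the frame printed above so the result is reproducible, and the names q_default, q_dropna and q_numpy are made up for illustration. It adds a third check, dropping the NaNs from each column by hand before taking the quantile.

```python
import numpy as np
import pandas as pd

# Values copied from the printed dfAB above (rows 5-8 are NaN).
dfAB = pd.DataFrame({
    "A": [5, 43, 86, 61, 2, np.nan, np.nan, np.nan, np.nan, 27],
    "B": [10, 67, 2, 83, 27, np.nan, np.nan, np.nan, np.nan, 70],
})

# 1. Plain quantile on the frame that still contains NaNs.
q_default = dfAB.quantile(0.75)

# 2. Drop the NaNs from each column explicitly, then take the quantile.
q_dropna = dfAB.apply(lambda col: col.dropna().quantile(0.75))

# 3. NumPy's NaN-ignoring percentile, column by column.
q_numpy = np.nanpercentile(dfAB.to_numpy(), 75, axis=0)

print(q_default)
print(q_dropna)
print(q_numpy)
```

If all three agree, as they should for this data, it supports the reading that quantile drops NaNs per column before computing the percentile.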
Answered By - sacuL