Issue
I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried looping through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without raising any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, Python checks whether the item is NaN but gets back False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
Solution
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False because, under the IEEE 754 standard, a NaN compares unequal to everything, including itself:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
Use math.isnan to check for NaNs:
>>> import math
>>> math.isnan(x)
True
Or numpy.isnan. So in your case use:
numpy.isnan(X['Smokinginpregnancy'][0])
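To make the difference concrete, here is a minimal sketch comparing the equality test with the dedicated NaN checks. The Series below is a hypothetical stand-in for the column in the question (only the first three values are copied from it), and pd.isna / Series.isna are pandas helpers not mentioned in the original answer that I am assuming are acceptable here.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-in for X['Smokinginpregnancy'] from the question.
s = pd.Series(
    [np.nan, 5.216126, 10.287496],
    index=["E09000002", "E09000003", "E09000004"],
    name="Smokinginpregnancy",
)

first = s.iloc[0]

# Equality never detects NaN: NaN != NaN by definition.
eq_result = first == np.nan

# Dedicated checks do detect it.
isnan_math = math.isnan(first)   # standard library, scalars only
isnan_np = np.isnan(first)       # NumPy, scalars or arrays
isna_pd = pd.isna(first)         # pandas, also handles None

# For a whole column, build a boolean mask instead of looping over cells.
mask = s.isna()
n_missing = int(mask.sum())

print(eq_result, isnan_math, isnan_np, isna_pd, n_missing)
```

The mask form is usually the more convenient one in pandas, since it avoids checking cells one at a time.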
Regarding fillna, note that this method returns a new, filled DataFrame rather than modifying the original. Maybe you did something like:
X.fillna(...)
without reassigning the result to X? Alternatively, pass inplace=True to mutate the DataFrame on which you are calling the method.
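A small sketch of both options, under some assumptions: the DataFrame is a hypothetical reconstruction of the one in the question, and the fill value 0.0 is arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the DataFrame from the question.
X = pd.DataFrame(
    {"Smokinginpregnancy": [np.nan, 5.216126, 10.287496]},
    index=["E09000002", "E09000003", "E09000004"],
)
Y = X.copy()  # second copy, used for the inplace variant

# Calling fillna and discarding the result leaves X untouched.
X.fillna(0.0)
missing_before = int(X["Smokinginpregnancy"].isna().sum())

# Option 1: reassign the returned DataFrame.
X = X.fillna(0.0)
missing_after_reassign = int(X["Smokinginpregnancy"].isna().sum())

# Option 2: mutate the DataFrame in place (the call returns None).
Y.fillna(0.0, inplace=True)
missing_after_inplace = int(Y["Smokinginpregnancy"].isna().sum())

print(missing_before, missing_after_reassign, missing_after_inplace)
```

Reassigning is generally the easier form to read, since the call with inplace=True returns None and cannot be chained.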
Answered By - Bakuriu