Issue
I am currently playing with the Kaggle Titanic dataset (train.csv).
- I can load the data fine.
- I understood that some values in the Embarked column are nan. But when I tried to filter them using the following code, I got an empty result:
import pandas as pd
df = pd.read_csv(<file_loc>, header=0)
df[df.Embarked == 'nan']
I tried importing numpy.nan to replace the string 'nan' above, but that doesn't work either.
What I am trying to find is all the cells which are not 'S', 'C' or 'Q'.
I also realised later, using type(df.Embarked.unique()[-1]), that the nan is a float. Could someone help me understand how to identify those nan cells?
Solution
NaN is used to represent missing values.
- To find them, use .isna(): "Detect missing values."
- To replace them, use .fillna(value): "Fill NA/NaN values."
Some examples on a series called col:
>>> col
0 1.0
1 NaN
2 2.0
dtype: float64
>>> col[col.isna()]
1 NaN
dtype: float64
>>> col.index[col.isna()]
Int64Index([1], dtype='int64')
>>> col.fillna(-1)
0 1.0
1 -1.0
2 2.0
dtype: float64
Note that you can’t test for equality with nan, as by definition it’s not equal to anything, not even itself:
>>> np.nan == np.nan
False
This is likely the property that is used to identify nan under the hood:
>>> col != col
0 False
1 True
2 False
dtype: bool
But it’s better (more readable) to use the pandas functions than to test for inequality yourself.
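Applied to the question's Embarked column, the same idea might look like the minimal sketch below. The small DataFrame is a made-up stand-in for train.csv (the PassengerId and Embarked values are assumptions, not real data), and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic train.csv; the values are made up.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Embarked": ["S", np.nan, "C", "Q"],
})

# Rows where Embarked is missing.
missing = df[df["Embarked"].isna()]

# Rows where Embarked is anything other than 'S', 'C' or 'Q'.
# NaN is not in the list, so rows with a missing value are selected too.
unexpected = df[~df["Embarked"].isin(["S", "C", "Q"])]

# Comparing against the string 'nan' matches nothing here, which is
# why the filter in the question came back empty.
as_string = df[df["Embarked"] == "nan"]

print(missing)
print(unexpected)
print(len(as_string))
```

Here .isna() answers "which cells are nan", while ~.isin([...]) answers the broader "which cells are not 'S', 'C' or 'Q'"; on this sample both select the same single row.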
Answered By - Cimbali