Issue
I have a Pandas dataframe with a column called EID. It is mostly integers but there are a few non-numeric values in the column. I'm trying to remove them in the middle of a function chain.
Here are some of the errors I get in the middle of my debugging session:
(Pdb) df.dropna(subset=['EID']).query('EID.str.isnumeric()')
*** ValueError: Cannot mask with non-boolean array containing NA / NaN values
(Pdb) df.dropna(subset=['EID']).query('EID.str.isdigit()')
*** ValueError: Cannot mask with non-boolean array containing NA / NaN values
I even tried creating a new column:
(Pdb) df.dropna(subset=['EID']).assign(isnum = lambda x: x.EID.str.isdigit())
but this new column contains nothing but NaN.
How can I remove the rows where this column is non-numeric in the middle of a chain?
Edit: Sample dataset
input:
EID | Name |
---|---|
123 | Madsen,Gunnar |
ret | Greene,Richard |
465 | Stull,Matthew |
Desired output:
EID | Name |
---|---|
123 | Madsen,Gunnar |
465 | Stull,Matthew |
Solution
You can use loc to perform boolean indexing with a callable that combines pd.to_numeric and notna:
out = df.loc[lambda d: pd.to_numeric(d['EID'], errors='coerce').notna()]
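To make the mechanics concrete, here is a minimal, self-contained sketch of the same filter on the sample data from the question. It assumes the EID column is object dtype holding a mix of Python ints and strings; the question does not state the exact dtype, so treat that as an assumption.

```python
import pandas as pd

# Sample rows from the question. Storing EID as a mix of ints and strings
# is an assumption about how the real data looks.
df = pd.DataFrame({
    "EID": [123, "ret", 465],
    "Name": ["Madsen,Gunnar", "Greene,Richard", "Stull,Matthew"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN;
# notna() then gives a plain boolean mask that loc can use inside a chain.
out = df.loc[lambda d: pd.to_numeric(d["EID"], errors="coerce").notna()]
print(out)
```

Because the callable only builds a mask, the EID column itself is left unchanged; only the rows are filtered.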
Or, if you also want to take the opportunity to convert EID to numeric, use assign and dropna:
out = (df.assign(EID=lambda d: pd.to_numeric(d['EID'], errors='coerce'))
         .dropna(subset=['EID']).convert_dtypes()
      )
Output:
EID Name
0 123 Madsen,Gunnar
2 465 Stull,Matthew
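As a side note on the errors in the question: a likely explanation, assuming EID is an object column that contains real integers rather than digit strings, is that the .str accessor returns NaN for every element that is not a string, and query then refuses to mask with NaN. The sketch below illustrates this; the astype(str) variant is an alternative that is not part of the answer above, and it only accepts plain digit strings (no signs or decimals).

```python
import pandas as pd

# Assumed shape of the data: integers mixed with a stray string.
df = pd.DataFrame({
    "EID": [123, "ret", 465],
    "Name": ["Madsen,Gunnar", "Greene,Richard", "Stull,Matthew"],
})

# On an object column, .str methods yield NaN for non-string elements,
# so the integers become NaN and only "ret" gets a real boolean (False).
mask = df["EID"].str.isdigit()
print(mask)

# Alternative sketch: cast to str first so every element gets True/False.
alt = df.loc[lambda d: d["EID"].astype(str).str.isdigit()]
print(alt)
```

The pd.to_numeric approach is usually the more robust choice, since it also handles negative numbers and floats.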
Answered By - mozway