Issue
I have a pandas dataframe like this:
thing1 thing2 num_col1 num_col2 num_col3 num_col4
aaa abc 0.0 99.76 101.1 111.2
bbb cde 11.3 109.76 201.1 121.2
ccc def 91.3 0.0 301.1 131.2
.....
.....
aaa efg 0.0 0.0 401.1 141.2
bbb fgh 41.3 299.76 0.0 151.2
ccc ghi 201.3 199.76 601.1 161.2
and I want to convert it to this:
thing1 thing2 num_col1 num_col2 num_col3 num_col4
aaa abc nan 99.76 101.1 111.2
bbb cde 11.3 109.76 201.1 121.2
ccc def 91.3 0.0 301.1 131.2
.....
.....
aaa efg nan nan 401.1 141.2
bbb fgh 41.3 299.76 0.0 151.2
ccc ghi 201.3 199.76 601.1 161.2
Explanation: if the values in columns num_col1, num_col2, num_col3 and num_col4 start with 0 (i.e. a row has leading 0s in these columns), those 0s need to be converted to numpy.nan, while any 0s that come after a non-zero value shouldn't be touched. How do I do this as efficiently as possible (using built-in functions rather than Python loops)?
Solution
To flag the leading 0s, the usual trick is a cumsum on the negated condition. Something like this:
mask = (df.filter(like='num_')   # extract the numeric columns; adjust the pattern if needed
          .ne(0)                 # True where the value is not 0
          .cumsum(axis=1)        # running count of non-zero values across each row
          .eq(0)                 # count still 0 => only 0s seen so far => leading zeros
       )
Then you would have the mask as:
num_col1 num_col2 num_col3 num_col4
0 True False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 True True False False
6 False False False False
7 False False False False
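To see why the chain only flags the leading zeros, here is a small self-contained sketch (using a hypothetical two-row frame) that prints each intermediate step:

import pandas as pd

# One row starting with zeros, one row with a zero in the middle
demo = pd.DataFrame({'num_col1': [0.0, 91.3],
                     'num_col2': [0.0, 0.0],
                     'num_col3': [401.1, 301.1]})

nonzero = demo.ne(0)              # True where the value is not 0
running = nonzero.cumsum(axis=1)  # running count of non-zero cells, left to right
print(running)                    # row 0: 0 0 1, row 1: 1 1 2
print(running.eq(0))              # True only while the count is still 0, i.e. leading zeros

The zero in the middle of the second row never satisfies eq(0), because a non-zero value has already been counted before it.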
Finally, you can use mask to mask your data:
df[mask.columns] = df[mask.columns].mask(mask)
And you get (notice the 0 at row 5, num_col4, which is left untouched):
thing1 thing2 num_col1 num_col2 num_col3 num_col4
0 aaa abc NaN 99.76 101.1 111.2
1 bbb cde 11.3 109.76 201.1 121.2
2 ccc def 91.3 0.00 301.1 131.2
3 ..... None NaN NaN NaN NaN
4 ..... None NaN NaN NaN NaN
5 aaa efg NaN NaN 401.1 0.0
6 bbb fgh 41.3 299.76 0.0 151.2
7 ccc ghi 201.3 199.76 601.1 161.2
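As a side note, DataFrame.mask replaces the cells where the boolean frame is True with NaN by default, so the same result could also be written with its counterpart where (a sketch assuming the same mask as above):

df[mask.columns] = df[mask.columns].where(~mask)   # keep values where the mask is False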
Answered By - Quang Hoang