Issue
I have a pandas dataframe like this:
thing1 thing2 num_col1 num_col2 num_col3 num_col4
aaa abc 0.0 99.76 101.1 111.2
bbb cde 11.3 109.76 201.1 121.2
ccc def 91.3 0.0 301.1 131.2
.....
.....
aaa efg 0.0 0.0 401.1 141.2
bbb fgh 41.3 299.76 0.0 151.2
ccc ghi 201.3 199.76 601.1 161.2
and I want to convert it to this:
thing1 thing2 num_col1 num_col2 num_col3 num_col4
aaa abc nan 99.76 101.1 111.2
bbb cde 11.3 109.76 201.1 121.2
ccc def 91.3 0.0 301.1 131.2
.....
.....
aaa efg nan nan 401.1 141.2
bbb fgh 41.3 299.76 0.0 151.2
ccc ghi 201.3 199.76 601.1 161.2
Explanation: if the values in columns num_col1, num_col2, num_col3 and num_col4 start with 0 (i.e. a row has leading 0s in these columns), those 0s need to be converted to numpy.nan, while any 0s that come after a non-zero value shouldn't be touched. How do I do this as efficiently as possible (using built-in functions rather than Python loops)?
Solution
To flag the leading 0s, the usual trick is a cumsum on the negated condition. Something like this:
mask = (df.filter(like='num_')   # extract the numeric columns; adjust the pattern if needed
          .ne(0)                 # True where the value is not 0
          .cumsum(axis=1)        # running count of non-zero values across each row
          .eq(0)                 # count still 0 => only 0s seen so far => leading zeros
       )
Then you would have the mask as:
num_col1 num_col2 num_col3 num_col4
0 True False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 True True False False
6 False False False False
7 False False False False
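To see why the chain only flags the leading zeros, here is a small self-contained sketch (using a hypothetical two-row frame) that prints each intermediate step:

import pandas as pd

# One row starting with zeros, one row with a zero in the middle
demo = pd.DataFrame({'num_col1': [0.0, 91.3],
                     'num_col2': [0.0, 0.0],
                     'num_col3': [401.1, 301.1]})

nonzero = demo.ne(0)              # True where the value is not 0
running = nonzero.cumsum(axis=1)  # running count of non-zero cells, left to right
print(running)                    # row 0: 0 0 1, row 1: 1 1 2
print(running.eq(0))              # True only while the count is still 0, i.e. leading zeros

The zero in the middle of the second row never satisfies eq(0), because a non-zero value has already been counted before it.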
Finally, you can use mask to mask your data:
df[mask.columns] = df[mask.columns].mask(mask)
And you get (notice the 0 at row 5, num_col4, which is left untouched):
thing1 thing2 num_col1 num_col2 num_col3 num_col4
0 aaa abc NaN 99.76 101.1 111.2
1 bbb cde 11.3 109.76 201.1 121.2
2 ccc def 91.3 0.00 301.1 131.2
3 ..... None NaN NaN NaN NaN
4 ..... None NaN NaN NaN NaN
5 aaa efg NaN NaN 401.1 0.0
6 bbb fgh 41.3 299.76 0.0 151.2
7 ccc ghi 201.3 199.76 601.1 161.2
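As a side note, DataFrame.mask replaces the cells where the boolean frame is True with NaN by default, so the same result could also be written with its counterpart where (a sketch assuming the same mask as above):

df[mask.columns] = df[mask.columns].where(~mask)   # keep values where the mask is False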
Answered By - Quang Hoang