Tuesday, May 17, 2022

[FIXED] Checking if a set of strings exist in a column with a custom function

Labels: dataframe, pandas, python-3.x

Issue

Fellow contributors,

I would like to check whether a set of specific keywords exists within each group of a grouped pandas DataFrame. The words I would like to check for are start, pending and either finished or almost_finished. I would like to define a custom function for this and apply it to a pandas groupby, because defining a function that operates on columns is less clear to me than row-wise operations, where every row is addressed with row[colname]. In this example, if the desired words all exist for an ID, I would like the last value in the number column for that ID to be copied into a new column; it doesn't matter if the values before it are empty strings. Here is a reproducible example:

import pandas as pd

df = pd.DataFrame({'ID' : [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
                  'number' : ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
                  'status' : ['start', 'pending', 'finished', 'start', 'pending', 'partially_finished', 'start', 'pending']})

In this case the last group, ID == 1300, has no return value. I am mainly asking this question to learn the best approach for this kind of problem, where you need to check for certain values in a column. Since I am coming from R, I need to familiarize myself with how the same thing is done in Python. I would also appreciate any better solution you may suggest. Thank you very much in advance.

Solution

You can aggregate with set and use intersection to check.

But first, I would map partially_finished or almost_finished to finished, if these should be treated equally.

df['status'] = df.status.replace('partially_finished|almost_finished', 'finished', regex=True)
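
To see what this replacement does in isolation, here is a small sketch on a standalone Series. The sample values and the names by_regex and by_dict are made up for illustration, and the dict form is an alternative I am assuming you might prefer when only whole values should be rewritten, since the regex form also rewrites matches inside longer strings.

```python
import pandas as pd

# Illustrative values only; not the question's DataFrame.
status = pd.Series(['start', 'pending', 'partially_finished',
                    'almost_finished', 'finished'])

# Regex alternation, as in the line above: every match becomes 'finished'.
by_regex = status.replace('partially_finished|almost_finished',
                          'finished', regex=True)

# Assumed alternative: a dict replaces exact, whole values only.
by_dict = status.replace({'partially_finished': 'finished',
                          'almost_finished': 'finished'})

print(by_regex.tolist())
print(by_dict.tolist())
```

On this input both forms produce the same list, with the last three entries normalized to finished.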

Next, aggregate number to its last value and status to a set, then use intersection to check whether all of the required values exist in status.

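# The three statuses every ID must contain
checkcriteria = {'start', 'pending', 'finished'}
# One row per ID: the last number, plus the set of statuses seen
df = df.groupby('ID').agg({'number': 'last', 'status': set})
# True only when all three required statuses are in the group's set
df['check'] = df.status.transform(lambda x: len(x.intersection(checkcriteria)) == 3)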
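
The same check can probably be written without counting the intersection, by asking whether the required set is a subset of each group's set. This is a sketch under that assumption, rebuilt from the question's data so it runs on its own; the names required and grouped are mine, not the answer's.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

# Normalize the variants first, as in the answer.
df['status'] = df['status'].replace('partially_finished|almost_finished',
                                    'finished', regex=True)

required = {'start', 'pending', 'finished'}

# One row per ID: last number and the set of statuses seen for that ID.
grouped = df.groupby('ID').agg({'number': 'last', 'status': set})

# set.issubset is called once per group: is every required word present?
grouped['check'] = grouped['status'].map(required.issubset)

print(grouped['check'])
```

Using issubset avoids hard-coding the number 3, so the check keeps working if the required set later gains or loses a word.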

This should give a result like:

     number                      status  check
ID
1100     No  {start, pending, finished}   True
1200     No  {start, pending, finished}   True
1300     No            {start, pending}  False

You can then either filter by check, or mask the rows where check is False and clear their value of number.

# This will only return ID == 1100, 1200
df[df.check]

# OR mask to remove the number value for when check == False
df.loc[~df.check, 'number'] = None
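
If you would rather keep the original rows and add a new column, as the question describes, one possible route is a custom function passed to groupby. The following is a minimal sketch, not the approach above: the helper last_number_if_complete, the constants REQUIRED and FINAL, and the column name last_number are all hypothetical, and it assumes any of the three finished variants counts as finished.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

REQUIRED = {'start', 'pending'}
# Assumption: any one of these counts as the finishing status.
FINAL = {'finished', 'almost_finished', 'partially_finished'}

def last_number_if_complete(group):
    """Hypothetical helper: receives one ID's rows as a DataFrame."""
    seen = set(group['status'])
    complete = REQUIRED <= seen and bool(FINAL & seen)
    # Last value of number for a complete group, otherwise missing.
    return group['number'].iloc[-1] if complete else None

# One value per ID; selecting the columns keeps the key out of the group.
per_id = df.groupby('ID')[['number', 'status']].apply(last_number_if_complete)

# Broadcast the per-ID value back onto every original row.
df['last_number'] = df['ID'].map(per_id)

print(df)
```

Inside the function, group behaves like an ordinary DataFrame, so column access works the same way as on the full frame; this is the main difference from row-wise functions, which receive a single row.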

Answered By - Emma

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
