Issue
Fellow contributors,
I would like to check whether a set of specific keywords exists in a grouped pandas DataFrame. The words I would like to check for are start, pending and either finished or almost_finished. I would like to define a custom function for this and apply it to a pandas groupby, because defining a function to apply to columns is less clear to me than row-wise operations, where we address every row with row[colname].
In this example, if the desired sequence of words exists, I would like the last value in the number column for each ID to be copied into a new column; it doesn't matter if the values before it are empty strings. Here is a reproducible example:
import pandas as pd
df = pd.DataFrame({'ID' : [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
'number' : ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
'status' : ['start', 'pending', 'finished', 'start', 'pending', 'partially_finished', 'start', 'pending']})
In this case the last group, ID == 1300, has no return value.
Basically, I am asking this question to learn the best approach for this kind of problem, where you need to check for certain values in a column. Since I am coming from R, I need to familiarize myself with how to do the same thing in Python. I would also appreciate any better solution you may suggest.
Thank you very much in advance.
Solution
You can aggregate with set and use intersection to check.
But first, I would map partially_finished or almost_finished to finished, if these should be treated equally.
df['status'] = df.status.replace('partially_finished|almost_finished', 'finished', regex=True)
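If the variants are known exactly, the same mapping can be done without a regular expression by passing a list of values to replace. This is a minimal sketch on a small stand-alone Series; the names status and normalized are only illustrative.

```python
import pandas as pd

# Stand-alone copy of the status column from the question.
status = pd.Series(['start', 'pending', 'finished',
                    'start', 'pending', 'partially_finished',
                    'start', 'pending'])

# Replace each listed value with 'finished'; only exact matches are touched,
# so no regex is involved.
normalized = status.replace(['partially_finished', 'almost_finished'], 'finished')

print(normalized.tolist())
```

An exact-match list avoids accidentally rewriting part of a longer string, which a regex pattern could do if it matched only a substring.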
Next, aggregate number to its last value and status to a set; then I use intersection to check whether all of the required values exist in status.
checkcriteria = {'start', 'pending', 'finished'}
df = df.groupby('ID').agg({'number': 'last', 'status': set})
df['check'] = df.status.transform(lambda x: len(x.intersection(checkcriteria)) == 3)
This should give the following result:
     number                      status  check
ID
1100     No  {start, pending, finished}   True
1200     No  {start, pending, finished}   True
1300     No            {start, pending}  False
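The check above compares the size of the intersection with 3. An equivalent way to express it, which does not depend on the number of required words, is a subset test. The following is a self-contained sketch of that variant; the names required and summary are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

# Treat the "finished" variants as equal, as in the step above.
df['status'] = df['status'].replace(
    'partially_finished|almost_finished', 'finished', regex=True)

required = {'start', 'pending', 'finished'}

# One row per ID: last value of number, and the set of statuses seen.
summary = df.groupby('ID').agg({'number': 'last', 'status': set})

# True when every required word is present in the group's status set.
summary['check'] = summary['status'].apply(required.issubset)

print(summary)
```

With a subset test, adding or removing a required word only means editing the required set.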
You can either filter by check, or mask and remove the value for number.
# This will only return ID == 1100, 1200
df[df.check]
# OR mask to remove the number value for when check == False
df.loc[~df.check, 'number'] = None
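Since the question asks for a custom function applied to a groupby, with the result copied into a new column on the original rows, here is one possible sketch of that approach. The function name last_number_if_complete, the column name last_number, and the choice to accept partially_finished as a final status are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

REQUIRED = {'start', 'pending'}
# Assumption: any of these counts as the final step.
FINAL = {'finished', 'almost_finished', 'partially_finished'}


def last_number_if_complete(group):
    """Return the group's last number if all required statuses occur, else None."""
    seen = set(group['status'])
    if REQUIRED <= seen and seen & FINAL:
        return group['number'].iloc[-1]
    return None


# The function receives one sub-DataFrame per ID and returns a scalar,
# so the result is a Series indexed by ID.
per_id = df.groupby('ID')[['number', 'status']].apply(last_number_if_complete)

# Copy the per-ID value back onto every row of the original frame.
df['last_number'] = df['ID'].map(per_id)

print(df)
```

Inside the function, the group behaves like an ordinary DataFrame, so columns are addressed as group[colname], much like row[colname] in a row-wise function. Rows for ID == 1300 end up with a missing value in last_number.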
Answered By - Emma