Issue
I want to get the only row after groupby with the certain rules(
- get the tier with rank = 1 if tier != Unknown. (a001)
- Or get the tier with rank =2 (b001)
- return tier = 'no_tier' if rank = 1 and tier = Unknown (c001)
** "Unknown" would only show up once for the tier of each user_id
Here is the dataframe
user_id tier rank
a001 High 1
a001 Low 2
a001 Unknown 3
b001 Unknown 1
b001 Mid 2
c001 Unknown 1
expected result
tier
High
Mid
no_tier
I was trying to write down a function like
def get_tier(x):
if x['rank'] ==1 and x['tier'] != 'Unknown':
return x['tier']
elif x['rank'] == 2:
return x['tier']
else:
return 'no_tier'
df.goupby('user_id').apply(lambda x : x.apply(get_tier), axis = 1).iloc[0]).reset_index()
but seems like it cant work
Solution
Given your conditions and assuming that the ranks are already sorted in ascending order (per group or globally), you can just replace unknown with NA, get the first non-NA value with groupby.first
, then fillna
with :
out = (df
#.sort_values(by='rank') # uncomment if not originally sorted
['tier']
.replace('Unknown', pd.NA)
.groupby(df['user_id']).first()
.fillna('no_tier')
)
Output:
user_id
a001 High
b001 Mid
c001 no_tier
Name: tier, dtype: object
If you rather want the original rows, you can go with a groupby.idxmin
to the the min rank per group after masking the Unknown ranks with large number (e.g. Inf):
out = (df.loc[df['rank'].mask(df['tier'].eq('Unknown'), float('inf'))
.groupby(df['user_id']).idxmin()]
.replace({'tier': {'Unknown': 'no_tier'}})
)
Output:
user_id tier rank
0 a001 High 1
4 b001 Mid 2
5 c001 no_tier 1
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.