Issue
In the following example, I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])
df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
'Year': [2000, 2000, 2003, 2004, 2004, 2002, 2001, 2003], 'Boolean':['false', 'true', 'true', 'true', 'false', 'true', 'true', 'true']})
print(df)
# Output
id Year Boolean
0 1 2000 false
1 1 2000 true
2 1 2003 true
3 2 2004 true
4 2 2004 false
5 3 2002 true
6 4 2001 true
7 4 2003 true
# Attempted code
df2 = df.loc[df['Year'].eq('false').groupby([df.id]).idxmax()]
print(df2)
id Year Boolean
0 1 2000 false
3 2 2004 true
5 3 2002 true
6 4 2001 true
This code is incorrect as I am trying to keep the Boolean == false observations when there is a duplicate. It is also dropping other observations that are not duplicates. Not sure if it is easier to use the duplicate function for this.
Solution
"I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])" -> so you need to group by both ID and Year (and use the correct column as boolean source):
df.loc[df['Boolean'].eq('false').groupby([df['id'], df['Year']]).idxmax()]
output:
id Year Boolean
0 1 2000 false
2 1 2003 true
4 2 2004 false
5 3 2002 true
6 4 2001 true
7 4 2003 true
booleans
Note that it would be much better to use real booleans (faster, less memory, shorter syntax…)
# convert to booleans
df['Boolean'] = df['Boolean'].eq('true')
df.loc[df.groupby(['id', 'Year'])['Boolean'].idxmin()]
id Year Boolean
0 1 2000 False
2 1 2003 True
4 2 2004 False
5 3 2002 True
6 4 2001 True
7 4 2003 True
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.