Issue
I have a pandas DataFrame where I used groupby.ngroup() to identify groups of related data (basically duplicated data, but not exactly because that would have been too easy...).
DisID | BunchData | GroupID |
---|---|---|
1000 | xyz | 1 |
2012 | abc | 2 |
2014 | abc | 2 |
3000 | def | 3 |
I am trying to figure out how to remove all rows EXCEPT the max "DisID" within a GroupID, only if there exists more than one row in a GroupID. In this case, the output would look like:
DisID | BunchData | GroupID |
---|---|---|
1000 | xyz | 1 |
2014 | abc | 2 |
3000 | def | 3 |
Thanks!
Updated with a snippet of the actual code that is throwing TypeError on ~s:
docsDF['GroupID'] = docsDF.groupby(['ExecutionTimestamp',Notional','Trade_type']).ngroup()
s = docsDF.sort_values("DisseminationID").drop_duplicates(['GroupID'])
docsDF.drop(s.index[~s],inplace=True)
Error on last line is:
TypeError: bad operand type for unary ~: 'str'
Solution
Let us do sort_values
then drop_duplicates
df = df.sort_values('DisID').drop_duplicates(['GroupID'],keep='last')
Out[170]:
DisID BunchData GroupID
0 1000 xyz 1
2 2014 abc 2
3 3000 def 3
Update for the comments
s = df.sort_values('DisID').duplicated(['GroupID'])
df.drop(s.index[~s],inplace=True)
Out[183]:
DisID BunchData GroupID
2 2014 abc 2
4 2000 xyz 1
5 4024 abc 2
6 4028 abc 2
7 6000 def 3
8 4000 xyz 1
9 8048 abc 2
10 8056 abc 2
11 12000 def 3
Answered By - BENY
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.