Issue
Let's say I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'peter', 'jeff', 'bill'], 'matched_name': ['mary', 'john', 'jeff', 'lisa', 'jose'], 'ratio': [78, 78, 22, 19, 45]})
print(df)
    name matched_name  ratio
0   john         mary     78
1   mary         john     78
2  peter         jeff     22
3   jeff         lisa     19
4   bill         jose     45
I want to remove duplicated rows based on the following condition: if two rows have the same values in name and matched_name once those two cells are swapped, and their ratio is also the same, then the rows are considered duplicates. Under this rule, row 0 and row 1 are duplicates, so I want to keep only row 0. How can I do this with Pandas? Thanks.
This is the expected result:
    name matched_name  ratio
0   john         mary     78
1  peter         jeff     22
2   jeff         lisa     19
3   bill         jose     45
Solution
Use np.sort to sort the values within each row, add the ratio column, and test for duplicates with DataFrame.duplicated. Finally, invert the resulting mask with ~ and filter with boolean indexing:
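import numpy as np

# Sort each name pair within its row so swapped pairs compare equal.
m = (pd.DataFrame(np.sort(df[['name', 'matched_name']], axis=1), index=df.index)
       .assign(ratio=df['ratio'])
       .duplicated())
df = df[~m]  # keep rows not flagged as duplicates
print(df)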
    name matched_name  ratio
0   john         mary     78
2  peter         jeff     22
3   jeff         lisa     19
4   bill         jose     45
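Note that the output above keeps the original index labels (0, 2, 3, 4), while the expected result in the question is numbered 0 to 3. The following is a minimal, self-contained sketch of the same approach with the index reset at the end. The names pairs, mask and result, and the column labels 'first' and 'second', are illustrative choices and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'mary', 'peter', 'jeff', 'bill'],
    'matched_name': ['mary', 'john', 'jeff', 'lisa', 'jose'],
    'ratio': [78, 78, 22, 19, 45],
})

# Sort the two name columns within each row, so that (john, mary)
# and (mary, john) both become (john, mary).
pairs = pd.DataFrame(
    np.sort(df[['name', 'matched_name']].to_numpy(), axis=1),
    index=df.index,
    columns=['first', 'second'],  # hypothetical labels, used only here
)

# A row is flagged when its sorted pair and ratio already appeared earlier.
mask = pairs.assign(ratio=df['ratio']).duplicated()

# Drop the flagged rows and renumber the index from 0.
result = df[~mask].reset_index(drop=True)
print(result)
```

Here duplicated() uses its default keep='first', so the first row of each swapped pair is the one that survives, which matches the request to keep row 0.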
Answered By - jezrael