Issue
I am currently merging two dataframes with an outer join. However, after merging, I see all the rows are duplicated even when the columns that I merged upon contain the same values.
Specifically, I have the following code.
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
Here are the two dataframes and the results.
df1
email_address name surname
0 john.smith@email.com john smith
1 john.smith@email.com john smith
2 elvis@email.com elvis presley
df2
email_address street city
0 john.smith@email.com street1 NY
1 john.smith@email.com street1 NY
2 elvis@email.com street2 LA
merged_df
email_address name surname street city
0 john.smith@email.com john smith street1 NY
1 john.smith@email.com john smith street1 NY
2 john.smith@email.com john smith street1 NY
3 john.smith@email.com john smith street1 NY
4 elvis@email.com elvis presley street2 LA
5 elvis@email.com elvis presley street2 LA
My question is, shouldn't it be like this?
This is how I would like my merged_df
to be like.
email_address name surname street city
0 john.smith@email.com john smith street1 NY
1 john.smith@email.com john smith street1 NY
2 elvis@email.com elvis presley street2 LA
Are there any ways I can achieve this?
Solution
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
The duplicate rows are expected. Each john smith in list_1
matches with each john smith in list_2
. I had to drop the duplicates in one of the lists. I chose list_2
.
Answered By - piRSquared
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.