Issue
I have a dataframe df2 in which one column (synonym) holds several values separated by a delimiter (| in the example below). I would like to join df2 with another dataframe df1 on label first and, if no match is found, fall back to the synonym column. The synonym column can hold more than two values.
df1
label
Werner syndrome
Ageing
df2
id  label              synonym
1   Werner's syndrome  Werner syndrome|Werner disease
2   Ageing
Expected output is:
df
label            id  label              synonym
Werner syndrome  1   Werner's syndrome  Werner syndrome|Werner disease
Ageing           2   Ageing
Any help is highly appreciated.
Solution
You can achieve this by first expanding the synonym column in df2 into multiple rows, then performing a join operation on df1 and df2 using both label and synonym columns.
Here's how you can do it:
import pandas as pd
# Assuming df1 and df2 are your dataframes
# Candidate keys taken from df2's own label column (these take priority)
by_label = df2.assign(key=df2['label'])
# Candidate keys taken from the synonym column: split on '|', one row per synonym
by_synonym = df2.assign(key=df2['synonym'].str.split('|')).explode('key')
# Stack both; label rows come first, so drop_duplicates keeps a label match over a synonym match
lookup = pd.concat([by_label, by_synonym]).dropna(subset=['key']).drop_duplicates(subset='key')
# Perform the join operation, then drop the helper column
df = df1.merge(lookup, how='left', left_on='label', right_on='key', suffixes=('', '_df2')).drop(columns='key')
In this code, by_label uses each df2 label as a lookup key, and by_synonym splits the synonym column on | and explodes it so that every synonym becomes a key of its own, while the original synonym string stays intact in the synonym column. pd.concat stacks the label keys ahead of the synonym keys, dropna discards rows that have no synonym, and drop_duplicates keeps the first row per key, so a label match takes precedence over a synonym match. The final left merge joins the label column of df1 to that key. Because both dataframes have a label column, the one from df2 appears as label_df2 in the result, and rows of df1 with no match get NaN in the df2 columns.
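To try the approach end to end, here is a minimal sketch that rebuilds the sample data from the question and wraps the steps in a helper. The function name join_label_then_synonym, the helper column key, the suffix _df2, and the extra unmatched row "Progeria" are assumptions made for illustration; they do not come from the original post.

```python
import pandas as pd


def join_label_then_synonym(df1, df2, sep="|"):
    """Left-join df1 to df2 on label, falling back to df2's synonyms.

    Hypothetical helper: `sep` is assumed to be a single character.
    """
    # Keys taken from df2's label column; stacked first so they win ties.
    by_label = df2.assign(key=df2["label"])
    # Keys taken from the synonym column, one row per individual synonym.
    by_synonym = df2.assign(key=df2["synonym"].str.split(sep)).explode("key")
    lookup = (
        pd.concat([by_label, by_synonym])
        .dropna(subset=["key"])          # rows without any synonym
        .drop_duplicates(subset="key")   # first hit per key = label match if any
    )
    merged = df1.merge(
        lookup, how="left", left_on="label", right_on="key",
        suffixes=("", "_df2"),           # df2's label becomes label_df2
    )
    return merged.drop(columns="key")


# Sample data from the question; "Progeria" is an added, hypothetical
# row that matches nothing in df2.
df1 = pd.DataFrame({"label": ["Werner syndrome", "Ageing", "Progeria"]})
df2 = pd.DataFrame({
    "id": [1, 2],
    "label": ["Werner's syndrome", "Ageing"],
    "synonym": ["Werner syndrome|Werner disease", None],
})

result = join_label_then_synonym(df1, df2)
print(result)
```

"Werner syndrome" is resolved through a synonym and "Ageing" through the label itself, while the unmatched row keeps NaN in the df2 columns. Since id then contains a missing value, pandas stores it as float.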
Answered By - Luc SIGIER