Issue
I have two dataframes df1 and df2 with the same set of columns. I merged 2 dfs together and want to
- calculate the score diff for score1 and score2
- see if the region column is the same.
The desired result would be having 'score1_diff', 'socre2_diff' and 'regional_diff' in the df_final (as shown in the attached picture). I created 'score1_diff', 'score2_diff' and 'regional_diff' columns in the df_final using the codes shown below.
However, in my read dataframe, I have over 30+ score columns and over 10+ region columns, there will be more score columns and region columns being added to the dataframe from time to time. In stead of creating these columns one by one, what would be an efficient or dynamic way to achieve the same result?
import pandas as pd
pd.set_option('display.max_columns', None)
df1 = { 'Name':['George','Andrea','micheal','Ann',
'maggie','Ravi','Xien','Jalpa'],
'region':['a','a','a','a','b','b','b','b'],
'score1':[63,42,55,70,38,77,86,99],
'score2':[45,74,44,89,69,49,72,98]}
df2 = { 'Name':['George','Andrea','micheal', 'Matt',
'maggie','Ravi','Xien','Jalpa'],
'region':['a','b','a','a','a','b','b','a'],
'score1':[62,47,55,74,32,77,86,77],
'score2':[45,78,44,89,66,49,72,73]}
df1=pd.DataFrame(df1)
df2=pd.DataFrame(df2)
df_all = pd.merge(df1,df2,how='outer',indicator=True, on='Name',suffixes=('_df1','_df2'))
df_final=df_all.copy()
df_final['score1_diff']=df_final['score1_df1']-df_final['score1_df2']
df_final['score2_diff']=df_final['score2_df1']-df_final['score2_df2']
df_final['regional_diff']=df_final['region_df1']==df_final['region_df2']
Solution
By taking Jonathan's idea, below is the solution that works for me. The general idea is to create 2 df with all columns which have the suffix of _df1 and _df2, respectively, then replace the suffix with _diff in order to substract one df from another.
This approach is identical to Jonathan's solution only except I filter all columns with specific naming convention rather than using index with iloc.
df_final=df_all.copy()
num_current= [itm + '_df1' for itm in num_col]
num_previous =[itm + '_df2' for itm in num_col]
dfx = df_final[num_current]
dfx.columns = dfx.columns.str.replace('_df1', '_diff')
dfy = df_final[num_previous]
dfy.columns = dfy.columns.str.replace('_df2', '_diff')
df_diff = dfx - dfy
df_all = pd.concat([df_final, df_diff], axis=1)
Answered By - user032020
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.