Sunday, January 16, 2022

[FIXED] How to compare the difference of two sets of columns in a dataframe in an efficient and dynamic way

January 16, 2022 pandas, python, python-3.x No comments

Issue

I have two dataframes df1 and df2 with the same set of columns. I merged 2 dfs together and want to

calculate the score diff for score1 and score2
see if the region column is the same.

The desired result would be having 'score1_diff', 'socre2_diff' and 'regional_diff' in the df_final (as shown in the attached picture). I created 'score1_diff', 'score2_diff' and 'regional_diff' columns in the df_final using the codes shown below.

However, in my read dataframe, I have over 30+ score columns and over 10+ region columns, there will be more score columns and region columns being added to the dataframe from time to time. In stead of creating these columns one by one, what would be an efficient or dynamic way to achieve the same result?

import pandas as pd 
pd.set_option('display.max_columns', None)

df1 = { 'Name':['George','Andrea','micheal','Ann',
            'maggie','Ravi','Xien','Jalpa'],
   'region':['a','a','a','a','b','b','b','b'],
    'score1':[63,42,55,70,38,77,86,99],
    'score2':[45,74,44,89,69,49,72,98]}

df2 = { 'Name':['George','Andrea','micheal', 'Matt',
            'maggie','Ravi','Xien','Jalpa'],
   'region':['a','b','a','a','a','b','b','a'],
    'score1':[62,47,55,74,32,77,86,77],
    'score2':[45,78,44,89,66,49,72,73]}

df1=pd.DataFrame(df1)

df2=pd.DataFrame(df2)

df_all = pd.merge(df1,df2,how='outer',indicator=True, on='Name',suffixes=('_df1','_df2'))

df_final=df_all.copy()
df_final['score1_diff']=df_final['score1_df1']-df_final['score1_df2']
df_final['score2_diff']=df_final['score2_df1']-df_final['score2_df2']
df_final['regional_diff']=df_final['region_df1']==df_final['region_df2']

Thanks

Solution

By taking Jonathan's idea, below is the solution that works for me. The general idea is to create 2 df with all columns which have the suffix of _df1 and _df2, respectively, then replace the suffix with _diff in order to substract one df from another.

This approach is identical to Jonathan's solution only except I filter all columns with specific naming convention rather than using index with iloc.

df_final=df_all.copy()

num_current= [itm + '_df1' for itm in num_col]
num_previous =[itm + '_df2' for itm in num_col]


dfx = df_final[num_current]
dfx.columns = dfx.columns.str.replace('_df1', '_diff')

dfy = df_final[num_previous]
dfy.columns = dfy.columns.str.replace('_df2', '_diff')

df_diff = dfx - dfy

df_all = pd.concat([df_final, df_diff], axis=1)

Answered By - user032020

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, January 16, 2022

[FIXED] How to compare the difference of two sets of columns in a dataframe in an efficient and dynamic way

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels