Issue
i have 2 dataframes have same columns with different len.
in [1] : df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'],
['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'],
['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
in [2] : df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'],
['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
the reuslt i want to get:
result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
and each df have length like this.
len(df_g) : 1000
len(df_n) : 5000
each dataframe has column named 'name, id, type, len'
i need to check those columns(name,type,len) in each df to compare 'id' column whether it has same value or not.
so i tried like this.
for i in g.index:
for j in n.index:
g = g.iloc[i].values
# make it to ndarray
g_Str = g[0] + g[2] + g[3]
# make it to str for pivot
n = n.iloc[j].values
n_Str = n[0] + str(n[2]) + str(n[3])
# comparing and check two df
if g_Str == n_Str and g[1] != n[1]:
print(i, j)
print(g[0])
I have above code for 2 different length DataFrame. first i tried with 'iterrows()' for comparing those two df, but it took too much time.(very slow)
i looked up for other ways to make it work better performance.
possible ways i found
option1 transform df to dict with to_dict() / compare those in nested for-loop
option2 transform df.series to ndarray / compare those in nested for-loop
is there any other better option? or any option to not using nested for-loop?
thx.
Solution
you can try merge, and if you are looking for records where ids do mismatch then the following is one way of achieving it:
r1=df_g.merge(df_n,on=['NAME', 'TYPE', 'LEN'],how='inner').query('ID_x != ID_y').rename(columns={'ID_x': 'ID'}).drop('ID_y', 1)
I have used how="inner" join, but based on need can use any of the following joins: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
Answered By - simpleApp
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.