Saturday, July 2, 2022

[FIXED] Pandas check if row exist in another dataframe and append index

July 02, 2022 pandas, python No comments

Issue

I'm having one problem to iterate over my dataframe. The way I'm doing is taking a long time and I don't have that many rows (I have like 300k rows)

What am I trying to do?

Check if one DF (A) contains the value of two columns of the other DF (B). You can think of this as a multiple-key field
If True, get the index of DF.B and assign to one column of DF.A
If False, two steps:

a. append to DF.B the two columns not found

b. assign the new ID to DF.A (I couldn't do this one)

This is my code, where:

df is DF.A and df_id is DF.B:
SampleID and ParentID are the two columns I am interested to check if they exist in both dataframes
Real_ID is the column to which I want to assign the id of DF.B (df_id)

for index, row in df.iterrows():
    #check if columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    
    if real_id.empty:
        #row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID','ParentID']])
    else:
        #row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index

EXAMPLE:

DF.A (df)

   Real_ID   SampleID   ParentID  Something AnotherThing
0             20          21          a          b      
1             10          11          a          b      
2             40          51          a          b

DF.B (df_id)

   SampleID   ParentID  
0    10          11         
1    20          21

Result:

   Real_ID   SampleID   ParentID  Something AnotherThing
0      1      10          11          a          b      
1      0      20          21          a          b      
2      2      40          51          a          b      


   SampleID   ParentID  
0    20          21         
1    10          11    
2    40          51

Again, this solution is very slow. I'm sure there is a better way to do this and that's why I'm asking here. Unfortunately this was what I got after some hours...

Thanks

Solution

you can do it this way:

Data (pay attention at the index in the B DF):

In [276]: cols = ['SampleID', 'ParentID']

In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b

In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21

Solution:

In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only


In [281]: B = pd.concat([B, merged.ix[merged._merge=='left_only', cols]])

In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51

In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']

In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b

Answered By - MaxU - stop genocide of UA

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, July 2, 2022

[FIXED] Pandas check if row exist in another dataframe and append index

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels