Issue
I'm having one problem to iterate over my dataframe. The way I'm doing is taking a long time and I don't have that many rows (I have like 300k rows)
What am I trying to do?
Check if one DF (A) contains the value of two columns of the other DF (B). You can think of this as a multiple-key field
If True, get the index of
DF.B
and assign to one column ofDF.A
If False, two steps:
a. append to DF.B
the two columns not found
b. assign the new ID to DF.A
(I couldn't do this one)
This is my code, where:
df is
DF.A
and df_id isDF.B
:SampleID and ParentID are the two columns I am interested to check if they exist in both dataframes
Real_ID is the column to which I want to assign the id of
DF.B
(df_id)
for index, row in df.iterrows():
#check if columns exist in the other dataframe
real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
if real_id.empty:
#row does not exist, append to df_id
df_id = df_id.append(row[['SampleID','ParentID']])
else:
#row exists, assign id of df_id to df
row['Real_ID'] = real_id.index
EXAMPLE:
DF.A (df)
Real_ID SampleID ParentID Something AnotherThing
0 20 21 a b
1 10 11 a b
2 40 51 a b
DF.B (df_id)
SampleID ParentID
0 10 11
1 20 21
Result:
Real_ID SampleID ParentID Something AnotherThing
0 1 10 11 a b
1 0 20 21 a b
2 2 40 51 a b
SampleID ParentID
0 20 21
1 10 11
2 40 51
Again, this solution is very slow. I'm sure there is a better way to do this and that's why I'm asking here. Unfortunately this was what I got after some hours...
Thanks
Solution
you can do it this way:
Data (pay attention at the index in the B
DF):
In [276]: cols = ['SampleID', 'ParentID']
In [277]: A
Out[277]:
Real_ID SampleID ParentID Something AnotherThing
0 NaN 10 11 a b
1 NaN 20 21 a b
2 NaN 40 51 a b
In [278]: B
Out[278]:
SampleID ParentID
3 10 11
5 20 21
Solution:
In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)
In [280]: merged
Out[280]:
SampleID ParentID _merge
0 10 11 both
1 20 21 both
2 40 51 left_only
In [281]: B = pd.concat([B, merged.ix[merged._merge=='left_only', cols]])
In [282]: B
Out[282]:
SampleID ParentID
3 10 11
5 20 21
2 40 51
In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']
In [286]: A
Out[286]:
Real_ID SampleID ParentID Something AnotherThing
0 3 10 11 a b
1 5 20 21 a b
2 2 40 51 a b
Answered By - MaxU - stop genocide of UA
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.