Issue
I am working with two tables:
df1: Zone column has say, 3 categorical levels whereas the weight column is a continuous column.
id | Zone | Weight |
---|---|---|
1 | A | 4 |
2 | B | 14 |
3 | C | 25 |
4 | A | 8.5 |
5 | B | 10 |
df2: The first 2 columns are lower bounds and upper bounds of the same Weight column in df1. Moreover, the levels of the Zone column of df1 are in df2 as separate columns named Zone A, Zone B & Zone C.
Weight_LB | Weight_UB | Zone A | Zone B | Zone C |
---|---|---|---|---|
1 | 10 | -2.3 | 3.4 | -3.6 |
11 | 20 | -5.3 | 2.8 | -7.4 |
21 | 30 | -6.7 | -1.9 | 2.5 |
Problem Statement:
I want a final dataframe df3 which would look like:
id | Zone | Weight | New_Col |
---|---|---|---|
1 | A | 4 | -2.3 |
2 | B | 14 | 2.8 |
3 | C | 25 | 2.5 |
4 | A | 8.5 | -2.3 |
5 | B | 10 | 3.4 |
Note: It is somewhat similar to a left join but the code I am using currently is not at all efficient if df1 and df2 are very large.
I am using the following code right now:
# defining the two dataframes
df1 = pd.DataFrame({'id':[1,2,3,4,5],
'Zone':['A','B','C','A','B'],
'Weight':[4,14,25,8.5,10]})
df2 = pd.DataFrame({'Weight_LB':[1,11,21],
'Weight_UB':[10,20,30],
'A':[-2.3,-5.3,-6.7],
'B':[3.4,2.8,-1.9],
'C':[-3.6,-7.4,2.5]})
# unpivoting the df2 so that matching on conditions becomes easier
df2_melted = pd.melt(df2, id_vars=['Weight_LB','Weight_UB'], var_name = 'Zone', value_name = 'new_col')
# matching conditions on zone and weight between df1 and df2
new_col = {} #dictionary to store id-new column value pair
for index,row in df1.iterrows():
for i,r in df2_melted.iterrows(): # nested for loops to compare conditions between df1 and df2
condition = row['Zone']==r['Zone'] and r['Weight_LB']<=row['Weight'] and row['Weight']<=r['Weight_UB'] # conditions on zone and weight that need to be matched
if condition:
new_col[row['id']] = r['new_col'] # storing the values in the dictionary
new_col_df = pd.DataFrame(new_col.items(), columns=['id', 'new_col']) # converting the dictionary to a dataframe
# final resultant df3, with the left join being performed on id column to have the new column values mapped
df3 = pd.merge(df1,new_col_df,how='left',on='id')
df3
However, the above code is not very efficient due to obvious reasons when it comes to large datasets.
Is there a more pythonic way to do this? Any help is appreciated! Thanks in advance...
Solution
As your range seems to be consecutive, you can use merge_asof
based on Weight_UB
(or Weight_LB
, it doesn't matter, you just have to change the direction):
out = pd.merge_asof(
df1.sort_values('Weight'),
df2_melted.sort_values('Weight_UB').astype({'Weight_UB': float}),
left_on='Weight', right_on='Weight_UB', by='Zone', direction='forward'
)[df1.columns.tolist() + ['new_col']]
Output:
>>> out.sort_values('id', ignore_index=True)
id Zone Weight new_col
0 1 A 4.0 -2.3
1 2 B 14.0 2.8
2 3 C 25.0 2.5
3 4 A 8.5 -2.3
4 5 B 10.0 3.4
Answered By - Corralien
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.