Wednesday, October 26, 2022

[FIXED] Update pandas cell in one dataframe from looked up value in second dataframe

October 26, 2022 dataframe, pandas, python No comments

Issue

Update pandas cell in one Dataframe from looked up value in second Dataframe

I have a case where I need to update a cell in one Dataframe, 'Stock', which holds records of stock on-hand, looking up its value in a second Dataframe, 'Items', which is the table of all items. This is a simplified example of the Dataframes with relevant fields.

Stock

    item_no qty
0   9H.111  101
1   9H.222  230
2   MODEL_B 136
3   9H.444  344
4   MODEL_E 505
5   9H.666  332

Items

    item_no model_no
0   9H.111  MODEL_A
1   9H.222  MODEL_B
2   9H.333  MODEL_B
3   9H.444  MODEL_C
4   9H.555  MODEL_D
5   9H.666  MODEL_E
6   9H.777  MODEL_D
7   9H.888  MODEL_F

The challenge

I have previously done this in PostgreSQL but would like to see if I can do all the processing in Pandas (I plan to link to the PostgreSQL table of items). If we look at the Stock table the item_no column should only have item numbers, see Items Dataframe (table), but sometimes the users put in the model_no instead of the item number. So in the Stock dataframe, row 2 incorrectly has the value MODEL_B.

What's needed

What is needed to be done is to:

get the value MODEL_B from the item_no column in the Stock dataframe
find that in the model_no column of the Items dataframe
then get the value from the item_no field of the Items dataframe
use that value to replace the (incorrect) model number value in the item_no column of the Stock dataframe

It gets a little more challenging... a model may have more than one part number:

1   9H.222  MODEL_B
2   9H.333  MODEL_B

In this case the 'highest' part number, in this case 9H.333, needs to be used. In SQL I use the MAX() operator.

I would like to perform this using 'set' operations in pandas (not looping), similar to running a query in SQL. So this would mean (?) joining the two dataframes on the fields stock.item_no <-> items.model_no (?) - I'm not sure how to go about it hence the question marks.

Generate Dataframes

This code will generate the dataframes discussed above.

stock = pd.DataFrame({
    'item_no': ['9H.111', '9H.222', 'MODEL_B', '9H.444', 'MODEL_E', '9H.666'],
    'qty': [101, 230, 136, 344, 505, 332],
})
items = pd.DataFrame({
    'item_no': ['9H.111', '9H.222', '9H.333', '9H.444', '9H.555', '9H.666', '9H.777', '9H.888'],
    'model_no': ['MODEL_A', 'MODEL_B', 'MODEL_B', 'MODEL_C', 'MODEL_D', 'MODEL_E', 'MODEL_D', 'MODEL_F']
})

display(stock)
display(items)

Solution

You can use:

# keep max item_no per model_no
# convert to mapping Series
mapper = (items.sort_values(by='item_no')
               .drop_duplicates(subset='model_no', keep='last')
               .set_index('model_no')['item_no']
         )

# identify rows with model_no
m = stock['item_no'].isin(mapper.index)

# replace the values in place
stock.loc[m, 'item_no'] = stock.loc[m, 'item_no'].map(mapper)

updated stock:

  item_no  qty
0  9H.111  101
1  9H.222  230
2  9H.333  136
3  9H.444  344
4  9H.666  505
5  9H.666  332

Answered By - mozway

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0