Issue
I am trying to consolidate 2 columns to prevent duplicates. From this question, I am able to do it for one column like so:
df = df.groupby(['Date', 'Ticket ID', 'Score', 'many other Columns...'])['Work Order'].apply(', '.join).reset_index()
Which gives me a nice output of:
Date Ticket ID Work Order
2018-08-30 22:52:25 1444008 119846184
2021-09-29 13:33:49 1724734 122445397, 122441551
Now, I have at least one other column that I would like to do the same too, while keeping the Work Orders correct. However, repeating the process/code for a different column either seems to drop the original or not aggregate the columns like I want. I want:
Date Ticket ID Work Order Num Work Order ID
2018-08-30 22:52:25 1444008 119846184 A6zH
2021-09-29 13:33:49 1724734 122445397, 122441551 H43a1, JU8a
I have tried a couple different things with the reset_index
and merge
functions but cannot seem to get it to do what I want. This seems like it should be pretty simple - what am I missing? It is very possible that I will have to do this for other columns as well, so I would like a repeatable solution.
# tried some df1 and df2 stuff - along with different locations of reset index.
df = df.groupby(['Reported Date', 'Site','Ticket ID', 'TicketUID', 'Work Order UID', \
'cols...'])['Work Order'].apply(', '.join)
df = df.groupby(['Reported Date', 'Site','Ticket ID', 'TicketUID', 'Work Order', \
'cols...'])['Work Order UID'].apply(', '.join)
# tried on='Ticket ID' here as well.
#df = pd.merge(df1, df2)
Solution
apply
works differently on DataFrameGroupBy
(multiple grouped columns) and on SeriesGroupBy
(single grouped column). For single column it works on elements, for multiple columns it applies function to each column.
So one way to do what you want to do is simply do apply twice - f.e.:
data = {'id':[3763058, 3763058, 3763058, 3763077, 3763077, 3763078],
'id2':[3763056, 3763056, 3763056, 3763074, 3763074, 3763073],
'item1' : ['item1', 'item2', 'item3', 'item4', 'item5', 'item6'],
'item2' : ['itemA', 'itemB', 'itemC', 'itemD', 'itemE', 'itemF']}
df = pd.DataFrame(data)
df.groupby(['id', 'id2'])[['item1', 'item2']]\
.apply(lambda x: x.apply(', '.join))
.reset_index()
Which gives us result:
id id2 item1 item2
0 3763058 3763056 item1, item2, item3 itemA, itemB, itemC
1 3763077 3763074 item4, item5 itemD, itemE
2 3763078 3763073 item6 itemF
Answered By - Daniel Wlazło
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.