Issue
I have a pandas dataframe, shown below, with 5 million rows. Let's say it is about product manufacturing and represents the stage of a product, as output by every batch and sub_batch.
date batch sub_batch output
Jan-22 A A1 new
Feb-22 A A1 new
Mar-22 A A1 wip
Apr-22 A A1 wip
May-22 A A1 wip
Jun-22 A A1 done
Jul-22 A A1 done
Aug-22 A A1 done
Jan-23 A A2 new
Feb-23 A A2 new
Mar-23 A A2 wip
Apr-23 A A2 wip
May-23 A A2 wip
Jun-23 A A2 done
Jul-23 A A2 done
Jan-23 B B1 new
Feb-23 B B1 done
Mar-23 B B1 done
Apr-23 B B1 done
May-23 B B2 new
Jun-23 B B2 halt
Jul-23 B B2 halt
Aug-23 B B2 new
Sep-23 B B2 wip
Oct-23 B B2 wip
Nov-23 B B2 done
Dec-23 B B2 done
. . . .
. . . .
What I am trying to achieve is: once a sub_batch hits "done", stop there and delete all subsequent "done" rows for that sub_batch.
So, in the new dataframe, only keep the first occurrence of "done" for every sub_batch, and keep all other output rows as they are.
It is also important to maintain the same sort order. A "done" status only comes at the end. There are some sub_batches that are always new or wip and never get "done"; those should stay as they are. The concern is only to take the first "done", then move on to finding the next one, and so on.
The result dataframe will look like this:
date batch sub_batch output
Jan-22 A A1 new
Feb-22 A A1 new
Mar-22 A A1 wip
Apr-22 A A1 wip
May-22 A A1 wip
Jun-22 A A1 done
Jan-23 A A2 new
Feb-23 A A2 new
Mar-23 A A2 wip
Apr-23 A A2 wip
May-23 A A2 wip
Jun-23 A A2 done
Jan-23 B B1 new
Feb-23 B B1 done
May-23 B B2 new
Jun-23 B B2 halt
Jul-23 B B2 halt
Aug-23 B B2 new
Sep-23 B B2 wip
Oct-23 B B2 wip
Nov-23 B B2 done
My approach was: I created two columns with an initial value of 0 and assigned them the value 1 by iterating with some logic I thought of:
df["col1"] = 0
df["col2"] = 0

# Make a list of all outputs excluding "done"
ls = list(set(df["output"]))
ls.remove("done")

# Assign 1 to all outputs except "done"; "done" keeps the value 0
df["col1"] = [1 if item in ls else 0 for item in df["output"]]

# For col2: assign 1 to only the first occurrence of "done" per sub_batch;
# all other rows stay 0
unique_sub_batch = set(df["sub_batch"])
for sub in unique_sub_batch:
    for index, item in enumerate(df["output"]):
        if df["sub_batch"][index] == sub and item == "done":
            df["col2"][index] = 1
            break

# Sum the two helper columns
df["sum"] = df["col1"] + df["col2"]

# Select where sum == 1: this keeps the first occurrence of "done"
# (the loop breaks after it) and every non-"done" row
df = df[df["sum"] == 1]
df.drop(columns=["col1", "col2", "sum"], inplace=True)
My code works and gives the result after 2 days 😂. Could you please help make it faster by suggesting an alternative approach?
Thank you so much for the help, and sorry for the pain of going through such a long question.
Solution
If "done" rows only appear at the end of each batch/sub_batch group, you can simplify the solution with DataFrame.duplicated, inverting its mask with ~ and chaining it with a comparison that keeps the non-"done" rows:
out = df[df['output'].ne('done') | ~df.duplicated(['batch','sub_batch','output'])]
print (out)
date batch sub_batch output
0 Jan-22 A A1 new
1 Feb-22 A A1 new
2 Mar-22 A A1 wip
3 Apr-22 A A1 wip
4 May-22 A A1 wip
5 Jun-22 A A1 done
8 Jan-23 A A2 new
9 Feb-23 A A2 new
10 Mar-23 A A2 wip
11 Apr-23 A A2 wip
12 May-23 A A2 wip
13 Jun-23 A A2 done
15 Jan-23 B B1 new
16 Feb-23 B B1 done
19 May-23 B B2 new
20 Jun-23 B B2 halt
21 Jul-23 B B2 halt
22 Aug-23 B B2 new
23 Sep-23 B B2 wip
24 Oct-23 B B2 wip
25 Nov-23 B B2 done
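As a quick sanity check, the duplicated-based filter can be exercised on a small toy frame (hypothetical data, not the question's full table):

```python
import pandas as pd

# Toy frame: sub_batch A1 finishes with two trailing "done" rows
df = pd.DataFrame({
    "date": ["Jan-22", "Feb-22", "Mar-22", "Apr-22"],
    "batch": ["A", "A", "A", "A"],
    "sub_batch": ["A1", "A1", "A1", "A1"],
    "output": ["new", "wip", "done", "done"],
})

# Keep every non-"done" row, plus the first row of each
# (batch, sub_batch, output) combination -- which, for "done",
# is exactly its first occurrence
out = df[df["output"].ne("done") | ~df.duplicated(["batch", "sub_batch", "output"])]
print(out["output"].tolist())  # ['new', 'wip', 'done']
```

Both sides of the mask are vectorized, so this avoids the per-row Python loop entirely, which is what makes it viable on 5 million rows.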
If you instead need to remove all rows after the first "done" (whatever their status), use GroupBy.cummax and add back the rows with the first "done" per group, grouped by batch and sub_batch:
m = df['output'].eq('done')
out = df[~m.groupby([df['batch'], df['sub_batch']]).cummax() |
(m & ~df.duplicated(['batch','sub_batch','output']))]
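A small sketch of the difference on hypothetical data: here a "wip" row appears after the first "done", and the cummax variant drops it along with the repeated "done":

```python
import pandas as pd

# Toy frame where a "wip" row slips in after the first "done"
df = pd.DataFrame({
    "batch": ["B", "B", "B", "B"],
    "sub_batch": ["B1", "B1", "B1", "B1"],
    "output": ["new", "done", "wip", "done"],
})

m = df["output"].eq("done")
# Per group, cummax turns every row at or after the first "done" into True;
# the second term adds back only that first "done" row
out = df[~m.groupby([df["batch"], df["sub_batch"]]).cummax() |
         (m & ~df.duplicated(["batch", "sub_batch", "output"]))]
print(out["output"].tolist())  # ['new', 'done']
```

The first solution would have kept the trailing "wip" row; this one truncates each group at its first "done".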
Answered By - jezrael