Issue
Suppose I have the following Dataframe
final raw act wc Start Finish
abc xyz 30 M5 17-01-2022 06:00 14-07-2031 02:36
abc xyz 40 F4 17-01-2022 06:00 14-07-2031 02:36
abc xyz 50 F6 17-01-2022 06:00 14-07-2031 02:36
abc xyz 60 F8 17-01-2022 06:00 14-07-2031 02:36
abc pqr 40 M14S 17-01-2022 06:00 18-01-2026 17:21
abc pqr 50 M12 17-01-2022 06:00 18-01-2026 17:21
abc pqr 60 M14S 17-01-2022 06:00 18-01-2026 17:21
abc pqr 20 F3 17-01-2022 06:00 14-07-2031 02:36
abc pqr 40 F4 17-01-2022 06:00 14-07-2031 02:36
abc pqr 50 F6 17-01-2022 06:00 14-07-2031 02:36
I would like to take the two rows from here, one is
abc xyz 50 F6 17-01-2022 06:00 14-07-2031 02:36
another one is
abc pqr 50 F6 17-01-2022 06:00 14-07-2031 02:36
The logic would be that for each raw, pick up the next row where wc is either F3, F4 and the act is maximum. Here for xyz, F4 is there, so the next row also for pqr, both F3, F4 are there but maximum act is 40.
I did it using pd.shift()
dft = dfUno.loc[dfUno['wc'].shift().eq('F4')]
But I would like to see it in a more generic way, may be extracting using iterrows()
. Like, my code is only true for F4. I want to extract the dataframe used for above F4/ F3 also.
Expected outcome for this:
final raw act wc Start Finish
abc xyz 30 M5 17-01-2022 06:00 14-07-2031 02:36
abc xyz 40 F4 17-01-2022 06:00 14-07-2031 02:36
abc pqr 40 M14S 17-01-2022 06:00 18-01-2026 17:21
abc pqr 50 M12 17-01-2022 06:00 18-01-2026 17:21
abc pqr 60 M14S 17-01-2022 06:00 18-01-2026 17:21
abc pqr 20 F3 17-01-2022 06:00 14-07-2031 02:36
abc pqr 40 F4 17-01-2022 06:00 14-07-2031 02:36
pls suggest something, how to do it.
Solution
Solution:
ends = df[df.wc.isin(["F3", "F4"])].groupby("raw", sort=False).act.idxmax()
uniqs = ~df.raw.duplicated()
starts = uniqs[uniqs].index
out = df.iloc[np.r_[tuple(map(slice, starts, ends+1))]]
You can first subset the frame to only have rows with wc
equal to "F3"
or "F4"
. Then group by the raw
column to see which index per group gives a maximum act
. These give the "end" locations for the selections:
>>> ends = df[df.wc.isin(["F3", "F4"])].groupby("raw", sort=False).act.idxmax()
>>> ends
raw
xyz 1
pqr 8
Name: act, dtype: int64
Now we need the "start" positions. duplicated
function can be used to detect where each group in raw
column starts:
>>> uniqs = ~df.raw.duplicated()
>>> uniqs
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
Name: raw, dtype: bool
# get the locations of True's
>>> starts = uniqs[uniqs].index
>>> starts
Int64Index([0, 4], dtype='int64')
Now all we need to do is generate a dynamic slicer for the groups. For this, we map
the starts & ends with slice
object constructor and then pass it to np.r_
as a tuple to keep them altogether:
# +1 so that we include endpoints
>>> slices = tuple(map(slice, starts, ends+1))
>>> slices
(slice(0, 2, None), slice(4, 9, None))
>>> rows_to_choose = np.r_[slices]
>>> rows_to_choose
array([0, 1, 4, 5, 6, 7, 8])
So the end result arises with iloc
now
>>> df.iloc[rows_to_choose]
final raw act wc Start Finish
0 abc xyz 30 M5 17-01-2022 06:00 14-07-2031 02:36
1 abc xyz 40 F4 17-01-2022 06:00 14-07-2031 02:36
4 abc pqr 40 M14S 17-01-2022 06:00 18-01-2026 17:21
5 abc pqr 50 M12 17-01-2022 06:00 18-01-2026 17:21
6 abc pqr 60 M14S 17-01-2022 06:00 18-01-2026 17:21
7 abc pqr 20 F3 17-01-2022 06:00 14-07-2031 02:36
8 abc pqr 40 F4 17-01-2022 06:00 14-07-2031 02:36
Note 1: The sort=False
of groupby
assures raw
isn't sorted whilst grouping, otherwise we'd get the end position of the row with "pqr"
first in ends
since "pqr" < "xyz"
, and that'd conflict with starts
' ordering.
Note 2: The above assumes you have a RangeIndex, i.e., 0..N-1 kind of index. If this is not the case, you can copy it before and then put it back, i.e.,
orig_index = df.index
new_df = df.reset_index(drop=True)
... # above operations with `new_df`
out.index = orig_index[out.index]
Answered By - Mustafa Aydın
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.