Issue
I have a pandas dataframe like so:
data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
'column_1': [0, 0, 0, 0, 0.1, 1, 1.5, 2, 3, 4, 4.5, 5, 4.9, 3, 2, 1.8, 1, 0, 0, 1, 3, 0, 1.3, 2, 3, 4.3, 4.8, 5, 4.2, 3.5, 3, 2.6, 2, 1.9, 1, 0, 0, 0, 0, 0, 0.1, 0.2, 0.3, 1, 2, 3, 5, 4, 2, 0.5, 0, 0],
'column_2': [13, 25, 96, 59, 5, 92, 82, 141, 50, 85, 84, 113, 119, 128, 8, 133, 82, 10, 15, 62, 11, 68, 18, 24, 37, 55, 83, 48, 13, 81, 43, 36, 56, 43, 36, 46, 45, 127, 55, 67, 113, 98, 78, 78, 57, 131, 121, 126, 142, 51, 64, 95]}
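For reference, the frame can be built from this dict (a minimal setup sketch, assuming pandas is imported as pd):
import pandas as pd

df = pd.DataFrame(data)
print(df)
which prints as: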
ID column_1 column_2
0 A 0.0 13
1 A 0.0 25
2 A 0.0 96
3 A 0.0 59
4 A 0.1 5
5 A 1.0 92
6 A 1.5 82
7 A 2.0 141
8 A 3.0 50
9 A 4.0 85
10 A 4.5 84
11 A 5.0 113
12 A 4.9 119
13 A 3.0 128
14 A 2.0 8
15 A 1.8 133
16 A 1.0 82
17 A 0.0 10
18 A 0.0 15
19 A 1.0 62
20 A 3.0 11
21 A 0.0 68
22 A 1.3 18
23 A 2.0 24
24 A 3.0 37
25 A 4.3 55
26 A 4.8 83
27 A 5.0 48
28 A 4.2 13
29 A 3.5 81
30 A 3.0 43
31 A 2.6 36
32 A 2.0 56
33 A 1.9 43
34 A 1.0 36
35 A 0.0 46
36 A 0.0 45
37 A 0.0 127
38 A 0.0 55
39 A 0.0 67
40 A 0.1 113
41 A 0.2 98
42 B 0.3 78
43 B 1.0 78
44 B 2.0 57
45 B 3.0 131
46 B 5.0 121
47 B 4.0 126
48 B 2.0 142
49 B 0.5 51
50 B 0.0 64
51 B 0.0 95
Tracing back from when the value in column_1 hits 5, I want to find the value in column_2 just before the value in column_1 increased from 0, and just after it came back down to 0. So, in the data frame above, the column_2 values would be 5 and 10 for the first peak, and 18 and 46 for the second. I want to perform some arithmetic and would like to add two columns, Before and After, with those values, grouped by the ID column.
The expected output would be:
ID column_1 column_2 Before After
0 A 0.0 13 0 0
1 A 0.0 25 0 0
2 A 0.0 96 0 0
3 A 0.0 59 0 0
4 A 0.1 5 0 0
5 A 1.0 92 0 0
6 A 1.5 82 0 0
7 A 2.0 141 0 0
8 A 3.0 50 0 0
9 A 4.0 85 0 0
10 A 4.5 84 0 0
11 A 5.0 113 5 10
12 A 4.9 119 0 0
13 A 3.0 128 0 0
14 A 2.0 8 0 0
15 A 1.8 133 0 0
16 A 1.0 82 0 0
17 A 0.0 10 0 0
18 A 0.0 15 0 0
19 A 1.0 62 0 0
20 A 3.0 11 0 0
21 A 0.0 68 0 0
22 A 1.3 18 0 0
23 A 2.0 24 0 0
24 A 3.0 37 0 0
25 A 4.3 55 0 0
26 A 4.8 83 0 0
27 A 5.0 48 18 46
28 A 4.2 13 0 0
29 A 3.5 81 0 0
30 A 3.0 43 0 0
31 A 2.6 36 0 0
32 A 2.0 56 0 0
33 A 1.9 43 0 0
34 A 1.0 36 0 0
35 A 0.0 46 0 0
36 A 0.0 45 0 0
37 A 0.0 127 0 0
38 A 0.0 55 0 0
39 A 0.0 67 0 0
40 A 0.1 113 0 0
41 A 0.2 98 0 0
42 B 0.3 78 0 0
43 B 1.0 78 0 0
44 B 2.0 57 0 0
45 B 3.0 131 0 0
46 B 5.0 121 78 64
47 B 4.0 126 0 0
48 B 2.0 142 0 0
49 B 0.5 51 0 0
50 B 0.0 64 0 0
51 B 0.0 95 0 0
For a given ID, if column_1 starts with a non-zero value, it should give the first value of column_2 for that group. The rest of the rows in Before and After can be filled with null or zeroes.
Is there an elegant way to achieve this?
Solution
With where and shift you can generate series in which only the endpoints of the sub-sequences are filled and the rest is padded with NaN. Combined with ffill (or bfill, respectively), locating the rows where column_1 equals 5 with loc gives pretty much the desired result:
col1 = df.column_1
col2 = df.column_2
# Before: column_2 at the row where column_1 last left 0, forward-filled up to the peak row
df.loc[col1.eq(5), "Before"] = col2.where(col1.gt(0) & col1.shift().eq(0)).ffill()
# After: column_2 at the row where column_1 next returns to 0, back-filled to the peak row
df.loc[col1.eq(5), "After"] = col2.where(col1.eq(0) & col1.shift().gt(0)).bfill()
Result:
ID column_1 column_2 Before After
0 A 0.0 13 NaN NaN
1 A 0.0 25 NaN NaN
2 A 0.0 96 NaN NaN
3 A 0.0 59 NaN NaN
4 A 0.1 5 NaN NaN
5 A 1.0 92 NaN NaN
6 A 1.5 82 NaN NaN
7 A 2.0 141 NaN NaN
8 A 3.0 50 NaN NaN
9 A 4.0 85 NaN NaN
10 A 4.5 84 NaN NaN
11 A 5.0 113 5.0 10.0
12 A 4.9 119 NaN NaN
13 A 3.0 128 NaN NaN
14 A 2.0 8 NaN NaN
15 A 1.8 133 NaN NaN
16 A 1.0 82 NaN NaN
17 A 0.0 10 NaN NaN
18 A 0.0 15 NaN NaN
19 A 1.0 62 NaN NaN
20 A 3.0 11 NaN NaN
21 A 0.0 68 NaN NaN
22 A 1.3 18 NaN NaN
23 A 2.0 24 NaN NaN
24 A 3.0 37 NaN NaN
25 A 4.3 55 NaN NaN
26 A 4.8 83 NaN NaN
27 A 5.0 48 18.0 46.0
28 A 4.2 13 NaN NaN
29 A 3.5 81 NaN NaN
30 A 3.0 43 NaN NaN
31 A 2.6 36 NaN NaN
32 A 2.0 56 NaN NaN
33 A 1.9 43 NaN NaN
34 A 1.0 36 NaN NaN
35 A 0.0 46 NaN NaN
36 A 0.0 45 NaN NaN
37 A 0.0 127 NaN NaN
38 A 0.0 55 NaN NaN
39 A 0.0 67 NaN NaN
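To see what the where/shift step produces before any filling, here is a small sketch (assuming df is the full frame built from the dict above); only the endpoint rows remain non-NaN:
col1 = df.column_1
col2 = df.column_2
# column_2 at the rows where column_1 leaves 0 (rows 4, 19, 22 and 40 in the data above)
starts = col2.where(col1.gt(0) & col1.shift().eq(0))
# column_2 at the rows where column_1 returns to 0 (rows 17, 21, 35 and 50)
ends = col2.where(col1.eq(0) & col1.shift().gt(0))
print(starts.dropna())
print(ends.dropna())
Forward-filling starts and back-filling ends then carries those endpoint values onto every row in between, which is why selecting only the rows where column_1 equals 5 picks up the correct pair.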
With groupby()
Since ffill and bfill on the whole frame do not respect group boundaries, wrap the logic in a function and apply it per ID. Shifting with fill_value=0 also handles a group that starts with a non-zero value (like B), and initializing the Before and After columns with 0 fills the remaining rows with zeroes, as requested:
def set_flags(frame):
    col1 = frame.column_1
    col2 = frame.column_2
    m = col1.eq(5)
    # fill_value=0 lets a group that starts with a non-zero value count its first row
    # as a start, so its first column_2 value becomes the Before value
    frame.loc[m, "Before"] = col2.where(col1.gt(0) & col1.shift(fill_value=0).eq(0)).ffill()
    frame.loc[m, "After"] = col2.where(col1.eq(0) & col1.shift().gt(0)).bfill()
    return frame

df[["Before", "After"]] = 0  # initializing the columns with 0 keeps the int dtype
df = df.groupby("ID").apply(set_flags)
Result with the full data, grouped by ID:
ID column_1 column_2 Before After
0 A 0.0 13 0 0
1 A 0.0 25 0 0
2 A 0.0 96 0 0
3 A 0.0 59 0 0
4 A 0.1 5 0 0
5 A 1.0 92 0 0
6 A 1.5 82 0 0
7 A 2.0 141 0 0
8 A 3.0 50 0 0
9 A 4.0 85 0 0
10 A 4.5 84 0 0
11 A 5.0 113 5 10
12 A 4.9 119 0 0
13 A 3.0 128 0 0
14 A 2.0 8 0 0
15 A 1.8 133 0 0
16 A 1.0 82 0 0
17 A 0.0 10 0 0
18 A 0.0 15 0 0
19 A 1.0 62 0 0
20 A 3.0 11 0 0
21 A 0.0 68 0 0
22 A 1.3 18 0 0
23 A 2.0 24 0 0
24 A 3.0 37 0 0
25 A 4.3 55 0 0
26 A 4.8 83 0 0
27 A 5.0 48 18 46
28 A 4.2 13 0 0
29 A 3.5 81 0 0
30 A 3.0 43 0 0
31 A 2.6 36 0 0
32 A 2.0 56 0 0
33 A 1.9 43 0 0
34 A 1.0 36 0 0
35 A 0.0 46 0 0
36 A 0.0 45 0 0
37 A 0.0 127 0 0
38 A 0.0 55 0 0
39 A 0.0 67 0 0
40 A 0.1 113 0 0
41 A 0.2 98 0 0
42 B 0.3 78 0 0
43 B 1.0 78 0 0
44 B 2.0 57 0 0
45 B 3.0 131 0 0
46 B 5.0 121 78 64
47 B 4.0 126 0 0
48 B 2.0 142 0 0
49 B 0.5 51 0 0
50 B 0.0 64 0 0
51 B 0.0 95 0 0
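If you only want to look at the peak rows themselves, a quick check (a small sketch using the frame produced above) is:
print(df.loc[df.column_1.eq(5), ["ID", "column_2", "Before", "After"]])
which shows rows 11, 27 and 46 with Before/After pairs of 5/10, 18/46 and 78/64.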
Answered By - fsimonjetz