Wednesday, February 7, 2024

[FIXED] labeling segments of data based on multiple conditions

February 07, 2024 indexing, pandas, python, python-3.x No comments

Issue

I have an indexed pandas dataframe with values like so:

data = {'ID':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'column1': [15, 16, 17, 14, 13, 5, 3, 2, 1.9, 1.2, 1, 0.8, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 2, 3, 4, 5, 6],
    'column2': [10, 11, 12, 13, 13.5, 14, 14.5, 15, 16, 17, 18, 19, 20, 20, 20, 20, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
    }


     ID  column1  column2
0    1     15.0     10.0
1    1     16.0     11.0
2    1     17.0     12.0
3    1     14.0     13.0
4    1     13.0     13.5
5    1      5.0     14.0
6    1      3.0     14.5
7    1      2.0     15.0
8    1      1.9     16.0
9    1      1.2     17.0
10   1      1.0     18.0
11   1      0.8     19.0
12   1      0.5     20.0
13   1      0.5     20.0
14   1      0.5     20.0
15   1      0.5     20.0
16   1      0.5     20.0
17   1      0.5     19.0
18   1      0.5     18.0
19   1      0.5     17.0
20   1      0.5     16.0
21   1      1.0     15.0
22   1      2.0     14.0
23   1      3.0     13.0
24   1      4.0     12.0
25   1      5.0     11.0
26   1      6.0     10.0

I want to label the rows based on the following conditions:

In column1, from the beginning until just before we first encounter a value <=2, AND when column2 is >13, label as Pre_Start
When 0.5 < column1 <= 2 AND < column2 <= 19 label as Start
When column1 <= 0.5 AND column2 >= 19 label as Steady
When column1 <= 0.5 AND 14 < column2 < 19 label as Ramp
When column1 > 0.5 AND column2 < 19 label as End

Such that the resulting dataframe would be something like this:

   ID  column1  column2      Label
0    1     15.0     10.0  Pre_Start
1    1     16.0     11.0  Pre_Start
2    1     17.0     12.0  Pre_Start
3    1     14.0     13.0  Pre_Start
4    1     13.0     13.5  Pre_Start
5    1      5.0     14.0  Pre_Start
6    1      3.0     14.5  Pre_Start
7    1      2.0     15.0      Start
8    1      1.9     16.0      Start
9    1      1.2     17.0      Start
10   1      1.0     18.0      Start
11   1      0.8     19.0      Start
12   1      0.5     20.0     Steady
13   1      0.5     20.0     Steady
14   1      0.5     20.0     Steady
15   1      0.5     20.0     Steady
16   1      0.5     20.0     Steady
17   1      0.5     19.0       Ramp
18   1      0.5     18.0       Ramp
19   1      0.5     17.0       Ramp
20   1      0.5     16.0       Ramp
21   1      1.0     15.0        End
22   1      2.0     14.0        End
23   1      3.0     13.0        End
24   1      4.0     12.0        End
25   1      5.0     11.0        End
26   1      6.0     10.0        End

I am able to successfully apply a grouped filter by defining a function start index using numpy like

def filter_group(group):
start_index = np.argmax(group['column1'].values <= 2)
return group.iloc[start_index:]

and calling the function using the apply clause:

filtered_df = df.groupby('ID', group_keys=False).apply(filter_group).reset_index(drop=True)

However, I can't get it to work with multiple conditions. Is there a more elegant way to achieve this?

Solution

You can use np.select to define labels and then modify them to detect the Pre_Start state:

def label_group(df):
    c1 = df['column1']
    c2 = df['column2']
    conds = [(0.5 <= c1) & (c1 <= 2) & (c2 > 14),  # Start
             (c1 < 0.5) & (c2 > 19),  # Steady
             (c1 < 0.5) & (14 <= c2) & (c2 <= 19),  # Ramp
             (c1 > 0.5) & (c2 < 14)]  # End
    choices = ['Start', 'Seady', 'Ramp', 'End']

    labels = np.select(condlist=conds, choicelist=choices)
    labels = pd.Series(labels, index=df.index)
    labels[:((c1 <= 2) & (c2 > 13)).argmax()] = 'Pre_Start'

    return labels.to_frame()
    
df['Label'] = df.groupby('ID').apply(label_group).droplevel('ID')

Note: conditions are not clear, maybe you can check them.

Answered By - Corralien

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, February 7, 2024

[FIXED] labeling segments of data based on multiple conditions

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels