Friday, March 4, 2022

[FIXED] Applying multiple conditions for multiple columns in pandas dataframe efficiently

Issue

I have a DataFrame with dozens of columns.

Therapy area    Procedures1 Procedures2 Procedures3
Oncology        450         450         2345
Oncology        367         367         415
Oncology        152         152         4945
Oncology        876         876         345
Oncology        1098        1098        12
Oncology        1348        1348        234
Nononcology     225         225         345
Nononcology     300         300         44
Nononcology     267         267         45
Nononcology     90          90          4567

I want to change numeric values in all Procedure columns into buckets.

For one column, it would be something like:

def hello(x):
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] < 200: return 1
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range(200, 500): return 2
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range(500, 1000): return 3
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] >= 1000: return 4
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] < 200: return 11
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range(200, 500): return 22
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range(500, 1000): return 33
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] >= 1000: return 44
test['Procedures1'] = test.apply(hello, axis=1)

What is the most efficient way to apply this to dozens of columns with different column names (not Procedures1, Procedures2, Procedures3, etc.)?

UPD

I added a third column, and the code no longer works. It raises this error:

ValueError: bins must increase monotonically.

Bins do not answer my question directly, because I can have different values. I would prefer a solution with logical operations, not bins.

Also, the bucket labels can be different for Nononcology, such as 11, 22, 33, 44.

Solution

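You could apply pd.cut to the relevant columns. Using np.inf as the last bin edge keeps the edges increasing even when a column's maximum is below 1000:

import numpy as np
import pandas as pd
cols = ['Procedures1', 'Procedures2']
df[cols] = df[cols].apply(lambda col: pd.cut(col, [0, 200, 500, 1000, np.inf], labels=[1, 2, 3, 4]))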

Output:

  Therapy_area Procedures1 Procedures2
0     Oncology           2           2
1     Oncology           2           2
2     Oncology           1           1
3     Oncology           3           3
4     Oncology           4           4
5     Oncology           4           4
6  Nononcology           2           2
7  Nononcology           2           2
8  Nononcology           2           2
9  Nononcology           1           1
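
Using col.max() as the last bin edge can raise the ValueError quoted in the question whenever a column's maximum falls below 1000, because the edges then stop increasing. The sketch below reproduces that on a small made-up Series (the name small and its values are illustrative, not taken from the question's real data) and shows an open-ended np.inf edge avoiding it. Passing right=False is an assumption made here so that a value of exactly 200 lands in bucket 2, as in the question's range(200, 500) test; with the default right=True it would land in bucket 1.

```python
import numpy as np
import pandas as pd

# Hypothetical column whose maximum (345) is below the 1000 edge.
small = pd.Series([12, 345, 234])

# With small.max() as the last edge the bins are [0, 200, 500, 1000, 345],
# which is not increasing, so pd.cut rejects them.
try:
    pd.cut(small, [0, 200, 500, 1000, small.max()], labels=[1, 2, 3, 4])
except ValueError as exc:
    error_message = str(exc)

# An infinite last edge is always larger than 1000, whatever the data holds.
# right=False makes each interval closed on the left: [0, 200), [200, 500), ...
bins = [0, 200, 500, 1000, np.inf]
codes = pd.cut(small, bins, labels=[1, 2, 3, 4], right=False)

print(error_message)
print(codes.astype(int).tolist())
```

The result of pd.cut is categorical, so astype(int) is used here only to get plain integers for printing.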

You could also use np.select, which returns the label of the first condition that matches:

def encoding(col, labels):
    return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)

cols = ['Procedures1', 'Procedures2', 'Procedures3']
onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'

# msk selects the Oncology rows and ~msk the remaining rows, so each group gets its own labels
df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[~msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)

Output:

  Therapy_area  Procedures1  Procedures2  Procedures3
0     Oncology            2            2            4
1     Oncology            2            2            2
2     Oncology            1            1            4
3     Oncology            3            3            2
4     Oncology            4            4            1
5     Oncology            4            4            2
6  Nononcology           22           22           22
7  Nononcology           22           22           11
8  Nononcology           22           22           11
9  Nononcology           11           11           44
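
The concat and reset_index step relies on the frame having a default integer index with all Oncology rows listed first. As a possible alternative, the sketch below runs np.select once over the whole block of columns and then uses np.where with the row mask to choose between the two label sets, so row order does not matter. The helper name bucketize and the prefix-based column selection are illustrative assumptions, not part of the original answer. The boundaries follow the question's intent (under 200, 200 to 499, 500 to 999, 1000 and above); treating exactly 1000 as the top bucket is also an assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Therapy_area": ["Oncology"] * 6 + ["Nononcology"] * 4,
    "Procedures1": [450, 367, 152, 876, 1098, 1348, 225, 300, 267, 90],
    "Procedures2": [450, 367, 152, 876, 1098, 1348, 225, 300, 267, 90],
    "Procedures3": [2345, 415, 4945, 345, 12, 234, 345, 44, 45, 4567],
})

# Assumption: the numeric columns share a prefix. Any explicit list works too.
cols = [c for c in df.columns if c.startswith("Procedures")]


def bucketize(frame, cols, is_onc, onc_labels, nonc_labels):
    """Return a copy of frame with cols replaced by bucket labels."""
    values = frame[cols].to_numpy()
    # np.select takes the first condition that is True, so these cumulative
    # tests mean <200, 200-499 and 500-999; everything else gets the default.
    conditions = [values < 200, values < 500, values < 1000]
    onc = np.select(conditions, onc_labels[:3], default=onc_labels[3])
    nonc = np.select(conditions, nonc_labels[:3], default=nonc_labels[3])
    out = frame.copy()
    # Turn the row mask into a column vector so it broadcasts across columns.
    out[cols] = np.where(is_onc.to_numpy()[:, None], onc, nonc)
    return out


result = bucketize(
    df, cols, df["Therapy_area"] == "Oncology", [1, 2, 3, 4], [11, 22, 33, 44]
)
print(result)
```

Because the conditions are plain boolean arrays, different thresholds or label sets per group can be swapped in without touching the rest of the function.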

Answered By - enke

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
