Issue
I have a DataFrame with dozens of columns.
Therapy area Procedures1 Procedures2 Procedures3
Oncology 450 450 2345
Oncology 367 367 415
Oncology 152 152 4945
Oncology 876 876 345
Oncology 1098 1098 12
Oncology 1348 1348 234
Nononcology 225 225 345
Nononcology 300 300 44
Nononcology 267 267 45
Nononcology 90 90 4567
I want to change numeric values in all Procedure
columns into buckets.
For one column it will be something like
def hello(x):
if x['Therapy area'] == 'Oncology' and x['Procedures1'] < 200: return int(1)
if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range (200, 500): return 2
if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range (500, 1000): return 3
if x['Therapy area'] == 'Oncology' and x['Procedures1'] > 1000: return 4
if x['Therapy area'] != 'Oncology' and x['Procedures1'] < 200: return 11
if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range (200, 500): return 22
if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range (500, 1000): return 33
if x['Therapy area'] != 'Oncology' and x['Procedures1'] > 1000: return 44
test['Procedures1'] = test.apply(hello, axis=1)
What is the most efficient way to apply this for dozens of columns with different column names (not Procedures1
, Procedures2
, 'Procedures3` etc)?
UPD
I added the third column and the code does not work, giving the error.
ValueError: bins must increase monotonically.
Bins does not answer my question directly. I can have different values. I would prefer a solution with logical operations, not bins.
Also bins can be different for Nononcology, like 11, 22, 33, 44
Solution
You could apply
pd.cut
to the relevant columns:
cols = ['Procedures1', 'Procedures2']
df[cols] = df[cols].apply(lambda col: pd.cut(col, [0,200,500,1000, col.max()], labels=[1,2,3,4]))
Output:
Therapy_area Procedures1 Procedures2
0 Oncology 2 2
1 Oncology 2 2
2 Oncology 1 1
3 Oncology 3 3
4 Oncology 4 4
5 Oncology 4 4
6 Nononcology 2 2
7 Nononcology 2 2
8 Nononcology 2 2
9 Nononcology 1 1
You could also use np.select
:
def encoding(col, labels):
return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)
onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'
df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)
Output:
Therapy_area Procedures1 Procedures2 Procedures3
0 Oncology 2 2 4
1 Oncology 2 2 2
2 Oncology 1 1 4
3 Oncology 3 3 2
4 Oncology 4 4 1
5 Oncology 4 4 2
6 Nononcology 22 22 44
7 Nononcology 22 22 22
8 Nononcology 11 11 44
9 Nononcology 33 33 22
Answered By - enke
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.