Thursday, January 20, 2022

[FIXED] Building ML classifier with imbalanced data

January 20, 2022 cross-validation, machine-learning, python, resampling, scikit-learn No comments

Issue

I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).

Using the code below I am getting weird values (all 1s). I am not figuring out if this is due to a problem of overfitting/imbalance data or to feature selection (I used Pearson correlation since all values are numeric/boolean). I am thinking that the steps followed are wrong.

import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score

y = df['Label']
X = df.drop('Label',axis=1)

def create_cv(X,y):
    if type(X)!=np.ndarray:
        X=X.values
        y=y.values
 
    test_size=1/5
    proportion_of_true=y[y==1].shape[0]/y.shape[0]
    num_test_samples=math.ceil(y.shape[0]*test_size)
    num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
    num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
    
    y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
    y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])

    X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
    X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
    return X_train,X_test,y_train,y_test

X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
    
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)       

y_predict_test = tree.predict(X_test)

print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)

Output:

     precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94

Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output. What I would like to ask you for some clear answer to follow that can show me the steps and what I am doing wrong.

I know that, to reduce overfitting and work with balance data, there are some methods such as random sampling (over/under), SMOTE, CV. My idea is

Split the data on train/test taking into account imbalance
Perform CV on trains set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set (f1-score)

as also outlined in this question: CV and under sampling on a test fold .

I think the steps above should make sense, but happy to receive any feedback that you might have on this.

Solution

When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has less values.

Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you grab first the samples from the less represented class use the size of the set to compute how many samples to get from the other class:

This code may help you split your dataset that way:

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)

    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))

    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]

    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""

    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]

    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]

    return train_pos.append(train_neg).sample(frac = 1).reset_index(drop = True), \
           test_pos.append(test_neg).sample(frac = 1).reset_index(drop = True)

Usage:

train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)

You can now perform cross validation on train_ds and evaluate your model in test_ds.

Answered By - JuanR

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, January 20, 2022

[FIXED] Building ML classifier with imbalanced data

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels