Issue
I'm new to ML and have been working with an imbalanced dataset where the count of negative samples is twice that of the positive samples. To address this, I set scikit-learn's RandomForestClassifier class_weight = 'balanced', which gave me an ROC-AUC score of 0.904 and a recall of 0.86 for class 1. When I then tried to further improve the AUC score by assigning weights manually, e.g. class_weight = {0: 0.5, 1: 2.75}, there was no major difference in the results. I assumed this would penalize every misclassification of class 1 more heavily, but it didn't seem to work as expected.
from sklearn.ensemble import RandomForestClassifier
randomForestClf = RandomForestClassifier(random_state=42, class_weight={0: 0.5, 1: 2.75})
I tried different values, but none had a major impact: the recall for class 1 stays the same or drops slightly (0.85), and the change in AUC is insignificant (0.90122). The weights only seem to have an effect when one of the labels is set to 0. I further tried setting sample weights, but that didn't seem to work either.
# Sample weights: one weight per training example, indexed by its label
import numpy as np

class_weights = [0.5, 2]
weights = np.ones(y_train.shape[0], dtype='float')
for i, val in enumerate(y_train):
    weights[i] = class_weights[val]

# passed to fit via the sample_weight argument
randomForestClf.fit(X_train, y_train, sample_weight=weights)
Below is a reference to a similar question, but the solutions provided there didn't work for me: sklearn RandomForestClassifier's class_weights seems to have no effect
Is there anything I'm missing? Thanks!
Solution
The reason is that you grow the trees out fully, which leads to every leaf node being pure. That will happen regardless of the class weights (though the structure of the tree leading up to those pure nodes will change). The predicted probabilities of each tree will be (almost) all 0 or 1, and so the overall probability estimates are just driven by disagreements between the trees.
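To see this concretely, here is a minimal sketch on synthetic data (the 2:1 imbalance, sample size, and weight values are illustrative assumptions, not taken from the question): with fully grown trees, each individual tree's probability estimates are (almost) all exactly 0 or 1, with or without class weights.

# Sketch: fully grown trees give pure leaves, so per-tree probabilities
# are (almost) all 0 or 1 regardless of class_weight.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 2:1 imbalanced data (illustrative assumption)
X, y = make_classification(n_samples=3000, weights=[2/3, 1/3], random_state=42)

for cw in [None, {0: 0.5, 1: 2.75}]:
    clf = RandomForestClassifier(random_state=42, class_weight=cw).fit(X, y)
    tree_proba = clf.estimators_[0].predict_proba(X)[:, 1]  # one tree's estimates
    print(cw, "fraction of 0/1 tree probabilities:",
          np.mean(np.isin(tree_proba, [0.0, 1.0])))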
If you set e.g. max_depth=10 (or whatever tree complexity parameter you like), now many/most of the leaf nodes will not be pure. Setting larger positive-class weights will produce leaf values that are biased toward the positive class (but still aren't just 0 and 1), and so the probability estimates will be skewed higher across the board, leading to a higher recall (at the expense of precision, presumably).
The ROC curve is relatively unaffected by class balance and the skewed-higher probabilities arising from the larger weights, and so shouldn't be heavily affected by changing weights, for a fixed max_depth.
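A sketch of this fix on the same kind of synthetic data (max_depth=10 and the weight values are just examples): with depth limited, the larger positive-class weight skews the probability estimates upward, so recall for class 1 rises while AUC stays roughly fixed.

# Sketch: with limited tree depth, larger positive-class weights raise recall
# while ROC-AUC stays roughly the same for a fixed max_depth.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[2/3, 1/3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

for cw in [None, {0: 0.5, 1: 2.75}]:
    clf = RandomForestClassifier(random_state=42, max_depth=10,
                                 class_weight=cw).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    print(cw,
          "AUC:", round(roc_auc_score(y_te, proba), 4),
          "recall(1):", round(recall_score(y_te, clf.predict(X_te)), 4))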
Answered By - Ben Reiniger