Friday, July 29, 2022

[FIXED] Using the predict_proba() function of RandomForestClassifier in the safe and right way

machine-learning, python, random-forest, scikit-learn

Issue

I'm using scikit-learn. Sometimes I need the probabilities of labels/classes instead of the labels/classes themselves. For example, instead of labelling emails as Spam/Not Spam, I want something like: a given email has a 0.78 probability of being Spam.

For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:

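from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# min_samples_split must be at least 2 in current scikit-learn releases
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

# Fit on the full training set, then get per-class probabilities
classifier = clf.fit(X, y)
predictions = classifier.predict_proba(Xtest)
print(predictions)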

And I got these results:

 [ 0.4  0.6]
 [ 0.1  0.9]
 [ 0.2  0.8]
 [ 0.7  0.3]
 [ 0.3  0.7]
 [ 0.3  0.7]
 [ 0.7  0.3]
 [ 0.4  0.6]

The second column is for the class Spam. However, I have two main concerns about these results. First, do the results represent the probabilities of the labels without being affected by the size of my data? Second, the results show only one decimal digit, which is not precise enough in cases where a probability of 0.701 is very different from 0.708. Is there any way to get more digits, for example five?

Solution

I get more than one digit in my results. Are you sure it is not due to your dataset? For example, a very small dataset would yield simple decision trees and therefore 'simple' probabilities. Otherwise, it may only be the display that shows one digit; try printing predictions[0,0].
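One further possible cause, offered as a hedged sketch rather than a diagnosis of the asker's data: predict_proba() averages the class probabilities of the individual trees, so with n_estimators=10 and fully grown trees whose leaves are pure, each tree contributes 0 or 1 and the average can only be a multiple of 0.1. The example below uses a synthetic dataset from make_classification as a stand-in (an assumption, not the original emails) and compares 10 trees with 200 trees. np.set_printoptions only changes how the array is displayed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email data (an assumption, not the asker's dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 10 fully grown trees: each tree votes 0 or 1, so the average is k/10
small = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
proba_small = small.predict_proba(X_test)

# 200 fully grown trees: the average can take any value k/200
large = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
proba_large = large.predict_proba(X_test)

# Show up to five decimal digits when printing arrays (display only)
np.set_printoptions(precision=5, suppress=True)
print(proba_small[:3])
print(proba_large[:3])

# A single element prints with its full floating-point precision
print(proba_large[0, 1])
```

More trees give a finer grid of possible values, but the extra digits are only as trustworthy as the model itself; they do not add precision that the data cannot support.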
I am not sure I understand what you mean by "the probabilities aren't affected by the size of my data". If your concern is that you don't want to predict, e.g., too many emails as spam, the usual approach is to use a threshold t such that you predict 1 if proba(label==1) > t. This way you can use the threshold to balance your predictions, for example to limit the overall proportion of emails predicted as spam. And if you want to analyse your model globally, we usually compute the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (see the Wikipedia article on the ROC curve). Basically, the ROC curve describes your predictions as a function of the threshold t.
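The thresholding idea and the ROC AUC can be sketched as follows. This is a minimal illustration on synthetic data, not the asker's setup: the dataset, the helper name predict_with_threshold, and the threshold values 0.5 and 0.8 are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email data (an assumption); class 1 plays "spam"
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Columns of predict_proba follow clf.classes_, so column 1 is class 1 here
spam_proba = clf.predict_proba(X_test)[:, 1]

def predict_with_threshold(proba, t):
    """Hypothetical helper: predict 1 only when proba(label==1) > t."""
    return (proba > t).astype(int)

default_preds = predict_with_threshold(spam_proba, 0.5)
strict_preds = predict_with_threshold(spam_proba, 0.8)

# A higher threshold can only keep or reduce the number of predicted spams
print(default_preds.sum(), strict_preds.sum())

# ROC AUC summarises ranking quality over all thresholds at once
auc = roc_auc_score(y_test, spam_proba)
print(auc)
```

Note that roc_auc_score takes the probabilities, not the thresholded labels, because the ROC curve is built by sweeping t over the predicted scores.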

Hope it helps!

Answered By - Sebastien

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
