Friday, April 8, 2022

[FIXED] Why the sum "value" isn't equal to the number of "samples" in scikit-learn RandomForestClassifier?

April 08, 2022 decision-tree, machine-learning, python, random-forest, scikit-learn No comments

Issue

I built a random forest by RandomForestClassifier and plot the decision trees. What does the parameter "value" (pointed by red arrows) mean? And why the sum of two numbers in the [] doesn't equal to the number of "samples"? I saw some other examples, the sum of two numbers in the [] equals to the number of "samples". Why in my case, it doesn't?

df = pd.read_csv("Dataset.csv")
df.drop(['Flow ID', 'Inbound'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace = True)
df.Label[df.Label == 'BENIGN'] = 0
df.Label[df.Label == 'DrDoS_LDAP'] = 1
Y = df["Label"].values
Y = Y.astype('int')
X = df.drop(labels = ["Label"], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestClassifier(n_estimators = 20)
model.fit(X_train, Y_train)
Accuracy = model.score(X_test, Y_test)
        
for i in range(len(model.estimators_)):
    fig = plt.figure(figsize=(15,15))
    tree.plot_tree(model.estimators_[i], feature_names = df.columns, class_names = ['Benign', 'DDoS'])
    plt.savefig('.\\TheForest\\T'+str(i))

Solution

Nice catch.

Although undocumented, this is due to the bootstrap sampling taking place by default in a Random Forest model (see my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? for more on the RF algorithm details and its difference from a mere "bunch" of decision trees).

Let's see an example with the iris data:

from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

rf = RandomForestClassifier(max_depth = 3)
rf.fit(iris.data, iris.target)

tree.plot_tree(rf.estimators_[0]) # take the first tree

The result here is similar to what you report: for every other node except the lower right one, sum(value) does not equal samples, as it should be the case for a "simple" decision tree.

A cautious observer would have noticed something else which seems odd here: while the iris dataset has 150 samples:

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class

and the base node of the tree should include all of them, the samples for the first node are only 89.

Why is that, and what exactly is going on here? To see, let us fit a second RF model, this time without bootstrap sampling (i.e. with bootstrap=False):

rf2 = RandomForestClassifier(max_depth = 3, bootstrap=False) # no bootstrap sampling
rf2.fit(iris.data, iris.target)

tree.plot_tree(rf2.estimators_[0]) # take again the first tree

Well, now that we have disabled bootstrap sampling, everything looks "nice": the sum of value in every node equals samples, and the base node contains indeed the whole dataset (150 samples).

So, the behavior you describe seems to be due to bootstrap sampling indeed, which, while creating samples with replacement (i.e. ending up with duplicate samples for each individual decision tree of the ensemble), these duplicate samples are not reflected in the sample values of the tree nodes, which display the number of unique samples; nevertheless, it is reflected in the node value.

The situation is completely analogous with that of a RF regression model - see own answer in sklearn RandomForestRegressor discrepancy in the displayed tree values.

Answered By - desertnaut

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, April 8, 2022

[FIXED] Why the sum "value" isn't equal to the number of "samples" in scikit-learn RandomForestClassifier?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels