Issue
Working from this Tutorial and Feature Importance, I am trying to reproduce the feature importances of my own random forest by hand:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=1,
                            max_depth=2,
                            max_features=2,
                            random_state=0)
rf.fit(X_train, Y_train)
rf.feature_importances_
array([0. , 0.11197953, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.88802047, 0. , 0. , 0. ])
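Only two entries of the importance array are nonzero, because a single tree with max_depth=2 can split on at most three features. As a quick sanity check (a minimal sketch, re-running the same setup as above), the nonzero entries can be mapped back to feature names:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0)
rf.fit(X_train, Y_train)

# Importances are normalized to sum to 1; map nonzero ones to names
nz = np.flatnonzero(rf.feature_importances_)
for i in nz:
    print(data.feature_names[i], rf.feature_importances_[i])
```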
import matplotlib.pyplot as plt
from sklearn import tree

fn = data.feature_names
cn = data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True)
fig.savefig('rf_individualtree.png')
Then I calculate the feature importance by hand, trying to reproduce the values sklearn reports (0.11197953, 0.88802047):
a = (192/265)*(0.262-(68/192)*0.452-(124/192)*0.103)
b = (265/265)*(0.459-(192/265)*0.262-(73/265)*0.185)+(73/265)*(0.185-(72/73)*0.173)
print(b/(a+b))
print(a/(a+b))
0.8625754868011606
0.13742451319883947
Which part did I get wrong, so that my result differs from sklearn's answer? Or does sklearn just not follow the formula?
Solution
You have a couple of problems:
- a rounding error
- a math error, specifically in calculating the probability of reaching a node
Once you correct them, you will get sklearn's result:
print(rf.estimators_[0].tree_.impurity)
array([0.45899182, 0.26172737, 0.10250188, 0.45244126, 0.18549346,
0.17300567, 0. ])
n1 = 0.45899182261015226 - (310/426)*0.26172736732570234 - (116/426)*0.1854934601664685
n2 = (116/426)*0.1854934601664685 - (115/426)*0.17300567107750475
n3 = (310/426)*0.26172736732570234 - (203/426)*0.10250188065713806 - (107/426)*0.45244126124552364
f1 = n1+n2
f2 = n3
print(f1/(f1+f2), f2/(f1+f2))
0.888020474590027 0.11197952540997297
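The per-node arithmetic above can be automated by reading the fitted tree's arrays directly. The sketch below (re-running the same setup, and mirroring how sklearn's mean-decrease-in-impurity computation works: each split's impurity decrease is weighted by the probability of reaching that node, i.e. its weighted sample count over the root's) reproduces feature_importances_:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0).fit(X_train, Y_train)

t = rf.estimators_[0].tree_
n = t.weighted_n_node_samples
imp = np.zeros(X_train.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:      # leaf: no split, contributes no importance
        continue
    # weighted impurity decrease of this split, scaled by P(reaching node)
    gain = (n[node] * t.impurity[node]
            - n[left] * t.impurity[left]
            - n[right] * t.impurity[right]) / n[0]
    imp[t.feature[node]] += gain
imp /= imp.sum()        # normalize so the importances sum to 1

print(np.allclose(imp, rf.feature_importances_))
```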
(You may read more on how importance is calculated here by package developers or here by reading the source code)
Note as well that what a RandomForest considers important may not be so important for another model (and vice versa); i.e., "importance" here is model-specific, and may not be as intuitive or as expected for people who are more accustomed to linear explainability.
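If a model-agnostic view is wanted, permutation importance is one common alternative to the impurity-based importance discussed above: it measures how much a metric drops when a feature's values are shuffled. A minimal sketch, reusing the same setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0).fit(X_train, Y_train)

# Shuffle each feature on held-out data and record the score drop
r = permutation_importance(rf, X_test, Y_test, n_repeats=10, random_state=0)
for i in r.importances_mean.argsort()[::-1][:3]:
    print(data.feature_names[i], r.importances_mean[i])
```

Unlike the impurity-based numbers, these are computed on held-out data and are not constrained to sum to 1.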
Answered By - Sergey Bushmanov