Issue
Working from this Tutorial and Feature Importance, I am trying to reproduce the feature importances of my own random forest by hand:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=1,
                            max_depth=2,
                            max_features=2,
                            random_state=0)
rf.fit(X_train, Y_train)
rf.feature_importances_
array([0. , 0.11197953, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.88802047, 0. , 0. , 0. ])
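Only two entries of the importance array are nonzero, because a single tree with max_depth=2 can split on at most three features. As a quick sanity check (a minimal sketch, re-running the same setup as above), the nonzero entries can be mapped back to feature names:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0)
rf.fit(X_train, Y_train)

# Importances are normalized to sum to 1; map nonzero ones to names
nz = np.flatnonzero(rf.feature_importances_)
for i in nz:
    print(data.feature_names[i], rf.feature_importances_[i])
```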
import matplotlib.pyplot as plt
from sklearn import tree

fn = data.feature_names
cn = data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True)
fig.savefig('rf_individualtree.png')
Then I calculate the feature importance by hand, trying to reproduce the values sklearn reports (0.11197953, 0.88802047):
a = (192/265)*(0.262-(68/192)*0.452-(124/192)*0.103)
b = (265/265)*(0.459-(192/265)*0.262-(73/265)*0.185)+(73/265)*(0.185-(72/73)*0.173)
print(b/(a+b))
print(a/(a+b))
0.8625754868011606
0.13742451319883947
Which part did I get wrong, so that my result differs from sklearn's answer? Or does sklearn just not follow the formula?
Solution
You have a couple of problems:
- a rounding error
- a math error, specifically in calculating the probability of reaching a node
Once you correct them, you will get sklearn's result:
print(rf.estimators_[0].tree_.impurity)
array([0.45899182, 0.26172737, 0.10250188, 0.45244126, 0.18549346,
0.17300567, 0. ])
n1 = 0.45899182261015226 - (310/426)*0.26172736732570234 - (116/426)*0.1854934601664685
n2 = (116/426)*0.1854934601664685 - (115/426)*0.17300567107750475
n3 = (310/426)*0.26172736732570234 - (203/426)*0.10250188065713806 - (107/426)*0.45244126124552364
f1 = n1+n2
f2 = n3
print(f1/(f1+f2), f2/(f1+f2))
0.888020474590027 0.11197952540997297
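The per-node arithmetic above can be automated by reading the fitted tree's arrays directly. The sketch below (re-running the same setup, and mirroring how sklearn's mean-decrease-in-impurity computation works: each split's impurity decrease is weighted by the probability of reaching that node, i.e. its weighted sample count over the root's) reproduces feature_importances_:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0).fit(X_train, Y_train)

t = rf.estimators_[0].tree_
n = t.weighted_n_node_samples
imp = np.zeros(X_train.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:      # leaf: no split, contributes no importance
        continue
    # weighted impurity decrease of this split, scaled by P(reaching node)
    gain = (n[node] * t.impurity[node]
            - n[left] * t.impurity[left]
            - n[right] * t.impurity[right]) / n[0]
    imp[t.feature[node]] += gain
imp /= imp.sum()        # normalize so the importances sum to 1

print(np.allclose(imp, rf.feature_importances_))
```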
(You may read more on how importance is calculated here by package developers or here by reading the source code)
Note as well that what a RandomForest considers important may not be so important for another model (and vice versa); i.e., "importance" here is model-specific, and may not be as intuitive or as expected for people who are more accustomed to linear explainability.
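If a model-agnostic view is wanted, permutation importance is one common alternative to the impurity-based importance discussed above: it measures how much a metric drops when a feature's values are shuffled. A minimal sketch, reusing the same setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0).fit(X_train, Y_train)

# Shuffle each feature on held-out data and record the score drop
r = permutation_importance(rf, X_test, Y_test, n_repeats=10, random_state=0)
for i in r.importances_mean.argsort()[::-1][:3]:
    print(data.feature_names[i], r.importances_mean[i])
```

Unlike the impurity-based numbers, these are computed on held-out data and are not constrained to sum to 1.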
Answered By - Sergey Bushmanov