Issue
I am working on stratified k-fold CV with a decision tree model. The feature importances from the decision tree are different each time I run the model, and the accuracy differs each time as well. Can anyone help me understand why the result is different each time?
Also, the code below uses 10-fold CV, so which fold do I use for feature importance? Do I need to find the overlap of the feature importances from each fold?
Thanks
Here I use the code below, with the model changed to a decision tree:
https://www.geeksforgeeks.org/stratified-k-fold-cross-validation/
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model
from sklearn import datasets
from sklearn import tree

# Fetch features and target variable as arrays.
cancer = datasets.load_breast_cancer()
x = cancer.data    # input features
y = cancer.target  # target variable

# Feature scaling for the input features.
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)

# Create the classifier object.
# lr = linear_model.LogisticRegression()
lr = tree.DecisionTreeClassifier(criterion="gini")

# Create the StratifiedKFold object.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []

for train_index, test_index in skf.split(x, y):
    x_train_fold, x_test_fold = x_scaled[train_index], x_scaled[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    lr.fit(x_train_fold, y_train_fold)
    lst_accu_stratified.append(lr.score(x_test_fold, y_test_fold))

# Print the results.
print('List of fold accuracies:', lst_accu_stratified)
print('\nMaximum accuracy:', max(lst_accu_stratified) * 100, '%')
print('\nMinimum accuracy:', min(lst_accu_stratified) * 100, '%')
print('\nOverall (mean) accuracy:', mean(lst_accu_stratified) * 100, '%')
print('\nStandard deviation:', stdev(lst_accu_stratified))

# Note: this prints the importances of the tree fitted on the LAST fold only.
print(lr.feature_importances_)
Solution
Can anyone help me understand why the result is different each time?
In this specific code it is due to randomness in decision tree fitting: when several candidate splits are equally good, DecisionTreeClassifier breaks ties at random, so repeated runs can produce different tree structures. This changes both the fold accuracies and which features are considered important.
You can make the runs reproducible by setting random_state in DecisionTreeClassifier:
lr = tree.DecisionTreeClassifier(criterion="gini", random_state=1)
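On the second question (which fold's feature importances to use): rather than picking one fold or intersecting per-fold rankings, a common approach is to collect feature_importances_ from the tree fitted on each fold and average them across folds, which gives a more stable ranking. A minimal sketch of that idea (fixing random_state as above; the averaging step is a suggested technique, not part of the original answer):

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import StratifiedKFold

cancer = datasets.load_breast_cancer()
x, y = cancer.data, cancer.target

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
clf = tree.DecisionTreeClassifier(criterion="gini", random_state=1)

# Fit one tree per fold and keep its feature importances.
importances = []
for train_index, test_index in skf.split(x, y):
    clf.fit(x[train_index], y[train_index])
    importances.append(clf.feature_importances_)

# Average across the 10 folds for a more stable ranking.
mean_importance = np.mean(importances, axis=0)

# Show the five features with the highest average importance.
top5 = sorted(zip(cancer.feature_names, mean_importance),
              key=lambda t: -t[1])[:5]
for name, imp in top5:
    print(f"{name}: {imp:.3f}")
```

Each fitted tree's importances sum to 1, so the fold-averaged vector does too; features that rank highly across all folds are the ones the model relies on consistently.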
Answered By - seralouk