Issue
I am working on stratified k-fold CV with a decision tree model. The feature importances from the decision tree are different each time I run the model, and the accuracy differs each time as well. Can anyone help me understand why the result is different each time?
Also, the code below uses 10-fold CV, so which fold do I use for feature importance? Do I need to find the overlap of the feature importances from each fold?
Thanks
Here I use the code below, with the model changed to a decision tree:
https://www.geeksforgeeks.org/stratified-k-fold-cross-validation/
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model
from sklearn import datasets
from sklearn import tree

# Fetch features and target variable as arrays.
cancer = datasets.load_breast_cancer()
x = cancer.data    # input features
y = cancer.target  # target variable

# Feature scaling for the input features.
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)

# Create the classifier object.
# lr = linear_model.LogisticRegression()
lr = tree.DecisionTreeClassifier(criterion="gini")

# Create the StratifiedKFold object.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []

for train_index, test_index in skf.split(x, y):
    x_train_fold, x_test_fold = x_scaled[train_index], x_scaled[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    lr.fit(x_train_fold, y_train_fold)
    lst_accu_stratified.append(lr.score(x_test_fold, y_test_fold))

# Print the results.
print('List of fold accuracies:', lst_accu_stratified)
print('\nMaximum accuracy:', max(lst_accu_stratified) * 100, '%')
print('\nMinimum accuracy:', min(lst_accu_stratified) * 100, '%')
print('\nOverall (mean) accuracy:', mean(lst_accu_stratified) * 100, '%')
print('\nStandard deviation:', stdev(lst_accu_stratified))

# Note: this prints the importances of the tree fitted on the LAST fold only.
print(lr.feature_importances_)
Solution
Can anyone help me understand why the result is different each time?
In this specific code it is due to randomness in decision tree fitting: when several candidate splits are equally good, DecisionTreeClassifier breaks ties at random, so repeated runs can produce different tree structures. This changes both the fold accuracies and which features are considered important.
You can make the runs reproducible by setting random_state in DecisionTreeClassifier:
lr = tree.DecisionTreeClassifier(criterion="gini", random_state=1)
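On the second question (which fold's feature importances to use): rather than picking one fold or intersecting per-fold rankings, a common approach is to collect feature_importances_ from the tree fitted on each fold and average them across folds, which gives a more stable ranking. A minimal sketch of that idea (fixing random_state as above; the averaging step is a suggested technique, not part of the original answer):

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import StratifiedKFold

cancer = datasets.load_breast_cancer()
x, y = cancer.data, cancer.target

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
clf = tree.DecisionTreeClassifier(criterion="gini", random_state=1)

# Fit one tree per fold and keep its feature importances.
importances = []
for train_index, test_index in skf.split(x, y):
    clf.fit(x[train_index], y[train_index])
    importances.append(clf.feature_importances_)

# Average across the 10 folds for a more stable ranking.
mean_importance = np.mean(importances, axis=0)

# Show the five features with the highest average importance.
top5 = sorted(zip(cancer.feature_names, mean_importance),
              key=lambda t: -t[1])[:5]
for name, imp in top5:
    print(f"{name}: {imp:.3f}")
```

Each fitted tree's importances sum to 1, so the fold-averaged vector does too; features that rank highly across all folds are the ones the model relies on consistently.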
Answered By - seralouk