Issue
When using stratified k-fold cross-validation, we fit the model once per fold, so with 5 folds, for example, we end up with 5 fitted models. My question is: which one is the final model to use for prediction? For example, the code below reports the accuracy for each of 10 folds. Which of those fold models should be used after training and fitting the data? Do we just pick the model from the fold with the highest accuracy?
https://www.geeksforgeeks.org/stratified-k-fold-cross-validation/
# Import Required Modules.
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model
from sklearn import datasets
# Fetching features and target variables in array format.
cancer = datasets.load_breast_cancer()
# Input_x_Features.
x = cancer.data
# Input_ y_Target_Variable.
y = cancer.target
# Feature Scaling for input features.
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)
# Create classifier object.
lr = linear_model.LogisticRegression()
# Create StratifiedKFold object.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []
for train_index, test_index in skf.split(x, y):
    x_train_fold, x_test_fold = x_scaled[train_index], x_scaled[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    lr.fit(x_train_fold, y_train_fold)
    lst_accu_stratified.append(lr.score(x_test_fold, y_test_fold))
# Print the output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy That can be obtained from this model is:',
      max(lst_accu_stratified)*100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified)*100, '%')
print('\nOverall Accuracy:',
      mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', stdev(lst_accu_stratified))
Solution
We don't choose any of the models built during k-fold cross-validation as the final model. Instead, we use k-fold CV
(i) to choose the hyperparameters: we pick the setting that gives the highest cross-validated accuracy, and then use those hyperparameters to train the model on the entire dataset.
(ii) to understand the average performance of the model across different subsets of the data by looking at the mean of the fold scores.
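As a rough illustration of (i) and (ii), here is a minimal sketch using scikit-learn's cross_val_score and GridSearchCV on the same breast cancer data. The grid of C values, the max_iter setting, and the choice to wrap the scaler in a pipeline are assumptions made for this example, not part of the original question or answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

x, y = load_breast_cancer(return_X_y=True)

# Putting the scaler in a pipeline means it is re-fit on each training fold
# (an assumption of this sketch; the question's code scales once up front).
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# (ii) Estimate average performance. The per-fold models are thrown away;
# only their scores are kept.
scores = cross_val_score(pipe, x, y, cv=skf)
print("Mean CV accuracy:", scores.mean())

# (i) Choose hyperparameters by mean CV score.
# The candidate values for C below are illustrative assumptions.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=skf, scoring="accuracy")
search.fit(x, y)

# With the default refit=True, best_estimator_ has been retrained on the
# entire dataset using the selected hyperparameters. This is the final model.
final_model = search.best_estimator_
print("Chosen C:", search.best_params_["logisticregression__C"])
print("Predictions for the first 5 rows:", final_model.predict(x[:5]))
```

If there are no hyperparameters to tune, the same idea reduces to reporting scores.mean() and then calling pipe.fit(x, y) once on all the data to get the model used for prediction.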
Answered By - Developer