Issue
I'm using GridSearchCV to find the best hyperparameters for my SVM model, but I'm a little confused about the scoring. This is my grid search code:
# Train an SVM with GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('SVM', SVC(kernel='rbf', decision_function_shape='ovo'))
])

param_grid = {
    'SVM__C': [1, 10, 100, 1000],
    'SVM__gamma': [1, 0.1, 0.01, 0.001]
}

clf = GridSearchCV(pipe, param_grid, scoring='accuracy', verbose=3, cv=5)
clf.fit(X_train, y_train)
Output:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('SVM',
                                        SVC(decision_function_shape='ovo'))]),
             param_grid={'SVM__C': [1, 10, 100, 1000],
                         'SVM__gamma': [1, 0.1, 0.01, 0.001]},
             scoring='accuracy', verbose=3)
Then I tried to print the best score and the test accuracy:
print('Best score: ', clf.best_score_)
print('Test Accuracy: ', clf.score(X_test, y_test))
And it returns
Best score: 0.5501906602583355
Test Accuracy: 0.5809569840502659
Why are the two scores different? As far as I know, best_score_ is the max value of mean_test_score in cv_results_, but why is the test accuracy higher than the best score? I am still confused about this.
Solution
TLDR: The two scores do not refer to the same 'test' set. One is the 'test' score from the cross-validation folds; the other is from the separate, held-out test set.
This is because the CV (cross-validation) is done on the training data you provided (here X_train and y_train). best_score_ is the best mean accuracy achieved on the held-out validation folds of your training data, i.e. the highest mean_test_score in cv_results_.
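You can verify this relationship yourself after fitting. A minimal sketch (using standard GridSearchCV attributes, and assuming a single-metric search with the default refit=True, as in your code):

import numpy as np

# best_index_ points at the winning parameter combination in cv_results_
best_idx = clf.best_index_
print(clf.best_params_)
print(clf.cv_results_['mean_test_score'][best_idx])  # equals clf.best_score_
print(np.max(clf.cv_results_['mean_test_score']))    # same value for a single-metric search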
On the other hand, clf.score(X_test, y_test) gives you the accuracy of the refitted best estimator on your test set. These two need not (and in general will not) be equal, because the test data is not part of your training data - or at least should not be.
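To get a feel for how much the CV estimate itself can vary, you can inspect the per-fold scores of the best candidate. A minimal sketch, assuming cv=5 as in your search (the split0_test_score ... split4_test_score keys are standard cv_results_ entries):

# Per-fold accuracies of the best parameter combination (one per CV fold)
fold_scores = [clf.cv_results_[f'split{k}_test_score'][clf.best_index_]
               for k in range(5)]
print(fold_scores)      # the five held-out fold accuracies
print(clf.best_score_)  # their mean

Since clf.score(X_test, y_test) is a separate estimate computed on data the search never saw, it can land above or below that mean, which is exactly what you observed.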
Answered By - Martin Dinov