Issue
I'm thinking of plotting a graph where the x-axis is the complexity of the model (e.g., `n_neighbors` in KNN) and the y-axis is the error (e.g., mean squared error).
I'm currently using `GridSearchCV`, and I realise that `.cv_results_` only shows the train data error.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

KNN = {
    'classifier': [KNeighborsClassifier()],
    'classifier__n_neighbors': [i for i in range(10, 200, 10)],
}
pipeline = Pipeline(
    steps=[('classifier', KNN["classifier"][0])]
)
grid_search_knn = GridSearchCV(pipeline, [KNN], n_jobs=-1).fit(x_train, y_train)
```

`grid_search_knn.cv_results_` would give me:
```
'split0_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split1_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split2_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split3_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'split4_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'mean_test_score': array([0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96]),
'std_test_score': array([9.84e-04, 8.70e-04, 1.30e-03, 1.09e-03, 7.68e-04, 9.61e-04, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16, 1.11e-16]),
'rank_test_score': array([3, 2, 1, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])}
```
Firstly, I don't understand the different kinds of test scores. Are they the training scores? If they are, what metric are they using: accuracy, R², precision, or recall?

Secondly, how would I use `model.predict(X_test)` for each iteration to find the error for the test dataset, so that I can plot the graph described at the top?
Solution
> Firstly, I don't understand the different kinds of test scores. Are they the training scores? If they are, what metric are they using: accuracy, R², precision, or recall?
No, these are the validation scores. Each value is an array of 19 numbers, corresponding to the 19 different values of `n_neighbors`. You did not specify the `cv` parameter, so `GridSearchCV` defaulted to splitting your training set into 5 parts and doing 5 runs, each time using one of those parts as the validation set and the other 4 to train the model. This is what the "split0" to "split4" names refer to. The values are all accuracy scores.
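For reference, those implicit defaults can be spelled out explicitly; a minimal sketch (passing `cv=5` and `scoring="accuracy"` here is an assumption that simply mirrors what the defaults resolve to for a classifier):

```python
from sklearn.model_selection import GridSearchCV

# Equivalent to the call in the question: cv defaults to 5-fold
# (stratified for classifiers), and scoring=None falls back to the
# estimator's .score method, which is accuracy for classifiers.
grid_search_knn = GridSearchCV(
    pipeline, [KNN], cv=5, scoring="accuracy", n_jobs=-1
).fit(x_train, y_train)
```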
For example, `'split0_test_score': array([0.97, ...` tells you that after training the model with `n_neighbors=10` on 80% of the training data, according to the first split, the model classified 97% of the instances in the remaining training data correctly.
The mean scores over the 5 splits and the corresponding standard deviations and ranks are also included in the results.
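This means you can already plot the validation error against `n_neighbors` from these results; a minimal sketch, assuming `grid_search_knn` is the fitted object from the question:

```python
import matplotlib.pyplot as plt

results = grid_search_knn.cv_results_
ks = list(results["param_classifier__n_neighbors"])  # the 19 grid values
val_error = 1 - results["mean_test_score"]           # accuracy -> error

plt.plot(ks, val_error, marker="o", label="validation error")
plt.xlabel("n_neighbors")
plt.ylabel("error (1 - accuracy)")
plt.legend()
plt.show()
```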
> Secondly, how would I use `model.predict(X_test)` for each iteration to find the error for the test dataset, so that I can plot the graph described at the top?
Note that `GridSearchCV` has a parameter `return_train_score`. Quoting from the scikit-learn docs:

> return_train_score : bool, default=False
>
> If `False`, the `cv_results_` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
So you can set that to `True` to get the training scores and plot those as one curve, in addition to the validation curve. Scikit-learn even has a `validation_curve` function to help with this, and an example of how to use it.
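A sketch of how `validation_curve` could be applied here, reusing the pipeline and parameter range from the question (the variable names are assumptions):

```python
from sklearn.model_selection import validation_curve

param_range = list(range(10, 200, 10))
train_scores, val_scores = validation_curve(
    pipeline, x_train, y_train,
    param_name="classifier__n_neighbors",
    param_range=param_range,
    cv=5,
)
# Each row holds the 5 per-fold accuracy scores for one parameter
# value; average over the folds (axis=1) before plotting the curves.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
```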
However, note that the plot you show does not mention (cross-)validation at all, and you say you have a separate test set that you want to use for the plot. So instead of doing any cross-validation, a simpler approach is to iterate over the `n_neighbors` values, fit the model to the entire training set each time, and compute the accuracy scores of that model (e.g. with `accuracy_score`), one for the training set and one for the test set. This approach is possible in your case because the goal is to produce the plot, and you are not interested in any further hyperparameters apart from `n_neighbors`.
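A minimal sketch of that loop, assuming `x_train`, `y_train`, `x_test`, and `y_test` are defined as in the question:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

ks = range(10, 200, 10)
train_errors, test_errors = [], []
for k in ks:
    # Fit on the entire training set for each value of n_neighbors,
    # then record the error (1 - accuracy) on both sets.
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    train_errors.append(1 - accuracy_score(y_train, model.predict(x_train)))
    test_errors.append(1 - accuracy_score(y_test, model.predict(x_test)))

plt.plot(ks, train_errors, marker="o", label="train error")
plt.plot(ks, test_errors, marker="o", label="test error")
plt.xlabel("n_neighbors")
plt.ylabel("error (1 - accuracy)")
plt.legend()
plt.show()
```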
Answered By - Arne