Issue
I'm trying to build my first KNN classifier using scikit-learn. I've been following the User Guide and other online examples, but there are a few things I am unsure about. For this post, let's use the following:
X = data, Y = target
1) Most introductions to machine learning that I've read say you want a training set, a validation set, and a test set. From what I understand, cross-validation allows you to combine the training and validation sets to train the model, and then you should test it on the test set to get a score. However, I have seen in papers that in a lot of cases you can just cross-validate on the entire dataset and then report the CV score as the accuracy. I understand that in an ideal world you would want to test on separate data, but if this is legitimate I would like to cross-validate on my entire dataset and report those scores.
2) So starting the process
I define my KNN Classifier as follows
knn = KNeighborsClassifier(algorithm = 'brute')
I search for best n_neighbors using
clf = GridSearchCV(knn, parameters, cv=5)
Now if I say
clf.fit(X,Y)
I can check the best parameter using
clf.best_params_
and then I can get a score
clf.score(X,Y)
But, as I understand it, this hasn't cross-validated the model, as it only gives one score?
If clf.best_params_ shows n_neighbors = 14, could I then go on with
knn2 = KNeighborsClassifier(n_neighbors = 14, algorithm='brute')
cross_val_score(knn2, X, Y, cv=5)
Now I know the data has been cross validated but I don't know if it is legitimate to use clf.fit to find the best parameter and then use cross_val_score with a new knn model?
3) I understand that the 'proper' way to do it would be as follows
Split to X_train, X_test, Y_train, Y_test, Scale train sets -> apply transform to test sets
knn = KNeighborsClassifier(algorithm = 'brute')
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_train,Y_train)
clf.best_params_
and then I can get a score
clf.score(X_test,Y_test)
In this case, is the score calculated using the best parameter?
I hope that this makes sense. I've been trying to find as much as I can without posting but I have come to the point where I think it would be easier to get some direct answers.
In my head I am trying to get some cross-validated scores using the whole dataset but also use a gridsearch (or something similar) to fine tune the parameters.
Thanks in advance
Solution
Yes, you can cross-validate on your entire dataset; that is viable. However, I still suggest that you at least split your data into two sets, one for CV and one for testing.
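A minimal sketch of that two-set split, for illustration only. The question does not show its data or its parameters list, so the iris dataset, the 1 to 20 range for n_neighbors, the 25% test size and the random_state below are all assumptions. Putting the scaler in a pipeline is also my choice; it is one way to get the "scale train, then transform test" step from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the question's X and Y are not shown.
X, Y = load_iris(return_X_y=True)

# Hold out a test set that the grid search never sees.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=0
)

# The scaler is fitted only on the training part of each CV fold,
# and later only on X_train before X_test is transformed.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(algorithm="brute"))

# Hypothetical grid: the original parameters list was not given.
# make_pipeline names each step after its lowercased class name.
parameters = {"kneighborsclassifier__n_neighbors": list(range(1, 21))}

clf = GridSearchCV(pipe, parameters, cv=5)
clf.fit(X_train, Y_train)

best_k = clf.best_params_["kneighborsclassifier__n_neighbors"]
cv_score = clf.best_score_              # mean CV score of the best setting
test_score = clf.score(X_test, Y_test)  # score on the held-out test set

print(best_k, cv_score, test_score)
```

Here cv_score comes from the cross-validation part and test_score from data that played no role in choosing n_neighbors.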
According to the documentation, the .score function is supposed to return a single float value, which is the score of the best estimator (the best-scoring estimator you get from fitting your GridSearchCV) on the given X, Y.
If you saw that the best parameter is 14, then yes, you can go on with using it in your model, but if you gave the search more parameters you should set all of them. (I say that because you haven't given your parameters list.)
And yes, it is legitimate to run CV once again, just to check whether this model is as good as it should be.
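The points above can be sketched as follows. This is only an illustration under assumptions: the iris data and the 1 to 20 grid for n_neighbors stand in for the data and parameters list that the question does not show.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the question's X and Y are not shown.
X, Y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(algorithm="brute")
parameters = {"n_neighbors": list(range(1, 21))}  # hypothetical grid

clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X, Y)

# With the default refit=True, clf.score delegates to best_estimator_,
# which was refitted on all of X, Y. Hence the single float.
single_score = clf.score(X, Y)
same_score = clf.best_estimator_.score(X, Y)

# The cross-validated results are stored on the fitted search object.
mean_cv_score = clf.best_score_
best_row = clf.best_index_
fold_scores = [
    clf.cv_results_[f"split{i}_test_score"][best_row] for i in range(5)
]

# Re-checking with a fresh model: unpack best_params_ so that every
# tuned parameter is set, not just n_neighbors.
knn2 = KNeighborsClassifier(algorithm="brute", **clf.best_params_)
recheck_scores = cross_val_score(knn2, X, Y, cv=5)

print(single_score, mean_cv_score, fold_scores, recheck_scores)
```

Since the search and the re-check use the same data and the same cv=5, the re-check should give back the per-fold scores the search already computed, so treat it as a sanity check rather than a fresh estimate. The separate test set suggested earlier is what gives a score that did not take part in picking the parameter.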
Hope that makes the things clearer :)
Answered By - nitheism