Issue
I'm working on a classification problem with 2,500 rows, 25,000 columns, and 88 classes that are unevenly distributed.
Then something very strange happened:
When I run a dozen different train/test splits, I always get scores around 60%.
But when I run cross-validations, I always get scores around 50%. It has nothing to do with the unequal class distribution either: when I put stratify=y on the train_test_split I stay around 60%, and when I use a StratifiedKFold I stay around 50%.
Which score should I trust, and why the difference? I thought a cross-validation was just a succession of train/test splits, each with a different split, so nothing seems to justify such a gap in score.
Solution
Short answer: add shuffle=True to your KFold: cross_val_score(forest, X, y, cv=KFold(shuffle=True))
Long answer: the difference between a succession of train_test_split calls and a cross-validation with a plain KFold is that train_test_split shuffles the rows before splitting them into train and test sets (its shuffle parameter defaults to True), while KFold does not shuffle by default. The score difference is likely because your dataset is sorted in a way that biases the folds, for example with rows ordered by class. So just add shuffle=True to your KFold (or your StratifiedKFold) and that's all you need to do.
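A minimal sketch of the effect on a synthetic dataset (the dataset, forest, and fold counts here are illustrative assumptions, not the asker's actual data): sorting the rows by class makes an unshuffled KFold put whole classes into the test fold that the model never saw in training, which tanks the score, while shuffle=True restores it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy dataset, then sort the rows by label to mimic a class-ordered file.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
order = np.argsort(y)
X, y = X[order], y[order]

forest = RandomForestClassifier(random_state=0)

# Without shuffling, each contiguous fold is dominated by a few classes,
# so the model is tested on classes it barely (or never) trained on.
unshuffled = cross_val_score(forest, X, y, cv=KFold(n_splits=5))

# Shuffling before splitting mixes every class into every fold.
shuffled = cross_val_score(forest, X, y,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0))

print("unshuffled:", unshuffled.mean())
print("shuffled:  ", shuffled.mean())
```

On a class-sorted dataset like this, the unshuffled mean score collapses while the shuffled one stays at a normal level, which is exactly the 60% vs 50% pattern described in the question, just more extreme.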
Answered By - Chaussette