Issue
I used the "Stroke" data set from kaggle to compare the accuracy of the following different models of classification:
K-Nearest-Neighbor (KNN)
.Decision Trees
.Adaboost
.Logistic Regression
.
I did not implement the models myself, but used sklearn library's implementations.
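Roughly, the setup looks like the sketch below (the CSV filename, the choice of feature columns, and the minimal preprocessing here are simplified assumptions for brevity):

```python
# Minimal sketch of the comparison; filename, feature columns, and
# fillna(0) preprocessing are assumptions, not the exact original code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # assumed filename
X = df[["age", "avg_glucose_level", "bmi"]].fillna(0)   # assumed numeric features
y = df["stroke"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```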
After training the models I ran them on the test data and printed each model's accuracy, and these are the results:
As you can see, KNN, AdaBoost, and Logistic Regression gave me the exact same accuracy.
My question is: does it make sense that there is not even a small difference between them, or did I make a mistake somewhere along the way (even though I only used sklearn's implementations)?
Solution
In general, achieving identical scores is unlikely, and the usual explanations are:
- a bug in how the scores are reported
- a bug in the data processing
- the reported score corresponds to a degenerate solution
And the last explanation is probably the case here. The stroke dataset has 249 positive samples out of roughly 5,000 data points, so a model that always predicts "no stroke" will score roughly 95% accuracy. My best guess is that all of your models failed to learn anything and are just constantly outputting "0". You can verify this directly by inspecting the predictions, as in the sketch below.
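A quick check for the degenerate solution, assuming `model`, `X_test`, and `y_test` are a fitted estimator and test split as in the question (names assumed):

```python
# Check for a degenerate "always predict 0" classifier.
# `model`, `X_test`, `y_test` are assumed to match the question's setup.
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
print(np.unique(y_pred, return_counts=True))  # a single class here means a constant model
print(confusion_matrix(y_test, y_pred))       # an all-zero positive column confirms it
```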
In general, accuracy is not the right metric for highly imbalanced datasets. Consider balanced accuracy, F1, etc.
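For example, with sklearn's built-in metrics (same assumed names as above): a constant "all zeros" model drops to 0.5 balanced accuracy and 0.0 F1 on the positive class, making the failure obvious.

```python
# Imbalance-aware metrics; `model`, `X_test`, `y_test` assumed as above.
from sklearn.metrics import balanced_accuracy_score, f1_score, classification_report

y_pred = model.predict(X_test)
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))      # 0.5 for a constant model
print("F1 (positive class):", f1_score(y_test, y_pred, zero_division=0))  # 0.0 for a constant model
print(classification_report(y_test, y_pred, zero_division=0))
```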
Answered By - lejlot