Issue
While working on my ML project, in the modeling phase I wanted to start by trying all the candidate models, picking the best one, and then fine-tuning it. I expected that would give me the best model for the dataset, but along the way I found an interesting result.
For the multiple-model training phase, to save time I used around 3,500 rows out of my full 70,692, which is just about 4.9% of the data. When training finished, these were the results for all the models:
=================================== Accuracy ===================================
name accuracy
3 Support Vector Machines 0.752571
0 Logistic Regression 0.751429
9 Bagging Classifier 0.746857
1 Random Forest 0.742857
2 LightGBM 0.742857
6 Bernoulli Naive Bayes 0.726857
4 XGBoost 0.724571
5 Gaussian Naive Bayes 0.721143
7 KNN 0.674857
8 Decision Tree 0.661143
================================== Precision ===================================
name precision
0 Logistic Regression 0.761427
9 Bagging Classifier 0.747583
3 Support Vector Machines 0.745568
6 Bernoulli Naive Bayes 0.743151
1 Random Forest 0.743041
2 LightGBM 0.739451
5 Gaussian Naive Bayes 0.737986
4 XGBoost 0.728355
7 KNN 0.69409
8 Decision Tree 0.677714
============================== True Positive Rate ==============================
name true_positive_rate
3 Support Vector Machines 0.790929
2 LightGBM 0.775442
9 Bagging Classifier 0.769912
1 Random Forest 0.767699
0 Logistic Regression 0.755531
4 XGBoost 0.744469
6 Bernoulli Naive Bayes 0.720133
5 Gaussian Naive Bayes 0.713496
7 KNN 0.662611
8 Decision Tree 0.655973
================================= Specificity ==================================
name specificity
3 Support Vector Machines 0.790929
2 LightGBM 0.775442
9 Bagging Classifier 0.769912
1 Random Forest 0.767699
0 Logistic Regression 0.755531
4 XGBoost 0.744469
6 Bernoulli Naive Bayes 0.720133
5 Gaussian Naive Bayes 0.713496
7 KNN 0.662611
8 Decision Tree 0.655973
=================================== F1 Score ===================================
name score
3 Support Vector Machines 0.767579
9 Bagging Classifier 0.758583
0 Logistic Regression 0.758468
2 LightGBM 0.757019
1 Random Forest 0.755169
4 XGBoost 0.736324
6 Bernoulli Naive Bayes 0.731461
5 Gaussian Naive Bayes 0.725534
7 KNN 0.677985
8 Decision Tree 0.666667
From this I didn't know which model to use, so I decided to try again with 7,000 rows, almost double. At first I thought the order would stay the same and only the accuracy would change, but lo and behold the ranking changed. This was my result with 7,000 rows:
=================================== Accuracy ===================================
name accuracy
9 Bagging Classifier 0.736571
2 LightGBM 0.735429
3 Support Vector Machines 0.734
0 Logistic Regression 0.732857
1 Random Forest 0.730571
4 XGBoost 0.721714
6 Bernoulli Naive Bayes 0.72
5 Gaussian Naive Bayes 0.711429
7 KNN 0.674
8 Decision Tree 0.625429
================================== Precision ===================================
name precision
0 Logistic Regression 0.727174
6 Bernoulli Naive Bayes 0.726908
5 Gaussian Naive Bayes 0.725281
9 Bagging Classifier 0.719153
1 Random Forest 0.717895
3 Support Vector Machines 0.716049
2 LightGBM 0.714576
4 XGBoost 0.712533
7 KNN 0.674612
8 Decision Tree 0.63009
============================== True Positive Rate ==============================
name true_positive_rate
2 LightGBM 0.794466
9 Bagging Classifier 0.786561
3 Support Vector Machines 0.785997
1 Random Forest 0.770186
0 Logistic Regression 0.755505
4 XGBoost 0.754376
6 Bernoulli Naive Bayes 0.715415
5 Gaussian Naive Bayes 0.6917
7 KNN 0.687182
8 Decision Tree 0.629023
================================= Specificity ==================================
name specificity
2 LightGBM 0.794466
9 Bagging Classifier 0.786561
3 Support Vector Machines 0.785997
1 Random Forest 0.770186
0 Logistic Regression 0.755505
4 XGBoost 0.754376
6 Bernoulli Naive Bayes 0.715415
5 Gaussian Naive Bayes 0.6917
7 KNN 0.687182
8 Decision Tree 0.629023
=================================== F1 Score ===================================
name score
2 LightGBM 0.752406
9 Bagging Classifier 0.751348
3 Support Vector Machines 0.749394
1 Random Forest 0.743122
0 Logistic Regression 0.741069
4 XGBoost 0.732858
6 Bernoulli Naive Bayes 0.721116
5 Gaussian Naive Bayes 0.708092
7 KNN 0.680839
8 Decision Tree 0.629556
The order changed, and that surprised me. So my question is: does more training data change a model's accuracy relative to the other models? Or, put another way, why does the change in model ranking above happen?
One more question: is there any way to plot all this data to make finding the all-around best model easier? I have the metrics in three different pandas DataFrames ready for plotting; I just don't know which plot to make, or even how to make it.
Otherwise, that is all, and thank you in advance. :)
Do note that when I say 3,500 and 7,000, I mean the total number of rows used, including both training and testing. I split the whole set into 75% and 25% pieces and use 75% for training and 25% for testing.
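The split described above can be sketched as follows (the DataFrame here is a stand-in for the real 70,692-row dataset, and the column names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the real dataset: 3,500 rows, one feature, binary target.
df = pd.DataFrame({"feature": range(3500), "target": [i % 2 for i in range(3500)]})

X = df.drop(columns="target")
y = df["target"]

# 75% train / 25% test; stratify keeps the class balance the same in both pieces.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 2625 875
```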
Solution
Q1. Does a change in data size change a model's accuracy relative to other models?
A. Sometimes yes and sometimes no.
Possibilities for yes:
- If the change in data size is large, there is a higher chance that the order of the models' performance metrics will shuffle, unless the added data contains no new randomness.
- Adding more data means adding more outliers, more samples with unusual feature values (for example, values in the 3rd and 4th standard deviations), and possibly a change in the overall distribution of the data.
- In this scenario you added 100% of the previous data (3,500 is 100% of 3,500), doubling the dataset. Suppose the original data contained, say, 100 outliers; the added half may contribute a similar number, and every model now has to fit this extra 100% of data as well.
- The first 50% of the data may suit Support Vector Machines and Logistic Regression best, while the added randomness and the possible shift in distribution may make the full dataset suit the Bagging Classifier and LightGBM better.
Possibilities for no:
- If the change in data size is small, there is a lower chance that the metric order will shuffle, unless the added data is very different from the original.
- Adding little data means adding few outliers (5% of the 100 outliers assumed above is only 5) and few unseen feature values.
- If you add 5% more data (175 rows on top of 3,500), it may contain only about 5 outliers, and each model trains on only 5% extra data.
- The first 95% of the data may suit Support Vector Machines and Logistic Regression, and the extra 5% may suit some other model; but on average, since 95% of the data suits SVM and LR best, there is a good chance the full 100% does too.
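The effect described above can be reproduced on a toy problem: train the same models at two sample sizes and compare the rankings. The dataset and the two models here are stand-ins, not the asker's data, and the ranking may or may not flip on any given run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the real data.
X, y = make_classification(n_samples=7000, n_features=20, random_state=0)

# Score each model at two dataset sizes, mirroring the 3,500 vs 7,000 experiment.
for n in (3500, 7000):
    scores = {
        name: cross_val_score(model, X[:n], y[:n], cv=5).mean()
        for name, model in [
            ("LogisticRegression", LogisticRegression(max_iter=1000)),
            ("DecisionTree", DecisionTreeClassifier(random_state=0)),
        ]
    }
    # Models sorted best-first by mean cross-validated accuracy at this size.
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(n, ranking)
```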
Note: also, in your situation there is not much difference in accuracy between SVM, LR, the Bagging Classifier, and LightGBM on the first 50% of the data, so there is a higher chance of the leaderboard shuffling when the other 50% is added.
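As for the second question (plotting): one option is to merge the per-metric DataFrames on the model name and draw a grouped bar chart, so every model's metrics sit side by side. The small DataFrames below are stand-ins for the three mentioned in the question; the column names are assumptions based on the printed tables.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; we only save the figure to a file
import matplotlib.pyplot as plt
import pandas as pd

# Stand-ins for the three per-metric DataFrames from the question.
acc = pd.DataFrame({"name": ["SVM", "LogReg"], "accuracy": [0.734, 0.733]})
prec = pd.DataFrame({"name": ["SVM", "LogReg"], "precision": [0.716, 0.727]})
f1 = pd.DataFrame({"name": ["SVM", "LogReg"], "score": [0.749, 0.741]})

# One row per model, one column per metric.
metrics = acc.merge(prec, on="name").merge(f1, on="name")
metrics = metrics.rename(columns={"score": "f1"}).set_index("name")

# Grouped horizontal bars: each model gets one bar per metric.
ax = metrics.plot.barh(figsize=(8, 4))
ax.set_xlabel("score")
plt.tight_layout()
plt.savefig("model_comparison.png")
```

With all ten models and five metrics in one chart, the all-around best model is the one whose bars are consistently near the top across metrics.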
Answered By - Hari Battula