Monday, November 8, 2021

[FIXED] My classifier gives 1.0 accuracy on ALL test data set (except wrong photos)

Tags: dataset, machine-learning, python, scikit-learn

Issue

Have:

Dataset: 115 color images of size 256x256; all photos belong to ONE class (a cartoon person).
Classifiers: KNN and Random Forest Classifier.

Comment: I wanted to make a classifier that recognizes ONE cartoon person in a photo, so I collected a dataset, digitized it, and passed it to the fit method of the classifiers. At first I chose SGDClassifier, but it only works with two or more classes in the dataset, so I switched to KNN and Random Forest Classifier.

Problem: when I test my trained classifiers, I get a 1.0 score on EVERY photo. I tested a photo of the target person, a photo of another object (a different cartoon person), and a photo of a black screen, and they all got a 1.0 score.

Can somebody help me, please? :( I have been stuck on this for 2 days already and don't see a way to solve it by myself. I looked at many solutions, but none of them worked in my case.

Dataset:

The shape of my dataset numpy array is (115, 196608): each row is one image flattened to 256*256*3 = 196608 values.

The dataset is a 2D array because scikit-learn classifiers expect the training data as a 2D array of shape (n_samples, n_features).
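As a rough illustration of that layout, the sketch below flattens a stack of images into one row per image. The pixel values are random stand-ins, not the asker's photos, and the variable names are made up for the example.

```python
import numpy as np

# Stand-in for 115 loaded color images (height, width, channels).
# Random pixels are an assumption; the real photos are not available here.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(115, 256, 256, 3), dtype=np.uint8)

# Flatten every image into one row: (115, 256, 256, 3) -> (115, 196608).
flat = images.reshape(len(images), -1)
print(flat.shape)

# The flattening is reversible, so a row can be turned back into an image.
restored = flat[0].reshape(256, 256, 3)
```

Reshaping a row back to (256, 256, 3) is the same check the asker describes: confirming that the digitized data still decodes into the original picture.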

Code: it's not complete, just an example

    import os

    import cv2
    import numpy
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholders from the question: 115 flattened photos and their 115 labels.
    train_data_values = numpy.array([*115 photos*])
    train_data_labels = numpy.array([*115 labels*])
    # In fact, all my labels equal "1"; there is no other value.

    # DATASET_IMAGES_DIRECTORY is defined elsewhere in my code.
    test_dir = os.path.join(DATASET_IMAGES_DIRECTORY, "test")

    # Trying KNN
    KNN_clf = KNeighborsClassifier(n_neighbors=16, weights='distance')
    KNN_clf.fit(train_data_values, train_data_labels)

    test_im = cv2.imread(os.path.join(test_dir, "test2.png"))

    # Flatten the 256x256x3 image into a single row before predicting.
    KNN_clf.predict_proba(test_im.reshape(1, 3*256*256)) # Returns array([[1.]])

    # Trying Random Forest Classifier
    RF_clf = RandomForestClassifier()
    RF_clf.fit(train_data_values, train_data_labels)

    test_im = cv2.imread(os.path.join(test_dir, "test.png"))

    RF_clf.predict_proba(test_im.reshape(1, 3*256*256)) # Returns array([[1.]])

Comment: I inspected the images in my numpy dataset because I thought they might have been digitized badly, but no: they can easily be rebuilt from array to image. P.S. The parameters for the KNN classifier are arbitrary; I tried a grid search for the best parameters, but it again gave 1.0 scores everywhere.

Solution

All classifiers learn their scores from their training data. The scores of most classifiers (including random forest and KNN) have a probabilistic meaning: they are tuned to reflect the class distribution of the training data as closely as possible.

So if your training data consists entirely of a single class, the classifier learns that any sample belongs to this class with 100% probability, and it will predict this class with absolute confidence, whatever the input looks like.
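The effect can be reproduced without any images. The following is a minimal sketch on small synthetic feature vectors (an assumption standing in for the flattened photos): both classifiers are fit on a single label and then asked about a random sample and an all-zero "black screen" sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 115 samples, every label equal to 1.
X_train = rng.random((115, 64))
y_train = np.ones(115, dtype=int)

# Two test samples: a random one and an all-zero "black screen".
X_test = np.vstack([rng.random((1, 64)), np.zeros((1, 64))])

knn = KNeighborsClassifier(n_neighbors=16, weights="distance")
knn.fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# Only one class was seen, so predict_proba has a single column.
print(knn.classes_)
knn_proba = knn.predict_proba(X_test)
rf_proba = rf.predict_proba(X_test)
print(knn_proba)
print(rf_proba)
```

With only one column in predict_proba, the probabilities in each row have nowhere else to go, so every row is 1.0 regardless of the input.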

The lesson: to use any classifier you need at least two classes; otherwise the prediction will be more or less meaningless. My recommendation is to add negative samples, that is, samples without your target person, including:

images of other characters from your cartoon and from other cartoons
images with background only and no characters
images of some non-animated objects
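A minimal sketch of what changes once negatives are added, again on synthetic features rather than real images. The two clusters and their parameters are assumptions chosen only to make the classes clearly separable; real photos will be much harder to separate.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical features: positives cluster near 0.8, negatives near 0.2.
X_pos = rng.normal(0.8, 0.05, size=(115, 64))
X_neg = rng.normal(0.2, 0.05, size=(115, 64))

X_train = np.vstack([X_pos, X_neg])
y_train = np.concatenate([np.ones(115, dtype=int), np.zeros(115, dtype=int)])

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# One new sample resembling each class.
X_test = np.vstack([
    rng.normal(0.8, 0.05, size=(1, 64)),  # looks like the target
    rng.normal(0.2, 0.05, size=(1, 64)),  # looks like a negative
])

# Columns follow clf.classes_, i.e. [0, 1]; column 1 is the target class.
proba = clf.predict_proba(X_test)
print(clf.classes_)
print(proba)
```

Now predict_proba has two columns, so the score for the target class can actually drop for inputs that resemble the negatives.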

There are a few exceptions, such as OneClassSVM, that are (presumably) capable of producing meaningful scores when trained on a single class. But you will never know whether they work adequately on your data until you test them with data from several different classes.
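As a sketch of that idea, scikit-learn's OneClassSVM can be fit on positives only and then checked against held-out positives and known negatives. The synthetic clusters and the nu value below are assumptions for illustration, not tuned settings for image data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical features: the target class clusters near 0.8.
X_train = rng.normal(0.8, 0.05, size=(200, 16))

# Fit on positives only; nu bounds the fraction of training outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
ocsvm.fit(X_train)

# Evaluation still needs both kinds of data.
X_in = rng.normal(0.8, 0.05, size=(50, 16))   # held-out positives
X_out = rng.normal(0.2, 0.05, size=(50, 16))  # known negatives

pred_in = ocsvm.predict(X_in)    # +1 = inlier, -1 = outlier
pred_out = ocsvm.predict(X_out)

print("positives accepted:", (pred_in == 1).mean())
print("negatives rejected:", (pred_out == -1).mean())
```

The training step uses one class, but the last two lines are only possible with negatives on hand, which is the point made above about testing.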

Answered By - David Dale

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
