Issue
I have some data (MFCC features for speaker recognition) from two different speakers: 60 vectors of 13 features for each person (120 in total). Each vector has a label (0 or 1). I need to show the results in a confusion matrix, but the GaussianMixture model from sklearn is unstable. On each program run I receive different scores (sometimes accuracy is 0.4, sometimes 0.7 ...). I don't know what I am doing wrong, because I created SVM and k-NN models in the same way and they work fine (stable accuracy around 0.9). Do you have any idea what I am doing wrong?
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.mixture import GaussianMixture

gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train)  # X_train are MFCC vectors, y_train are labels
ygmm_pred_class = gmmclf.predict(X_test)
print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))
Solution
Short answer: you should simply not use a GMM for classification.
Long answer...
From the answer to a relevant thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. [...] GMM simply tries to fit mixture of Gaussians into your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work - but only for trivial problems, where classes are so well separated that even Naive Bayes would work, in general however it is simply invalid tool for the problem.
And a comment by the respondent himself (again, emphasis in the original):
As stated in the answer - GMM is not a classifier, so asking if you are using "GMM classifier" correctly is impossible to answer. Using GMM as a classifier is incorrect by definition, there is no "valid" way of using it in such a problem as it is not what this model is designed to do. What you could do is to build a proper generative model per class. In other words construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier. See github.com/scikit-learn/scikit-learn/pull/2468
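The "one GMM per label" idea from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the MFCC vectors (two Gaussian blobs, 60 x 13 each), and the names `models`, `log_priors` and `log_post` are mine, not part of scikit-learn. Each class gets its own density model, and a test vector is assigned to the class whose model gives it the highest log-likelihood plus log prior.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MFCC data (assumption): 2 speakers x 60 vectors x 13 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 13)), rng.normal(2.0, 1.0, (60, 13))])
y = np.repeat([0, 1], 60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

classes = np.unique(y_train)

# One density model per label, each fitted only on that label's training vectors.
models = {
    c: GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(
        X_train[y_train == c]
    )
    for c in classes
}

# Log prior of each class, estimated from the training label frequencies.
log_priors = {c: np.log(np.mean(y_train == c)) for c in classes}

# score_samples gives the per-sample log-likelihood under each class model;
# adding the log prior yields an unnormalized log posterior per class.
log_post = np.column_stack(
    [models[c].score_samples(X_test) + log_priors[c] for c in classes]
)
y_pred = classes[np.argmax(log_post, axis=1)]

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(acc)
print(cm)
```

Here the labels are used explicitly (to split the training data per class), so the predicted values are real class labels and can go straight into a confusion matrix. The number of components per class is a free choice; 2 is used only to mirror the question.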
(For what it may be worth, note that the respondent is a research scientist at DeepMind and the very first person to be awarded the machine-learning gold badge here at SO.)
To elaborate further (and that's why I didn't simply flag the question as a duplicate):
It is true that in the scikit-learn documentation there is a post titled GMM classification:
Demonstration of Gaussian mixture models for classification.
which I guess did not exist back in 2017, when the above response was written. But, digging into the provided code, you will realize that the GMM models are actually used there in the way proposed by lejlot above; there is no statement of the form classifier.fit(X_train, y_train); all usage is of the form classifier.fit(X_train), i.e. without using the actual labels.
This is exactly what we would expect from a clustering-like algorithm (which is indeed what GMM is), and not from a classifier. It is true again that scikit-learn offers an option for also providing the labels in the GMM fit method, fit(self, X, y=None), which you have actually used here (and which, again, probably did not exist back in 2017, as the above response implies). But, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share of practices that may look sensible from a purely programming perspective but make very little sense from a modeling perspective).
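For reference, the scikit-learn documentation lists the y argument of GaussianMixture.fit as not used and present only for API consistency. A quick check along the following lines is consistent with that. It is a sketch on synthetic data (an assumption, not the asker's MFCC vectors): with the same random_state, fitting with the labels, without them, or with shuffled labels should give the same model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the MFCC data (assumption): 2 groups x 60 vectors x 13 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 13)), rng.normal(2.0, 1.0, (60, 13))])
y = np.repeat([0, 1], 60)


def make_gmm():
    # Same configuration and seed every time, so only the y argument differs.
    return GaussianMixture(n_components=2, covariance_type="diag", random_state=0)


with_labels = make_gmm().fit(X, y)
without_labels = make_gmm().fit(X)
shuffled_labels = make_gmm().fit(X, rng.permutation(y))

# If y is ignored, all three fits produce identical parameters and predictions.
same_with_and_without = np.allclose(
    with_labels.means_, without_labels.means_
) and np.array_equal(with_labels.predict(X), without_labels.predict(X))
same_with_shuffled = np.allclose(
    with_labels.means_, shuffled_labels.means_
) and np.array_equal(with_labels.predict(X), shuffled_labels.predict(X))

print(same_with_and_without, same_with_shuffled)
```

In other words, passing y_train to fit does not turn the mixture into a supervised model; the labels never reach the estimation.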
A final word: although fixing the random seed (as suggested in a comment) may appear to "work", trusting a "classifier" that gives a range of accuracies between 0.4 and 0.7 depending on the random seed is arguably not a good idea...
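To see where the seed dependence comes from, it may help to look at what predict returns: component indices, not class labels. The sketch below uses deliberately easy synthetic data (an assumption; two well-separated blobs instead of real MFCC vectors) and compares the raw "accuracy" across a few seeds. Which blob ends up as component 0 is arbitrary, so the same clustering may score a or 1 - a depending on the initialization.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture

# Synthetic, well-separated stand-in data (assumption): 2 groups x 60 vectors x 13 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 13)), rng.normal(5.0, 1.0, (60, 13))])
y = np.repeat([0, 1], 60)

accuracies = {}
for seed in range(5):
    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=seed)
    components = gmm.fit(X).predict(X)  # component indices, not speaker labels
    accuracies[seed] = accuracy_score(y, components)

for seed, acc in accuracies.items():
    # Swapping the two component indices turns an accuracy of a into 1 - a.
    print(seed, acc, 1.0 - acc)
```

On harder, overlapping data such as real MFCC frames, the components need not line up with the speakers at all, which would be consistent with scores drifting between runs as in the question.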
Answered By - desertnaut