Issue
I'm struggling to understand how to find the most informative features used for text classification.
I've been trying two methods found here on Stack Overflow.
The first:
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

class_labels = clf.classes_
And the second:
def printNMostInformative(vectorizer, clf, N):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)
In both cases I only get the accuracy and an empty list:
accuracy: 0.37922705314009664
Top 10 features used to predict:
Class 1 best:
(0.008202041988712563, '')
Class 2 best:
(0.008202041988712563, '')
The entire code, if needed, is really similar to this notebook, since it's an update of it with a new dataset: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/machine%20learning%20spaCy.ipynb
Solution
For a linear SVM it can be done like below:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
y = [1, 0, 0, 1]  # labels
feature_names = vectorizer.get_feature_names_out()

clf = svm.SVC(kernel='linear')
clf.fit(X, y)
# coef_ is sparse for SVC with a linear kernel, hence .toarray()
importance = {f: c for f, c in zip(feature_names, clf.coef_.toarray()[0])}
sorted_importance = dict(sorted(importance.items(), key=lambda item: np.abs(item[1]), reverse=True))
print(sorted_importance)
When a non-linear kernel is used:
from sklearn.inspection import permutation_importance

clf = svm.SVC(kernel='rbf')
clf.fit(X, y)
# No per-feature coefficients exist for an RBF kernel, so measure
# importance by permuting each feature and checking the score drop.
perm_importance = permutation_importance(clf, X.toarray(), y)
sorted_idx = perm_importance.importances_mean.argsort()
importance = {feature_names[i]: perm_importance.importances_mean[i] for i in sorted_idx}
sorted_importance = dict(sorted(importance.items(), key=lambda item: np.abs(item[1]), reverse=True))
print(sorted_importance)
Answered By - meti