Issue
I'm struggling to understand how to find the most informative features used for text classification.
I've been trying two methods found here on Stack Overflow.
The first:
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

class_labels = clf.classes_
And the second:
def printNMostInformative(vectorizer, clf, N):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)
In both cases I only get the accuracy and an empty list:
accuracy: 0.37922705314009664
Top 10 features used to predict:
Class 1 best:
(0.008202041988712563, '')
Class 2 best:
(0.008202041988712563, '')
The entire code, if needed, is really similar to this notebook, since it's an update of it with a new dataset: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/machine%20learning%20spaCy.ipynb
Solution
For a linear SVM it can be done like below:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
y = [1, 0, 0, 1]  # labels
feature_names = vectorizer.get_feature_names_out()

clf = svm.SVC(kernel='linear')
clf.fit(X, y)
# coef_ is sparse for SVC with a linear kernel, hence .toarray()
importance = {f: c for f, c in zip(feature_names, clf.coef_.toarray()[0])}
sorted_importance = dict(sorted(importance.items(), key=lambda item: np.abs(item[1]), reverse=True))
print(sorted_importance)
When a non-linear kernel is used:
from sklearn.inspection import permutation_importance

clf = svm.SVC(kernel='rbf')
clf.fit(X, y)
# No per-feature coefficients exist for an RBF kernel, so measure
# importance by permuting each feature and checking the score drop.
perm_importance = permutation_importance(clf, X.toarray(), y)
sorted_idx = perm_importance.importances_mean.argsort()
importance = {feature_names[i]: perm_importance.importances_mean[i] for i in sorted_idx}
sorted_importance = dict(sorted(importance.items(), key=lambda item: np.abs(item[1]), reverse=True))
print(sorted_importance)
Answered By - meti