Issue
I'm trying to use scikit-learn's Bernoulli Naive Bayes classifier. I had the classifier working fine on a small data set using CountVectorizer, but ran into trouble when I switched to HashingVectorizer to work with a larger data set. Holding all other parameters (training documents, test documents, classifier and feature extractor settings) constant and just switching from CountVectorizer to HashingVectorizer caused my classifier to always output the same label for all documents.
I wrote the following script to investigate what would be different between the two feature extractors:
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
cv = CountVectorizer(binary=True, decode_error='ignore')
h = HashingVectorizer(binary=True, decode_error='ignore')
with open('moby_dick.txt') as fp:
    doc = fp.read()
# CountVectorizer has to learn a vocabulary first; HashingVectorizer is stateless
cv_result = cv.fit_transform([doc])
h_result = h.transform([doc])
# Print the stored entries and a summary of each sparse matrix
print(cv_result)
print(repr(cv_result))
print(h_result)
print(repr(h_result))
(where 'moby_dick.txt' is the Project Gutenberg copy of Moby Dick)
The (condensed) results:
(0, 17319) 1
(0, 17320) 1
(0, 17321) 1
<1x17322 sparse matrix of type '<type 'numpy.int64'>'
with 17322 stored elements in Compressed Sparse Column format>
(0, 1048456) 0.00763203138591
(0, 1048503) 0.00763203138591
(0, 1048519) 0.00763203138591
<1x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 17168 stored elements in Compressed Sparse Row format>
As you can see, CountVectorizer in binary mode returns the integer 1 for every feature (we only expect to see 1 since there's only one document); HashingVectorizer, on the other hand, returns floats (all the same within a document, but different documents produce a different value). I suspect my issues stem from passing these floats on to BernoulliNB.
Ideally, I would like a way to get the same binary format data from HashingVectorizer as I get from CountVectorizer; failing that, I could use the BernoulliNB binarize parameter if I knew a sane threshold to set for this data, but I am not clear on what those floats represent (they're clearly not token counts, as they're all the same and less than 1).
Any help would be appreciated.
Solution
Under the default settings, HashingVectorizer normalizes your feature vectors to unit Euclidean length:
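>>> import numpy as np
>>> import scipy.linalg
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"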
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027, 0. , 0. , 0. , 0.57735027,
0. , -0.57735027, 0. ]])
>>> scipy.linalg.norm(np.abs(X.toarray()))
1.0
Setting binary=True only postpones this normalization until after binarizing the features, i.e. setting all the non-zero ones to one. You also have to set norm=None to turn it off:
>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5, 0. , 0. , 0. , 0.5, 0.5, 0.5, 0. ]])
>>> scipy.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1., 0., 0., 0., 1., 1., 1., 0.]])
This is also why it's returning float arrays: normalization requires them. While the vectorizer could be rigged to return another dtype, that would require a conversion inside the transform method and probably another one back to float in the next estimator.
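As for what the floats in the question represent: if every non-zero feature is first set to 1 and the row is then scaled to unit Euclidean length, each non-zero entry should end up as 1/sqrt(n), where n is the number of non-zero features in that document. The sketch below only checks that arithmetic against the numbers quoted above; the helper name binary_l2_value is made up for illustration and is not part of scikit-learn.

```python
import math


def binary_l2_value(n_nonzero):
    """Value of each non-zero entry after binarizing a row and scaling it to unit L2 norm.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    # A row of n ones has Euclidean length sqrt(n), so each entry becomes 1/sqrt(n).
    return 1.0 / math.sqrt(n_nonzero)


# Toy example from the answer: 4 non-zero features out of 8.
toy_value = binary_l2_value(4)

# Moby Dick example from the question: 17168 stored elements.
moby_value = binary_l2_value(17168)

print(toy_value)
print(moby_value)
```

This would also explain why the value is constant within a document but changes between documents: it depends only on how many distinct hashed features each document has. For the same reason a single fixed binarize threshold above zero is fragile, which is another argument for norm=None.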
Answered By - Fred Foo