Issue
I trained my model using TfidfVectorizer and MultinomialNB and saved it to a pickle file.
Now that I am trying to use the classifier from another file to predict on unseen data, I cannot, because it tells me that the number of features of the classifier is not the same as the number of features of my current corpus.
This is the code where I am trying to predict. The function do_vectorize is exactly the same one used in training.
def do_vectorize(data, stop_words=[], tokenizer_fn=tokenize):
    vectorizer = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenizer_fn)
    X = vectorizer.fit_transform(data)
    return X, vectorizer
# Vectorizing the unseen documents
matrix, vectorizer = do_vectorize(corpus, stop_words=stop_words)
# Predicting on the trained model
clf = pickle.load(open('../data/classifier_0.5_function.pkl', 'rb'))
predictions = clf.predict(matrix)
However, I receive an error saying that the number of features is different:
ValueError: Expected input with 65264 features, got 472546 instead
Does this mean I also have to save my vocabulary from training in order to test? What will happen if there are terms that did not exist in training?
I tried to use pipelines from scikit-learn with the same vectorizer and classifier, and the same parameters for both. However, it became much slower, going from 1 hour to more than 6 hours, so I prefer to do it manually.
Solution
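Does this mean I also have to save my vocabulary from training in order to test?
Yes, you have to save the whole fitted TF-IDF vectorizer, which in particular means saving its vocabulary (and the learned IDF weights). At prediction time, load that vectorizer and call transform on the new documents. Calling fit_transform again, as do_vectorize does, builds a new vocabulary from the unseen corpus, which is why the feature counts differ (65264 vs. 472546).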
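A minimal sketch of that workflow, assuming a toy corpus and default vectorizer settings (the documents, labels, and the dictionary keys are made up for illustration; the original code also passes stop_words and a custom tokenizer, which are omitted here). The objects are pickled to an in-memory byte string to keep the example self-contained; writing to a file with pickle.dump works the same way.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical, for illustration only)
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "monday project meeting notes",
]
train_labels = ["spam", "ham", "spam", "ham"]

# Training: fit the vectorizer once, on the training corpus only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, train_labels)

# Persist the vectorizer together with the classifier
blob = pickle.dumps({"vectorizer": vectorizer, "classifier": clf})

# Prediction (possibly in another file): load both, then use transform,
# NOT fit_transform, so the training vocabulary is reused
saved = pickle.loads(blob)
unseen_docs = ["buy pills now", "notes from the monday meeting"]
X_new = saved["vectorizer"].transform(unseen_docs)
predictions = saved["classifier"].predict(X_new)

# The unseen matrix has the same number of columns as the training matrix
print(X_train.shape[1], X_new.shape[1])
```

One caveat when a custom tokenizer function is passed: pickle stores functions by reference, so the same function has to be importable under the same module path in the file that loads the vectorizer.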
What will happen if there are terms that did not exist in training?
They will be ignored. This makes sense: the model has no training data about those terms, so there is nothing it could take into account. (More complex methods can still make use of unseen terms, but they do not rely on something as simple as TF-IDF.)
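This behaviour can be checked directly on a fitted vectorizer. The sketch below uses two made-up training sentences and default settings; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["the cat sat", "the dog barked"])

# vocabulary_ maps each training term to its column index
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# "zebra" and "galloped" were never seen during fit, so they get no column
row = vectorizer.transform(["the zebra galloped"])
kept_terms = [index_to_term[j] for j in row.nonzero()[1]]
print(kept_terms)  # only terms from the training vocabulary remain

# A document made only of unseen terms becomes an all-zero row
all_unknown = vectorizer.transform(["zebra galloped"])
print(all_unknown.shape)
```

The number of columns stays fixed at the size of the training vocabulary no matter what the new documents contain, which is what the classifier expects.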
I tried to use pipelines from scikit-learn with the same vectorizer and classifier, and the same parameters for both. However, it became much slower, going from 1 hour to more than 6 hours, so I prefer to do it manually.
There should be little to no overhead when using pipelines. However, doing things manually is fine as long as you remember to store the vectorizer as well.
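For comparison, here is a rough sketch of the pipeline variant, again on a made-up toy corpus with default parameters (the step names "tfidf" and "nb" are arbitrary). Because the pipeline holds both steps, pickling it stores the vectorizer and the classifier in one object.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical, for illustration only)
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "monday project meeting notes",
]
train_labels = ["spam", "ham", "spam", "ham"]

# The pipeline fits the vectorizer and the classifier in one call
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(train_docs, train_labels)

# One pickle now contains both steps
restored = pickle.loads(pickle.dumps(model))

# predict takes raw text; the stored vectorizer's transform is applied internally
predictions = restored.predict(["buy cheap pills"])
print(len(predictions))
```

This does not explain the slowdown reported in the question; it only shows that the pipeline performs the same fit and transform steps as the manual version.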
Answered By - lejlot