Issue
I have a vocabulary text file where each line is a word. A few words from the vocabulary are shown below:
AccountsAndTransactions_/get/v2/accounts/details_DELETE
AccountsAndTransactions_/get/v2/accounts/details_GET
AccountsAndTransactions_/get/v2/accounts/details_POST
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_DELETE
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_GET
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_POST
Important: AccountsAndTransactions_/get/v2/accounts/details_DELETE is a single word in this problem.
Reading vocabulary from text file:
with open(Path(VOCAB_FILE), "r") as f:
    vocab = f.read().splitlines()
Generating doc_paths:
doc_paths = [f for f in listdir(DOC_DIR) if isfile(join(DOC_DIR, f))]
r = re.compile(".*txt")
doc_paths = list(filter(r.match, doc_paths))
doc_paths = [Path(join(DOC_DIR, i)) for i in doc_paths]
I am running CountVectorizer on the documents:
tf_vectorizer = CountVectorizer(input='filename', lowercase=False, vocabulary=vocab)
tf = tf_vectorizer.fit_transform(doc_paths)  # doc_paths is a list of pathlib.Path objects
X = tf.toarray()  # comes back as an all-zero matrix
The issue is that all the values in X are zero, even though the corpus documents are not empty.
Could someone help me? I want the term frequency of every word in vocabulary for each document.
Solution
I solved this problem by overriding the default analyzer of CountVectorizer:
def analyzer_custom(doc):
    return doc.split()

tf_vectorizer = CountVectorizer(input='filename',
                                lowercase=False,
                                vocabulary=vocab,
                                analyzer=analyzer_custom)
Thanks to @Chris for explaining the internal details of CountVectorizer.
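For anyone wondering why the counts were all zero: the default word analyzer tokenizes with the documented token_pattern r"(?u)\b\w\w+\b", which breaks text at characters such as "/", "{" and "}". A vocabulary entry containing those characters is therefore never produced as a single token, so nothing matches. The sketch below reproduces only that tokenization step with the standard re module (it does not call scikit-learn), and the document text in it is made up for illustration:

```python
import re
from collections import Counter

# Default token_pattern documented for CountVectorizer(analyzer="word").
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

vocab = [
    "AccountsAndTransactions_/get/v2/accounts/details_GET",
    "AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_GET",
]

# Hypothetical document content, invented for this illustration.
doc = (
    "AccountsAndTransactions_/get/v2/accounts/details_GET "
    "AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_GET "
    "AccountsAndTransactions_/get/v2/accounts/details_GET"
)

# What the default pattern yields: fragments split at "/", "{" and "}".
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, doc)

# What the custom analyzer from the answer yields: whitespace-separated words.
whitespace_tokens = doc.split()

# Count only tokens that are vocabulary entries, as a fixed vocabulary would.
default_counts = Counter(t for t in default_tokens if t in vocab)
whitespace_counts = Counter(t for t in whitespace_tokens if t in vocab)

print(default_tokens[:5])
print(dict(default_counts))
print(dict(whitespace_counts))
```

With the default pattern no token equals a vocabulary entry, while splitting on whitespace keeps each entry intact. Passing token_pattern=r"\S+" instead of a custom analyzer should presumably have a similar effect, but I have not verified that against this corpus.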
Answered By - Kaushal Kishore