Issue
I used sklearn's CountVectorizer() to create the count matrix and found that 57% of its entries are zeros. In some examples online, the sparse matrix has only 30% zeros. I want to know the impact of the level of sparsity: is it better, worse, or irrelevant to have fewer zeros in the sparse matrix? What can we conclude from this observation?
Solution
In fact, 30% or even 57% zeros is not high sparsity. So in your case, it is safe to ignore the sparsity and treat your matrix as if it were dense.
Really high sparsity is something like 99.99% zeros. It occurs in problems like recommender systems, where there are thousands or even millions of items but each user has interacted with only a few of them. Another case is very short texts (e.g. tweets or dialogue turns) combined with a very large vocabulary (maybe even a multilingual one).
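As a quick sanity check, you can measure the sparsity of a CountVectorizer matrix directly; this is a minimal sketch using a toy three-document corpus (the actual percentage depends entirely on your own texts and vocabulary):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; real sparsity depends on your own documents and vocabulary.
docs = [
    "sparse matrices store only nonzero entries",
    "recommender systems often have very sparse data",
    "short texts with a large vocabulary are sparse too",
]

# fit_transform returns a scipy.sparse CSR matrix, not a dense array
X = CountVectorizer().fit_transform(docs)

# Fraction of zero entries = 1 - (number of nonzeros / total number of cells)
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print(f"{sparsity:.1%} of entries are zeros")
```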
If the feature matrix has really high sparsity, it means that:
- If you want to store your matrix efficiently or make fast computations with it, you may want to use an algorithm that explicitly supports scipy's sparse matrices.
- The feature space is probably high-dimensional, and some features are likely highly correlated with each other. Therefore, you might find dimensionality reduction useful to make your model more tractable and generalize better. You can implement this dimensionality reduction with matrix decomposition techniques (e.g. PCA) or a neural embedding layer, or you can use pre-trained word embeddings and aggregate them somehow to represent your document.
In general, the optimal way to represent your documents depends on the problem you are ultimately trying to solve. For some problems (e.g. text classification with a large training set) a high-dimensional sparse representation might be optimal; for others (e.g. similarity of short texts, or text classification with a small labeled training set) a low-dimensional dense representation would be better.
Answered By - David Dale