Issue
I have a word2vec model abuse_model trained with Gensim. I want to apply PCA and make a plot of only CERTAIN words that I care about (vs. all words in the model). Therefore, I created a dict d whose keys are the words I care about and whose values are the corresponding word vectors.
vocab = list(abuse_model.wv.key_to_index)
vocab = [v for v in vocab if v in positive_terms]
d = {}
for word in vocab:
    d[word] = abuse_model.wv[word]
No errors so far.
I encountered an error when passing the dict into pca.fit_transform. I'm new to this and am wondering if the data format I passed in (a list of tuples) is incorrect. What data type does the argument have to be?
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
result = pca.fit_transform(list(d.items()))
Thanks in advance!
Solution
Per the scikit-learn docs – https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform – the argument to .fit_transform(), as is usual for scikit-learn models, is "array-like of shape (n_samples, n_features)".
Here, that'd mean your samples/rows are words, and your features/columns are the word-vector dimensions. And you'll want to remember, outside of the PCA object, which words correspond to which rows. (In Python 3.7+, the fact that your d dict will always iterate in the order of insertion should have you covered there.)
So, it may be enough to change your use of .items() to .values(), so that you wind up supplying PCA with a list (which is suitably array-like) of vectors.
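For example, a minimal sketch of that fix, reusing the d dict from the question (the order of d.keys() matches the row order of the result):
from sklearn.decomposition import PCA

words = list(d.keys())    # row labels, in insertion order
pca = PCA(n_components=2)
result = pca.fit_transform(list(d.values()))
# result[i] is now the 2-D projection of the word words[i]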
A few other notes:
- The .key_to_index property is a dict, and iterating it yields its keys, so you don't need the list() conversion/copy before filtering.
- If your positive_terms is a large list, changing it to a set could offer faster in membership-testing.
- Rather than using a d dict, which involves a little more overhead (including when you then make a list of its values), if your sets-of-words and vectors are large, you might want to preallocate a numpy array of the right size and collect your vectors in it. For example:
import numpy as np

X = np.empty((len(vocab), abuse_model.wv.vector_size))
for i, word in enumerate(vocab):
    X[i] = abuse_model.wv[word]
# ...
result = pca.fit_transform(X)
- Even though your hunch is that you only want the dimensionality-reduction on your subset of words, you may also want to try keeping all words, or some random subset of other words – it might help retain some of the original structure that your subsampling would otherwise have prematurely removed. (Unsure of this; just noting it could be a factor.) Even if you do the PCA on a larger set of words, you could still choose to plot/analyze only your desired subset later, for clarity – see the sketch after this list.
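Putting those notes together, here's a minimal sketch of that last idea, assuming the abuse_model and positive_terms names from the question: fit the PCA on all of the model's vectors, then plot only the words you care about.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

positive_set = set(positive_terms)  # set gives fast 'in' membership tests

# fit on ALL word-vectors; .vectors is the full (n_words, n_dims) array
pca = PCA(n_components=2)
all_2d = pca.fit_transform(abuse_model.wv.vectors)

# rows of .vectors line up with .index_to_key, so plot just the subset
for i, word in enumerate(abuse_model.wv.index_to_key):
    if word in positive_set:
        plt.scatter(all_2d[i, 0], all_2d[i, 1])
        plt.annotate(word, (all_2d[i, 0], all_2d[i, 1]))
plt.show()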
Answered By - gojomo