Issue
Consider:
from sklearn.decomposition import PCA
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def do_pca(X, n_components):
    print(f"Doing PCA from {X.shape[0]} vectors")
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    print('Explained variance:',
          sum(pca.explained_variance_ratio_[:n_components]))
    return pca

def by_relevance(vectors, key_vector):
    rankings = [
        (i, cosine_similarity(v.reshape(1, -1), key_vector.reshape(1, -1)))
        for i, v in enumerate(vectors)]
    rankings.sort(key=lambda el: -el[1])
    for i, r in rankings:
        print(i, r)

np.random.seed(1)
X = np.random.random((50, 20))
pca = do_pca(X, n_components=20)
X = X[:5]
key_vector = X[[0]]
by_relevance(X, key_vector)
print()
by_relevance(pca.transform(X), pca.transform(key_vector))
This code performs PCA on 50 vectors, keeping as many components as there are dimensions. The function by_relevance sorts vectors by cosine similarity to a given key vector; it is called twice, once before and once after the PCA transformation. Since all components are kept, I would expect similar rankings from both invocations. However, this is the output:
Doing PCA from 50 vectors
Explained variance: 1.0
0 [[1.]]
4 [[0.78738484]]
1 [[0.73532448]]
3 [[0.71538191]]
2 [[0.6614021]]
0 [[1.]]
4 [[0.01659682]]
3 [[-0.02417426]]
1 [[-0.02855172]]
2 [[-0.03232985]]
Why is the ranking affected, and why did the last four similarities become so small?
Solution
The cosine similarity between random samples should be close to zero. The reason it is so high for your untransformed data is that all points lie in the positive orthant: every coordinate is drawn from [0, 1), so all vectors point in broadly the same direction. PCA, however, centres the data before projecting, and cosine similarity is not invariant to translation, which is why the ranking changes after the transform. If you de-mean your data before computing the PCs, the results are much more similar between the two conditions, namely a cosine similarity of 1 for the matching vector and values close to zero everywhere else.
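To see how strong this orthant effect is, here is a minimal sketch (using a fresh random matrix, not the one from the question) that compares the mean pairwise cosine similarity before and after de-meaning:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((50, 20))        # every coordinate in [0, 1): positive orthant

mask = ~np.eye(50, dtype=bool)  # ignore the diagonal (self-similarity is 1)
print(cosine_similarity(X)[mask].mean())   # roughly 0.75, despite "random" data

Xc = X - X.mean(axis=0)         # centre each coordinate
print(cosine_similarity(Xc)[mask].mean())  # close to zero, as expected

Applying the same de-meaning to the original example: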
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

np.random.seed(1)
X = np.random.random((50, 20))
X -= np.full_like(X, 0.5)  # de-mean: coordinates are uniform on [0, 1), so the expected mean is 0.5
pca = PCA(n_components=20).fit(X)
X = X[:5]
key_vector = X[[0]]
print(cosine_similarity(X, key_vector))
print(cosine_similarity(pca.transform(X), pca.transform(key_vector)))
# [[ 1. ]
# [-0.01928803]
# [ 0.0100122 ]
# [-0.06792705]
# [ 0.01940078]]
# [[ 1. ]
# [-0.02855172]
# [-0.03232985]
# [-0.02417426]
# [ 0.01659682]]
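The two blocks still differ slightly because the manual de-meaning subtracts the distribution mean 0.5, whereas PCA subtracts the empirical mean of the sample, which it stores in pca.mean_. With all 20 components kept, the projection itself is just an orthogonal change of basis and preserves dot products and norms of the centred data. Continuing from the snippet above, this can be checked directly (a sketch relying only on the documented mean_ attribute):

# pca.transform subtracts pca.mean_ and applies an orthogonal rotation,
# which leaves cosine similarities of the centred data unchanged.
print(np.allclose(
    cosine_similarity(X - pca.mean_, key_vector - pca.mean_),
    cosine_similarity(pca.transform(X), pca.transform(key_vector)),
))  # True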
Answered By - Paul Brodersen