Issue
Consider:
from sklearn.decomposition import PCA
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def do_pca(X, n_components):
    print(f"Doing PCA from {X.shape[0]} vectors")
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    print('Explained variance:',
          sum(pca.explained_variance_ratio_[:n_components]))
    return pca

def by_relevance(vectors, key_vector):
    rankings = [
        (i, cosine_similarity(v.reshape(1, -1), key_vector.reshape(1, -1)))
        for i, v in enumerate(vectors)]
    rankings.sort(key=lambda el: -el[1])
    for i, r in rankings:
        print(i, r)

np.random.seed(1)
X = np.random.random((50, 20))
pca = do_pca(X, n_components=20)
X = X[:5]
key_vector = X[[0]]
by_relevance(X, key_vector)
print()
by_relevance(pca.transform(X), pca.transform(key_vector))
This code performs PCA on 50 vectors, keeping as many components as there are dimensions. The function by_relevance sorts vectors by cosine similarity to a given key vector; it is called twice, once before and once after the PCA transformation. Since all components are kept, I would expect similar rankings from both invocations. However, this is the output:
Doing PCA from 50 vectors
Explained variance: 1.0
0 [[1.]]
4 [[0.78738484]]
1 [[0.73532448]]
3 [[0.71538191]]
2 [[0.6614021]]
0 [[1.]]
4 [[0.01659682]]
3 [[-0.02417426]]
1 [[-0.02855172]]
2 [[-0.03232985]]
Why is the ranking affected, and why did the last four similarities become so small?
Solution
The cosine similarity between random samples should be close to zero. The reason it is so high for your untransformed data is that all points lie in the positive orthant: every coordinate is drawn from [0, 1), so all vectors point in broadly the same direction. PCA, however, centres the data before projecting, and cosine similarity is not invariant to translation, which is why the ranking changes after the transform. If you de-mean your data before computing the PCs, the results are much more similar between the two conditions, namely a cosine similarity of 1 for the matching vector and values close to zero everywhere else.
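To see how strong this orthant effect is, here is a minimal sketch (using a fresh random matrix, not the one from the question) that compares the mean pairwise cosine similarity before and after de-meaning:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((50, 20))        # every coordinate in [0, 1): positive orthant

mask = ~np.eye(50, dtype=bool)  # ignore the diagonal (self-similarity is 1)
print(cosine_similarity(X)[mask].mean())   # roughly 0.75, despite "random" data

Xc = X - X.mean(axis=0)         # centre each coordinate
print(cosine_similarity(Xc)[mask].mean())  # close to zero, as expected

Applying the same de-meaning to the original example: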
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

np.random.seed(1)
X = np.random.random((50, 20))
X -= np.full_like(X, 0.5)  # de-mean: coordinates are uniform on [0, 1), so the expected mean is 0.5
pca = PCA(n_components=20).fit(X)
X = X[:5]
key_vector = X[[0]]
print(cosine_similarity(X, key_vector))
print(cosine_similarity(pca.transform(X), pca.transform(key_vector)))
# [[ 1. ]
# [-0.01928803]
# [ 0.0100122 ]
# [-0.06792705]
# [ 0.01940078]]
# [[ 1. ]
# [-0.02855172]
# [-0.03232985]
# [-0.02417426]
# [ 0.01659682]]
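The two blocks still differ slightly because the manual de-meaning subtracts the distribution mean 0.5, whereas PCA subtracts the empirical mean of the sample, which it stores in pca.mean_. With all 20 components kept, the projection itself is just an orthogonal change of basis and preserves dot products and norms of the centred data. Continuing from the snippet above, this can be checked directly (a sketch relying only on the documented mean_ attribute):

# pca.transform subtracts pca.mean_ and applies an orthogonal rotation,
# which leaves cosine similarities of the centred data unchanged.
print(np.allclose(
    cosine_similarity(X - pca.mean_, key_vector - pca.mean_),
    cosine_similarity(pca.transform(X), pca.transform(key_vector)),
))  # True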
Answered By - Paul Brodersen