Issue
Working primarily from this paper, I want to implement the various PCA interpretation metrics it describes, for example the squared cosine and what the article calls the contribution.
However, the nomenclature seems very confusing; namely, it's not clear to me what exactly sklearn's pca.components_ is. I've seen some answers here and in various blogs stating that these are the loadings, while others state they are the component scores (which I assume is the same thing as the factor scores).
The paper defines the contribution of observation i to component l as:

ctr_{i,l} = f_{i,l}^2 / lambda_l

where f_{i,l} is the factor score of observation i on component l and lambda_l is the eigenvalue of component l,
and states that the contributions to each component must add up to 1. That is not the case if I assume that pca.explained_variance_ contains the eigenvalues and pca.components_ contains the factor scores:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

pca = PCA(n_components=2)
X = pca.fit_transform(df)

# Attempted contributions: squared "factor scores" divided by the eigenvalues
ctr = (pd.DataFrame(pca.components_.T**2)).div(pca.explained_variance_)
np.sum(ctr, axis=0)
# Yields 0.498437 and 0.725048 instead of 1 and 1
How can I calculate these metrics? The paper defines the squared cosine similarly as:

cos^2_{i,l} = f_{i,l}^2 / d_i^2, with d_i^2 = sum_l f_{i,l}^2

i.e., the squared factor score divided by the squared distance of observation i to the center of gravity.
Solution
This paper does not play well with sklearn as far as definitions are concerned.
The rows of pca.components_ are the two principal axes of your data after centering. And pca.fit_transform(df) gives you the coordinates of your centered data set with respect to those axes, i.e., the factor scores:
> pca.fit_transform(df)
array([[ 0.60781787, -0.00280834],
[-0.1601333 , -0.01246807],
[-0.11667497, 0.04584743],
[-0.1655048 , -0.01528551],
[-0.1655048 , -0.01528551]])
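One way to convince yourself of this: the factor scores are exactly the centered data projected onto the principal axes. A minimal sketch, re-creating the df from the question:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Data from the question
df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

pca = PCA(n_components=2)
X = pca.fit_transform(df)

# Factor scores = centered data projected onto the rows of pca.components_
X_manual = (df - df.mean()).to_numpy() @ pca.components_.T
print(np.allclose(X, X_manual))  # True
```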
Next, the lambda_l of equation (10) in the paper is just the sum of the squares of the factor scores for the l-th component, i.e., of the l-th column of pca.fit_transform(df). But pca.explained_variance_ gives you the two variances, and since sklearn uses len(df.index) - 1 degrees of freedom, we have lambda_l == (len(df.index) - 1) * pca.explained_variance_[l].
> X = pca.fit_transform(df)
> lmbda = np.sum(X**2, axis = 0)
> lmbda
array([0.46348196, 0.00273262])
> (5-1) * pca.explained_variance_
array([0.46348196, 0.00273262])
Thus, in summary, I recommend computing the contributions as:
> ctr = X**2 / np.sum(X**2, axis = 0)
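With this definition, the contributions behave as the paper requires: summing over the observations (rows) for each component gives exactly 1. A quick sanity check, re-using the df from the question:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

X = PCA(n_components=2).fit_transform(df)

# Contribution of observation i to component l: f_il**2 / lambda_l
ctr = X**2 / np.sum(X**2, axis=0)
print(np.sum(ctr, axis=0))  # [1. 1.]
```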
For the squared cosine it's the same except that we sum over the rows of pca.fit_transform(df)
:
> cos_sq = X**2 / np.sum(X**2, axis = 1)[:, np.newaxis]
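Analogously, each observation's squared cosines then sum to 1 over the retained components. Note that with n_components=2 this holds by construction; if you want the paper's d_i^2 (the squared distance over all components) in the denominator, fit with the full number of components first. A quick check on the question's data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

X = PCA(n_components=2).fit_transform(df)

# Squared cosine: f_il**2 divided by the row sum of squared factor scores
cos_sq = X**2 / np.sum(X**2, axis=1)[:, np.newaxis]
print(np.sum(cos_sq, axis=1))  # [1. 1. 1. 1. 1.]
```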
Answered By - frank