Issue
Working primarily from this paper, I want to implement the various PCA interpretation metrics it describes, for example the squared cosine and what the article calls the contribution.
However, the nomenclature seems very confusing; namely, it's not clear to me what exactly sklearn's pca.components_ is. I've seen some answers here and in various blogs stating that these are the loadings, while others state they are the component scores (which I assume is the same thing as the factor scores).
The paper defines the contribution of observation i to component l as:

ctr_{i,l} = f_{i,l}^2 / lambda_l

where f_{i,l} is the factor score of observation i on component l and lambda_l is the eigenvalue of component l,
and states that the contributions to each component must add up to 1. That is not the case if I assume that pca.explained_variance_ contains the eigenvalues and pca.components_ contains the factor scores:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

pca = PCA(n_components=2)
X = pca.fit_transform(df)

# Attempted contributions: squared "factor scores" divided by the eigenvalues
ctr = (pd.DataFrame(pca.components_.T**2)).div(pca.explained_variance_)
np.sum(ctr, axis=0)
# Yields 0.498437 and 0.725048 instead of 1 and 1
How can I calculate these metrics? The paper defines the squared cosine similarly as:

cos^2_{i,l} = f_{i,l}^2 / d_i^2, with d_i^2 = sum_l f_{i,l}^2

i.e., the squared factor score divided by the squared distance of observation i to the center of gravity.
Solution
This paper does not play well with sklearn as far as definitions are concerned.
The rows of pca.components_ are the two principal axes of your data after centering. And pca.fit_transform(df) gives you the coordinates of your centered data set with respect to those axes, i.e., the factor scores:
> pca.fit_transform(df)
array([[ 0.60781787, -0.00280834],
[-0.1601333 , -0.01246807],
[-0.11667497, 0.04584743],
[-0.1655048 , -0.01528551],
[-0.1655048 , -0.01528551]])
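One way to convince yourself of this: the factor scores are exactly the centered data projected onto the principal axes. A minimal sketch, re-creating the df from the question:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Data from the question
df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

pca = PCA(n_components=2)
X = pca.fit_transform(df)

# Factor scores = centered data projected onto the rows of pca.components_
X_manual = (df - df.mean()).to_numpy() @ pca.components_.T
print(np.allclose(X, X_manual))  # True
```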
Next, the lambda_l of equation (10) in the paper is just the sum of the squares of the factor scores for the l-th component, i.e., of the l-th column of pca.fit_transform(df). But pca.explained_variance_ gives you the two variances, and since sklearn uses len(df.index) - 1 degrees of freedom, we have lambda_l == (len(df.index) - 1) * pca.explained_variance_[l].
> X = pca.fit_transform(df)
> lmbda = np.sum(X**2, axis = 0)
> lmbda
array([0.46348196, 0.00273262])
> (5-1) * pca.explained_variance_
array([0.46348196, 0.00273262])
Thus, in summary, I recommend computing the contributions as:
> ctr = X**2 / np.sum(X**2, axis = 0)
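With this definition, the contributions behave as the paper requires: summing over the observations (rows) for each component gives exactly 1. A quick sanity check, re-using the df from the question:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

X = PCA(n_components=2).fit_transform(df)

# Contribution of observation i to component l: f_il**2 / lambda_l
ctr = X**2 / np.sum(X**2, axis=0)
print(np.sum(ctr, axis=0))  # [1. 1.]
```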
For the squared cosine it's the same except that we sum over the rows of pca.fit_transform(df)
:
> cos_sq = X**2 / np.sum(X**2, axis = 1)[:, np.newaxis]
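Analogously, each observation's squared cosines then sum to 1 over the retained components. Note that with n_components=2 this holds by construction; if you want the paper's d_i^2 (the squared distance over all components) in the denominator, fit with the full number of components first. A quick check on the question's data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data=[
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns=["MeS", "EtS", "PrS"])

X = PCA(n_components=2).fit_transform(df)

# Squared cosine: f_il**2 divided by the row sum of squared factor scores
cos_sq = X**2 / np.sum(X**2, axis=1)[:, np.newaxis]
print(np.sum(cos_sq, axis=1))  # [1. 1. 1. 1. 1.]
```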
Answered By - frank