Issue
I'm trying to understand what sklearn does when running a PCA. Unfortunately I don't have much experience with PCA, so my understanding may simply be wrong. Let's take a simple example with the iris dataset:
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
pca = PCA()
pca.fit(X)
Xfit = pca.transform(X)
Xfit now looks like this:
[[-2.68412563e+00, 3.19397247e-01, -2.79148276e-02, -2.26243707e-03], ...
I thought that to get these projected values I basically just need to take the dot product of the original values and the transposed basis vectors/components. So I assumed that this should give the same result:
np.dot(X, np.transpose(pca.components_))
But unfortunately this is the result:
[[ 2.81823951e+00, 5.64634982e+00, -6.59767544e-01, 3.10892758e-02],..
So my question is: why is there a difference? I assume the one from pca.transform(X) is correct and I'm doing something wrong, but what would I need to do if I only have the components and want to calculate the principal component values myself?
Solution
Alright, I've found the issue: I have to mean-center the raw values before applying np.dot. Using pd.DataFrame, which makes mean-centering pretty easy, it looks like this:

import pandas as pd

np.dot(pd.DataFrame(X) - pd.DataFrame(X).mean(), np.transpose(pd.DataFrame(pca.components_)))
and the result is the same as from pca.transform(X):
[[-2.68412563e+00, 3.19397247e-01, -2.79148276e-02, -2.26243707e-03], ...
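Note that pandas isn't strictly needed here: the fitted PCA object already stores the per-feature training mean in its mean_ attribute, so the same centering can be done in plain NumPy. A minimal sketch of this check (assuming the iris setup from above):

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# Fit PCA on the iris data as above
X = datasets.load_iris().data
pca = PCA().fit(X)

# pca.transform(X) centers by the stored training mean (pca.mean_)
# and then projects onto the components, so this reproduces it:
manual = np.dot(X - pca.mean_, pca.components_.T)
print(np.allclose(manual, pca.transform(X)))  # True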
Answered By - spcial