Issue
My question mainly comes from this post :https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
In the article, the author plotted the vector direction and length of each variable. Based on my understanding, after performing PCA. All we get are the eigenvectors and eigenvalues. For a dataset which has a dimension M x N, each eigenvalue should be a vector as 1 x N. So, my question is maybe the length of the vector is the eigenvalue, but how to find the direction of the vector for each variable mathematical? And what is the physical meaning of the length of the vector?
Also, if it is possible, can I do similar work with scikit PCA function in python?
Thanks!
Solution
This plot is called biplot and it is very useful to understand the PCA results. The length of the vectors it is just the values that each feature/variable has on each Principal Component aka PCA loadings.
Example:
These loadings as accessible through print(pca.components_)
. Using the Iris Dataset the loadings are:
[[ 0.52106591, -0.26934744, 0.5804131 , 0.56485654],
[ 0.37741762, 0.92329566, 0.02449161, 0.06694199],
[-0.71956635, 0.24438178, 0.14212637, 0.63427274],
[-0.26128628, 0.12350962, 0.80144925, -0.52359713]])
Here, each row is one PC and each column corresponds to one variable/feature. So feature/variable 1, has a value 0.52106591 on the PC1 and 0.37741762 on the PC2. These are the values used to plot the vectors that you saw in the biplot. See below the coordinates of Var1
. It's exactly those (above) values !!
Finally, to create this plot in python you can use this using sklearn
:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score,coeff,labels=None):
xs = score[:,0]
ys = score[:,1]
n = coeff.shape[0]
plt.scatter(xs ,ys, c = y) #without scaling
for i in range(n):
plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
if labels is None:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
else:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_.T)
plt.show()
See also this post: https://stackoverflow.com/a/50845697/5025009
and
Answered By - seralouk
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.