Issue
It seems that to use PCA I have to specify up front how many components to keep, as in the following code:
model = pca(n_components=3, normalize=True)
Is there a way to specify only the target variance and let the algorithm return the most important components?
Solution
You don't need to fix the number of components in advance. You can fit the PCA with all components and then keep the smallest number of leading components whose cumulative explained variance ratio reaches a given fraction. See the code below for an example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_spd_matrix
from sklearn.preprocessing import StandardScaler
# generate the data
np.random.seed(100)
N = 1000 # number of samples
K = 10 # number of features
mean = np.zeros(K)
cov = make_spd_matrix(K)
X = np.random.multivariate_normal(mean, cov, N)
print(X.shape)
# (1000, 10)
# rescale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
# perform the PCA
pca = PCA(n_components=None)
pca.fit(X)
# find the smallest number of components which explain
# at least a fraction p (e.g. 0.80, i.e. 80%) of the variance
p = 0.80
n_components = 1 + np.argmax(np.cumsum(pca.explained_variance_ratio_) >= p)
print(n_components)
# 6
# extract the values of the selected components
Z = pca.transform(X)[:, :n_components]
print(Z.shape)
# (1000, 6)
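As a possible shortcut, scikit-learn's PCA also accepts a float between 0 and 1 as n_components when svd_solver="full"; in that case it keeps the number of components needed for the explained variance to exceed that fraction. The sketch below is a minimal illustration on the same kind of synthetic data; the names pca_p and Z_p are only illustrative, and the result should match the manual selection above except in the unlikely case where the cumulative ratio equals p exactly.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# generate and rescale synthetic data (1000 samples, 10 features)
np.random.seed(100)
cov = make_spd_matrix(10, random_state=100)
X = np.random.multivariate_normal(np.zeros(10), cov, 1000)
X = StandardScaler().fit_transform(X)

# pass the target fraction of variance directly as n_components;
# svd_solver="full" is set explicitly because the float form relies on it
p = 0.80
pca_p = PCA(n_components=p, svd_solver="full")
Z_p = pca_p.fit_transform(X)

# number of components that were kept
print(pca_p.n_components_)
# fraction of variance explained by the kept components
print(pca_p.explained_variance_ratio_.sum())
# shape of the reduced data: (n_samples, n_components_)
print(Z_p.shape)
```

This avoids slicing the transformed data by hand, at the cost of having to refit if you later want a different threshold.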
Answered By - Flavia Giammarino