Issue
My question is how to choose the value of n_components in PCA (n_components=?). The background of the project is using machine learning algorithms to predict the stage of a disease. I am using sklearn.
Examples in my project:
PCA (n_components=0.95), the accuracy rate is 0.72. It generated 53 new components.
PCA (n_components=0.55), the accuracy rate is 0.78. It generated 5 new components.
import time
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_clf04 = SVC(kernel="linear", random_state=42)
start = time.process_time()
# Feature scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(rfecv_forest01_x_train01)
# Dimension reduction: keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
x_train_scaled_reduced = pca.fit_transform(x_train_scaled)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print("Components:", pca.n_components_)
svm_clf04.fit(x_train_scaled_reduced, y_train01)
# Out-of-fold predictions from 10-fold cross-validation
pred = cross_val_predict(svm_clf04, x_train_scaled_reduced, y_train01, cv=10)
print("Time: ", time.process_time() - start)
print(confusion_matrix(y_train01, pred))
print(classification_report(y_train01, pred))
Regarding explained variance, some people on the Internet say that 0.95 is the best choice. But when I lower the explained variance, the accuracy increases. Which should I prioritize: an explained variance of 0.95, or higher accuracy?
Solution
I'm not sure you're using PCA correctly. If you look at the docs (I'm assuming this is scikit-learn), you will see that it correctly interprets a float value between 0 and 1 when the solver is full:
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
At the same time, the default solver is auto. I would suggest re-running the PCA while explicitly specifying PCA(n_components=0.95, svd_solver='full').
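As a quick check of what the quoted documentation describes, the sketch below fits PCA with a float n_components and confirms how much variance the kept components retain. The data is synthetic (make_classification with 200 samples and 40 features is an assumption standing in for the asker's data), so the component count it prints will not match the 53 or 5 from the question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data (assumed shape: 200 x 40).
X, _ = make_classification(
    n_samples=200, n_features=40, n_informative=10, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep enough components to exceed that
# fraction of the total variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance carried by the kept components.
retained = pca.explained_variance_ratio_.sum()
print("Components kept:", pca.n_components_)
print("Variance retained:", round(retained, 4))
```

If the retained variance printed here is at or above the requested fraction, the float was interpreted as a variance threshold rather than as a component count.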
Secondly, 0.95 is not the "best choice"; I'm not sure why anyone would suggest it. The number of principal components to keep depends on the problem at hand. For example, if you use PCA to plot multidimensional data, you will want to keep only 2 or 3 components; in most other applications, the problem defines how much of the variance of the data you are ready to forgo for the sake of simplicity.
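When the goal is classification accuracy, as in the question, one possible way to let the problem decide is to treat n_components as a hyperparameter and compare candidates by cross-validation. This is not something the answer above prescribes; it is a minimal sketch under assumptions: the data is synthetic, the candidate values 0.55, 0.75 and 0.95 are arbitrary, and the step names "scale", "pca" and "svm" are made up for illustration. Putting the scaler and PCA inside a Pipeline means they are refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real training data (assumed shape: 200 x 40).
X, y = make_classification(
    n_samples=200, n_features=40, n_informative=10, random_state=0
)

# Scaling and PCA live inside the pipeline, so each CV fold fits them
# on its own training split only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(svd_solver="full")),
    ("svm", SVC(kernel="linear")),
])

# Candidate explained-variance thresholds (arbitrary illustrative values).
param_grid = {"pca__n_components": [0.55, 0.75, 0.95]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best n_components:", search.best_params_["pca__n_components"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
```

Whichever threshold wins on synthetic data says nothing about the asker's dataset; the point is only that the comparison is made on held-out folds rather than by a fixed rule such as 0.95.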
Another option is to plot the cumulative explained variance for 1, 2, 3, ... components and select the point where the graph makes a 'kink', so that adding more components barely increases the overall explained variance.
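The curve described above can be computed from a PCA fitted with all components. The following is a rough sketch on synthetic data (again an assumption, not the asker's dataset); it prints the cumulative values instead of drawing them, to stay free of a plotting dependency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data (assumed shape: 200 x 40).
X, _ = make_classification(
    n_samples=200, n_features=40, n_informative=10, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# With n_components left unset, PCA keeps every component.
pca = PCA().fit(X_scaled)

# cumulative[k - 1] is the variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k, total in enumerate(cumulative[:10], start=1):
    print(f"{k:2d} components: {total:.3f}")
```

Plotting `cumulative` against the component count (for example with matplotlib) gives the graph in which to look for the kink.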
Answered By - pavel