Issue
I'm doing k-fold cross-validation in scikit-learn. Here is the script:
import pandas as pd
import numpy as np
from time import time
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold

r_filenameTSV = "TSV/A19784.tsv"

# DF 300 dimension start
tsv_read = pd.read_csv(r_filenameTSV, sep='\t', names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(" ", 1).tolist(), columns=['label', 'vector'])
print(df)
# DF 300 dimension end

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1).ravel()
print(y.shape)

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)

start = time()

clf = svm.SVC(kernel='rbf',
              C=32,
              gamma=8)

print("K-Folds scores:")

originalclass = []
predictedclass = []

def classification_report_with_accuracy_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred)  # return accuracy score

inner_cv = StratifiedKFold(n_splits=10)
outer_cv = StratifiedKFold(n_splits=10)

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv,
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Average values in classification report for all folds in a K-fold Cross-validation
print(classification_report(originalclass, predictedclass))

print("10 folds processing seconds: {}".format(time() - start))
As you can see, I'm using a Pandas DataFrame with 300 features as the input data.
How can I reduce the features from 300 to 100?
Does everything have to be done in Pandas (i.e. creating a DataFrame with at most 100 features per record), or can I use scikit-learn directly?
Solution
There are many ways to reduce the number of features in ML models; here are some of them:
- Use statistical (filter) methods such as Information Gain or the Fisher Score: compute the score between each feature and the target, then select the top 100 (see the SelectKBest sketch after this list).
- Remove constant or quasi-constant features.
- Use wrapper methods such as forward feature selection and backward feature selection; the idea is to search the feature space and choose the best combination. For this you can use mlxtend.feature_selection, a package that is largely compatible with scikit-learn (see the sketch below).
- Use projection methods such as PCA, LDA, etc. (see the PCA sketch below).
- Use embedded methods such as Lasso, Ridge, or Random Forest: from scikit-learn's sklearn.feature_selection module, import SelectFromModel (see the sketch below).
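A minimal sketch of the first two options (a filter method), assuming X and y are the DataFrame and label array built in the script above; VarianceThreshold drops constant columns, and SelectKBest with mutual_info_classif (an information-gain-style score) keeps the 100 highest-scoring features:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

X_num = X.astype(float)  # the parsed vector values are strings, so cast first

# Drop constant (zero-variance) columns; the threshold is an assumption, raise it for quasi-constant features
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X_num)

# Score each remaining feature against the target and keep the 100 best
selector = SelectKBest(score_func=mutual_info_classif, k=100)
X_100 = selector.fit_transform(X_var, y)
print(X_100.shape)  # (n_samples, 100)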
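A sketch of the wrapper approach with mlxtend's SequentialFeatureSelector (forward selection). Note that it refits the estimator many times, so with 300 features and the RBF SVC from the question it can be slow; the k_features=100 and cv=5 values are assumptions to adjust:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.svm import SVC

sfs = SFS(SVC(kernel='rbf', C=32, gamma=8),
          k_features=100,   # stop once 100 features are selected
          forward=True,     # forward selection; False would do backward elimination
          floating=False,
          scoring='accuracy',
          cv=5,
          n_jobs=-1)
sfs = sfs.fit(X.astype(float), y)
X_wrapped = X.astype(float).iloc[:, list(sfs.k_feature_idx_)]  # keep the chosen columns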
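A PCA sketch for the projection option, again using the X from the question; keep in mind that PCA builds 100 new components rather than picking 100 of the original columns:

from sklearn.decomposition import PCA

pca = PCA(n_components=100)
X_pca = pca.fit_transform(X.astype(float))
print(X_pca.shape)                           # (n_samples, 100)
print(pca.explained_variance_ratio_.sum())   # variance kept by the 100 components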
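Finally, a sketch of the embedded approach with SelectFromModel; the L1-penalised LinearSVC used here is only one possible estimator (a RandomForestClassifier or Lasso works the same way), and setting threshold=-np.inf with max_features=100 makes it keep exactly the 100 strongest features:

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Fit a sparse linear model and keep the 100 features with the largest coefficients
l1_svc = LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=10000)
sfm = SelectFromModel(l1_svc, max_features=100, threshold=-np.inf)
X_embedded = sfm.fit_transform(X.astype(float), y)
print(X_embedded.shape)  # (n_samples, 100)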
Answered By - h a