Issue
I'm doing k-fold cross-validation in scikit-learn. Here is the script:
import pandas as pd
import numpy as np
from time import time
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold

r_filenameTSV = "TSV/A19784.tsv"

# DF 300 dimension start
tsv_read = pd.read_csv(r_filenameTSV, sep='\t', names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(" ", 1).tolist(), columns=['label', 'vector'])
print(df)
# DF 300 dimension end

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1).ravel()
print(y.shape)

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)

start = time()

clf = svm.SVC(kernel='rbf',
              C=32,
              gamma=8)

print("K-Folds scores:")

originalclass = []
predictedclass = []

def classification_report_with_accuracy_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred)  # return accuracy score

inner_cv = StratifiedKFold(n_splits=10)
outer_cv = StratifiedKFold(n_splits=10)

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv,
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Average values in classification report for all folds in a K-fold Cross-validation
print(classification_report(originalclass, predictedclass))

print("10 folds processing seconds: {}".format(time() - start))
As you can see, I'm using a Pandas DataFrame with 300 features as the input data.
How can I reduce the features from 300 to 100?
Does everything have to be done in Pandas (i.e. creating a DataFrame with at most 100 features per record), or can I use scikit-learn directly?
Solution
There are many ways to reduce the number of features in ML models; here are some of them:
- Use statistical (filter) methods such as Information Gain or the Fisher Score: compute the score between each feature and the target, then select the top 100 (see the SelectKBest sketch after this list).
- Remove constant or quasi-constant features.
- Use wrapper methods such as forward feature selection and backward feature selection; the idea is to search the feature space and choose the best combination. For this you can use mlxtend.feature_selection, a package that is largely compatible with scikit-learn (see the sketch below).
- Use projection methods such as PCA, LDA, etc. (see the PCA sketch below).
- Use embedded methods such as Lasso, Ridge, or Random Forest: from scikit-learn's sklearn.feature_selection module, import SelectFromModel (see the sketch below).
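A minimal sketch of the first two options (a filter method), assuming X and y are the DataFrame and label array built in the script above; VarianceThreshold drops constant columns, and SelectKBest with mutual_info_classif (an information-gain-style score) keeps the 100 highest-scoring features:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

X_num = X.astype(float)  # the parsed vector values are strings, so cast first

# Drop constant (zero-variance) columns; the threshold is an assumption, raise it for quasi-constant features
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X_num)

# Score each remaining feature against the target and keep the 100 best
selector = SelectKBest(score_func=mutual_info_classif, k=100)
X_100 = selector.fit_transform(X_var, y)
print(X_100.shape)  # (n_samples, 100)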
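A sketch of the wrapper approach with mlxtend's SequentialFeatureSelector (forward selection). Note that it refits the estimator many times, so with 300 features and the RBF SVC from the question it can be slow; the k_features=100 and cv=5 values are assumptions to adjust:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.svm import SVC

sfs = SFS(SVC(kernel='rbf', C=32, gamma=8),
          k_features=100,   # stop once 100 features are selected
          forward=True,     # forward selection; False would do backward elimination
          floating=False,
          scoring='accuracy',
          cv=5,
          n_jobs=-1)
sfs = sfs.fit(X.astype(float), y)
X_wrapped = X.astype(float).iloc[:, list(sfs.k_feature_idx_)]  # keep the chosen columns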
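A PCA sketch for the projection option, again using the X from the question; keep in mind that PCA builds 100 new components rather than picking 100 of the original columns:

from sklearn.decomposition import PCA

pca = PCA(n_components=100)
X_pca = pca.fit_transform(X.astype(float))
print(X_pca.shape)                           # (n_samples, 100)
print(pca.explained_variance_ratio_.sum())   # variance kept by the 100 components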
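Finally, a sketch of the embedded approach with SelectFromModel; the L1-penalised LinearSVC used here is only one possible estimator (a RandomForestClassifier or Lasso works the same way), and setting threshold=-np.inf with max_features=100 makes it keep exactly the 100 strongest features:

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Fit a sparse linear model and keep the 100 features with the largest coefficients
l1_svc = LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=10000)
sfm = SelectFromModel(l1_svc, max_features=100, threshold=-np.inf)
X_embedded = sfm.fit_transform(X.astype(float), y)
print(X_embedded.shape)  # (n_samples, 100)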
Answered By - h a