Issue
I have a dataset that contains 17 features (x) and binary classification results (y). I have already prepared the dataset and performed train_test_split() on it. I'm using the following script to run different ML algorithms on the dataset and compare them:
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

def run_exps(X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> pd.DataFrame:
    # Lightweight script to test many models and find winners
    # :param X_train: training split
    # :param y_train: training target vector
    # :param X_test: test split
    # :param y_test: test target vector
    # :return: DataFrame of predictions
    models = [
        ('LogReg', LogisticRegression()),
        ('RF', RandomForestClassifier()),
        ('KNN - Euclidean', KNeighborsClassifier(metric='euclidean')),
        ('SVM', SVC()),
        ('XGB', XGBClassifier(use_label_encoder=False, eval_metric='error'))
    ]
    names = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    # For loop that takes each model and performs training, cross-validation, prediction and evaluation
    for name, model in models:
        # Pipeline that normalizes and oversamples the dataset
        pipe = Pipeline([
            ('normalization', MinMaxScaler()),
            ('oversampling', SMOTE())
        ])
        kfold = StratifiedKFold(n_splits=5)
        # How can I call the pipeline inside the cross_validate() function?
        cv_results = cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring, verbose=3)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print('''
        {}
        {}
        {}
        '''.format(name, classification_report(y_test, y_pred), confusion_matrix(y_test, y_pred)))
        names.append(name)
I have noticed that the data I'm using needs to be normalized and oversampled before I run the script. However, since I'm using cross_validate() inside the script, normalization and oversampling have to be performed within each fold.
To do so, I created a pipeline (that normalizes and oversamples the dataset) inside the for loop (which takes each model and performs training, cross-validation, prediction and evaluation), but I'm not sure how to call the pipeline, since the estimator parameter of cross_validate() already takes the model variable to perform the prediction.
What should I do in this case?
Solution
You could integrate your model into your pipeline and then call cross_validate on the pipeline, as follows:
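# SMOTE is a sampler, so Pipeline must be imblearn's, not sklearn.pipeline.Pipeline
from imblearn.pipeline import Pipeline

pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('name', model)
])
cv_results = cross_validate(pipe, X_train, y_train, cv=kfold, scoring=scoring, verbose=3)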
Answered By - Antoine Dubuis