Issue
What is the right way to implement SMOTE() in a classification modeling process? I am really confused about where SMOTE() should be applied. Say I have the dataset split into train and test like this as a starter:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Some dataset initialization
X = df.drop(['things'], axis = 1)
y = df['things']
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# SMOTE() on the train dataset:
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
After applying SMOTE() on the training set for the classification problem above, my questions are:
- Should I apply SMOTE() inside the pipeline after splitting the dataset above, like this?:
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
('over', SMOTE(random_state = 42)),
('model', LogisticRegression(random_state = 42))])
# Then do model evaluation with Repeated Stratified KFold,
# Then do Grid Search for hyperparameter tuning
# Then do the actual model testing with unseen X_test (Like this):
cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 42)
params = {'model__penalty': ['l1', 'l2'],
'model__C':[0.001, 0.01, 0.1, 5, 10, 100]}
grid = GridSearchCV(estimator = pipeline,
param_grid = params,
scoring = 'roc_auc',
cv = cv,
n_jobs = -1)
grid.fit(X_train_smote, y_train_smote)
cv_score = grid.best_score_
test_score = grid.score(X_test, y_test)
print(f"Cross-validation score: {cv_score} \n Test Score: {test_score}")
- Or, should I apply the pipeline without calling SMOTE(), like this?:
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
('model', LogisticRegression(random_state = 42))])
# Same process as above for modeling, evaluation, etc...
- Or, should I keep SMOTE() inside the pipeline but fit the grid search on the original (un-SMOTEd) training data, like this?:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
('over', SMOTE(random_state = 42)),
('model', LogisticRegression(random_state = 42))])
# Same process as above for modeling, evaluation, etc...
# BUT!, when fitting grid.fit(), we do this?:
grid.fit(X_train, y_train)
- Or, should I resample the training data with SMOTE() first and then use a plain sklearn Pipeline, like this?:
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
pipeline = Pipeline(steps = [('scale', StandardScaler()),
('model', LogisticRegression(random_state = 42))])
# Same process as above for modeling, evaluation, etc...
# BUT!, when fitting grid.fit(), we do this?:
grid.fit(X_train_smote, y_train_smote)
Solution
In general, you want to SMOTE the training data but not the validation or test data. So if you want to use folded cross-validation, you cannot SMOTE the data before sending it into that process.
- No, you are running SMOTE twice (once before and once inside the pipeline). You also end up with SMOTEd points in the validation folds, which you don't want.
- No, you will have SMOTEd points in the validation folds.
- Yes, this is the way to do it: with SMOTE inside the pipeline, it is re-fit on the training portion of each fold only, and the validation folds stay untouched.
- No, you will have SMOTEd points in the validation folds.
I recommend looking at sklearn.metrics.roc_auc_score() as well as whatever other metrics you use, because it can reveal issues caused by improperly splitting resampled data. (SMOTEd points can be very predictable, but they do not improve the AUC.)
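On that point, roc_auc_score() should be given the positive-class probability scores rather than hard class predictions; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(random_state=42).fit(X_train, y_train)

# roc_auc_score() ranks the positive-class scores, so pass
# predict_proba()[:, 1], not predict()
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```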
Answered By - Matt Hall