Issue
I just started learning CatBoost and tried to use CatBoostRegressor with StratifiedKFold, but ran into an error.
I have edited the post to include the full code and the full error for clarification. I also tried for i, (train_index, test_index) in enumerate(fold.split(X,y)): but that did not work either.
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import LabelEncoder
from catboost import Pool, CatBoostRegressor
fold=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)
err = []
y_pred = []
for train_index, test_index in fold.split(X,y):
#for i, (train_index, test_index) in enumerate(fold.split(X,y)):
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
    _train = Pool(X_train, label = y_train)
    _valid = Pool(X_val, label = y_val)
    cb = CatBoostRegressor(n_estimators = 20000,
                           reg_lambda = 1.0,
                           eval_metric = 'RMSE',
                           random_seed = 42,
                           learning_rate = 0.01,
                           od_type = "Iter",
                           early_stopping_rounds = 2000,
                           depth = 7,
                           cat_features = cate,
                           bagging_temperature = 1.0)
    cb.fit(_train,cat_features=cate,eval_set = _valid, early_stopping_rounds = 2000, use_best_model = True, verbose_eval = 100)
    p = cb.predict(X_val)
    print("err: ",rmsle(y_val,p))
    err.append(rmsle(y_val,p))
    pred = cb.predict(test_df)
    y_pred.append(pred)
predictions = np.mean(y_pred,0)
ValueError Traceback (most recent call last)
<ipython-input-21-3a0df0c7b8d6> in <module>()
7 err = []
8 y_pred = []
----> 9 for train_index, test_index in fold.split(X,y):
10 #for i, (train_index, test_index) in enumerate(fold.split(X,y)):
11 X_train, X_val = X.iloc[train_index], X.iloc[test_index]
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
333 .format(self.n_splits, n_samples))
334
--> 335 for train, test in super().split(X, y, groups):
336 yield train, test
337
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
87 X, y, groups = indexable(X, y, groups)
88 indices = np.arange(_num_samples(X))
---> 89 for test_index in self._iter_test_masks(X, y, groups):
90 train_index = indices[np.logical_not(test_index)]
91 test_index = indices[test_index]
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
684
685 def _iter_test_masks(self, X, y=None, groups=None):
--> 686 test_folds = self._make_test_folds(X, y)
687 for i in range(self.n_splits):
688 yield test_folds == i
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y)
639 raise ValueError(
640 'Supported target types are: {}. Got {!r} instead.'.format(
--> 641 allowed_target_types, type_of_target_y))
642
643 y = column_or_1d(y)
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
Solution
You get the error for a fundamental reason from basic ML theory: stratification is defined only for classification, where it ensures that every class is represented in the same proportion in each split; it is meaningless in regression. Reading the error message closely, you should be able to convince yourself that it says 'continuous' targets (i.e. regression) are not supported, only 'binary' or 'multiclass' ones (i.e. classification). This is not some peculiarity of scikit-learn, but a fundamental issue.
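To see how scikit-learn labels a target before it decides whether stratification is possible, you can call type_of_target yourself. This is only a minimal sketch: the two arrays are made-up illustrations, not the asker's data, and the claim that StratifiedKFold relies on this helper is inferred from the type_of_target_y name in the traceback above.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, chosen only for illustration.
y_regression = np.array([0.1, 0.5, -1.1, 1.2])   # real-valued targets
y_classification = np.array([0, 1, 1, 0])        # two discrete class labels

# type_of_target returns a string naming the kind of target it detected.
kind_regression = type_of_target(y_regression)
kind_classification = type_of_target(y_classification)

print(kind_regression)        # the category named in the error message
print(kind_classification)    # a category StratifiedKFold accepts
```

If the first call reports a continuous target for your own y, StratifiedKFold will refuse it no matter how the loop around it is written.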
A relevant hint is also included in the documentation (emphasis added):
Stratified K-Folds cross-validator
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
Here is a short demonstration, adapting the example from the documentation but changing the targets y to be continuous (regression) instead of discrete (classification):
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0.1, 0.5, -1.1, 1.2]) # continuous targets, i.e. regression problem
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X,y):
    print("something")
[...]
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
So, simply speaking, you cannot use StratifiedKFold in your (regression) setting; change it to plain KFold and move on from there.
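As a rough sketch of that change, here is the same toy data from the demonstration split with KFold instead. The shuffle and random_state values simply mirror the ones in the question; they are not required.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0.1, 0.5, -1.1, 1.2])  # continuous targets, i.e. regression problem

# KFold splits purely by row position, so the kind of target does not matter.
kf = KFold(n_splits=2, shuffle=True, random_state=42)

splits = list(kf.split(X, y))
for train_index, test_index in splits:
    # Every row lands in the test part of exactly one fold.
    print("train:", train_index, "test:", test_index)
```

In the question's code, only the line that builds fold should need to change; the body of the loop can stay as it is. If y is a pandas Series whose index is not the default 0..n-1 range (an assumption, since the question does not show how y is built), y.iloc[train_index] is the safer way to select rows by position.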
Answered By - desertnaut