Issue
I am using RFECV for feature selection in scikit-learn. I would like to fit an XGBoost model on log(y), because I have found that it performs better than fitting y directly.
Simple model without transformation: no problem, RFECV runs fine and I can get the number of selected features.
Log-transformed model: RFECV fails with the following error:
"ValueError: Input X contains NaN; RFECV does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values"
What I don't understand is that the simple model has no NaN issue while the log-transformed one does, and there are no NaN values in the target y.
How can I solve this and run RFECV with a log-transformed target?
import numpy as np
import xgboost as xgb
from scipy.stats import loguniform, randint, uniform
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_selection import RFECV

# Base estimator (each hyperparameter is drawn once from its distribution)
rs = 45
xgboost_reg = xgb.XGBRegressor(
    random_state=rs,
    grow_policy="depthwise",
    booster="gbtree",    # gbtree and dart use tree-based models; gblinear uses linear functions
    tree_method="auto",  # pick the best option among hist, exact and approx
    n_estimators=randint(300, 500).rvs(random_state=rs),
    subsample=uniform(0.5, 0.5).rvs(random_state=rs),
    max_depth=randint(3, 10).rvs(random_state=rs),
    learning_rate=loguniform(0.05, 0.2).rvs(random_state=rs),
    colsample_bytree=uniform(0.5, 0.5).rvs(random_state=rs),
    min_child_weight=randint(1, 20).rvs(random_state=rs),
    gamma=uniform(0.5, 1).rvs(random_state=rs),
    reg_alpha=uniform(0.0, 1.0).rvs(random_state=rs),
    reg_lambda=uniform(0.0, 1.0).rvs(random_state=rs),
    max_delta_step=randint(1, 10).rvs(random_state=rs),
)
# RFECV settings
n_features = 89
step = 20
n_scores = 2
min_features_to_select = 9
# Simple model = working
rfecv = RFECV(
    estimator=xgboost_reg,
    step=step,
    cv=4,
    scoring="neg_root_mean_squared_error",
    min_features_to_select=min_features_to_select,
    n_jobs=-1,
)
rfecv.fit(x, y)
print(rfecv.n_features_)
# Log-transformed model = error
log_estimator = TransformedTargetRegressor(
    regressor=xgboost_reg,
    func=np.log,
    inverse_func=np.exp,
)
rfecv_log = RFECV(
    estimator=log_estimator,
    step=step,
    cv=4,
    scoring="neg_root_mean_squared_error",
    min_features_to_select=min_features_to_select,
    n_jobs=-1,
)
rfecv_log.fit(x, y)
print(rfecv_log.n_features_)
Solution
Revised Answer: The error is caused by NaN values in x, not in y. Updating scikit-learn to version 1.4.0 resolves it, since RFECV can then accept NaN values in X and leave their handling to the underlying estimator (XGBoost supports missing values natively). That is presumably also why only the wrapped model failed: the plain XGBRegressor advertises its NaN support to scikit-learn, while the TransformedTargetRegressor wrapper did not pass that flag through to RFECV's input validation in older versions.
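To confirm the diagnosis, it is worth inspecting x directly; and if upgrading scikit-learn is not an option, the imputer route suggested by the error message itself also works. A minimal sketch with made-up data (the asker's x is not shown):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix standing in for the asker's x,
# with one missing value in the first column.
x = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# Diagnose: this is the condition that makes RFECV raise the ValueError.
print(np.isnan(x).any())         # True
print(np.isnan(x).sum(axis=0))   # NaN count per feature: [1 0]

# Workaround: impute the missing values before feature selection.
x_imputed = SimpleImputer(strategy="median").fit_transform(x)
print(np.isnan(x_imputed).any())  # False -> safe to pass to RFECV
```

The per-column NaN count also tells you which features are affected, which can be useful before deciding between imputing and dropping them.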
Note: The original suggestion was to check for negative values in the target, since np.log() returns NaN for negative inputs; that turned out not to be the cause of the problem here.
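For reference, the behaviour that original suggestion was guarding against is easy to reproduce: np.log returns NaN for negative inputs (and -inf for zero), so a target containing non-positive values would silently inject NaN into the transformed y. A small demonstration:

```python
import numpy as np

# np.log emits runtime warnings for non-positive inputs;
# silence them just for this demonstration.
with np.errstate(invalid="ignore", divide="ignore"):
    transformed = np.log(np.array([10.0, -1.0, 0.0]))

print(transformed)            # [2.30258509  nan  -inf]
print(np.isnan(transformed))  # [False  True False]
```

This is why checking y for non-positive values is a sensible first step whenever a log transform suddenly produces NaN errors.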
Answered By - Wieland