Issue
I'm wondering whether scikit-learn's RFECV can select a fixed number of the most important features. For example, working on a dataset with 617 features, I have been trying to use RFECV to see which 5 of those features are the most significant. However, RFECV does not have the parameter n_features_to_select, unlike RFE (which confuses me). How should I deal with this?
Solution
According to this Quora post:
The RFECV object helps to tune or find this n_features parameter using cross-validation. For every step where "step" number of features are eliminated, it calculates the score on the validation data. The number of features left at the step which gives the maximum score on the validation data is considered to be "the best n_features" of your data.
In other words, RFECV determines the optimal number of features (n_features) itself, by keeping the count that gives the best cross-validated score.
The fitted RFECV object exposes a ranking_ attribute with the rank of every feature, and a support_ boolean mask that selects the optimal features it found.
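These attributes can be seen on a fitted selector. A minimal sketch (the toy dataset and the LogisticRegression estimator are illustrative assumptions, not from the question):

```python
# Minimal sketch of fitting RFECV and reading support_ / ranking_.
# The toy data and LogisticRegression estimator are assumptions for
# illustration only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)  # number of features RFECV decided to keep
print(selector.support_)     # boolean mask over the 20 columns
print(selector.ranking_)     # rank 1 = selected; higher = eliminated earlier
```

Note that RFECV chooses n_features_ itself; you cannot force it to a fixed count through the constructor.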
However, if you MUST select the top n features from RFECV, you can use the ranking_ attribute:
optimal_features = X[:, selector.support_]  # selector is a fitted RFECV object
n = 6  # to select the top 6 features
feature_ranks = selector.ranking_  # rank 1 means the feature was selected
feature_ranks_with_idx = enumerate(feature_ranks)
sorted_ranks_with_idx = sorted(feature_ranks_with_idx, key=lambda x: x[1])
top_n_idx = [idx for idx, rnk in sorted_ranks_with_idx[:n]]
top_n_features = X[:, top_n_idx]  # all rows, only the top-n columns
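Alternatively, since plain RFE (as the question notes) does accept n_features_to_select, you can fit RFE directly when you want exactly n features. A sketch under the same toy-data assumption:

```python
# Plain RFE accepts n_features_to_select directly, unlike RFECV.
# The toy data and LogisticRegression estimator are assumptions for
# illustration only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

X_top5 = X[:, rfe.support_]
print(X_top5.shape)  # (100, 5)
```

The trade-off is that RFE gives you exactly the count you ask for, while RFECV picks the count by cross-validation.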
References: scikit-learn documentation, Quora post
Answered By - shanmuga