Monday, March 7, 2022

[FIXED] sklearn: Adding names to feature_importances_/coef_ when using ColumnTransformer and Pipeline

March 07, 2022 python, scikit-learn

Issue

I am working on the Kaggle insurance pricing competition (https://www.kaggle.com/floser/french-motor-claims-datasets-fremtpl2freq). The data looks like this:

   IDpol  ClaimNb  Exposure  Area  VehPower  VehAge  DrivAge  BonusMalus  VehBrand   VehGas  Density  Region
0    1.0        1      0.10     D         5       0       55          50       B12  Regular     1217     R82
1    3.0        1      0.77     D         5       0       55          50       B12  Regular     1217     R82
2    5.0        1      0.75     B         6       2       52          50       B12   Diesel       54     R22
3   10.0        1      0.09     B         7       0       46          50       B12   Diesel       76     R72
4   11.0        1      0.84     B         7       0       46          50       B12   Diesel       76     R72

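For preprocessing, I am using a Pipeline that includes a ColumnTransformer:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, KBinsDiscretizer,
                                   OneHotEncoder, StandardScaler)

# Column groups, one per preprocessing step
pt_columns = ['BonusMalus', 'RBVAge']
log_columns = ['Density']
kbins_columns = ['VehAge', 'VehPower', 'DrivAge']
cat_columns = ['Area', 'Region', 'VehBrand', 'VehGas']

# Manually added indicator: regular-fuel B12 vehicles with VehAge == 0
X_train['RBVAge'] = 0
X_train.loc[(X_train['VehGas'] == 'Regular') & (X_train['VehBrand'] == 'B12') & (X_train['VehAge'] == 0), 'RBVAge'] = 1

ct = ColumnTransformer([('pt', 'passthrough', pt_columns),
                        ('log', FunctionTransformer(np.log1p, validate=False), log_columns),
                        ('kbins', KBinsDiscretizer(), kbins_columns),
                        ('ohe', OneHotEncoder(), cat_columns)])
pipe_poisson_reg = Pipeline([('cf_trans', ct),
                             ('ssc', StandardScaler(with_mean=False)),
                             ('poisson_regressor', PoissonRegressor())])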

Once the model is fitted, I'd like to display the feature importances together with the feature names, like this:

name	feature_importances_
Area_A	0.25
Area_B	0.10
VehAge	0.30
...	...

The problem I am facing is getting the feature names when a ColumnTransformer is involved. In particular, getting the feature names out of the KBinsDiscretizer, which also uses one-hot encoding, is not straightforward. So far I have tried building a numpy array of feature names by hand, but as I said, I cannot get the names from the KBinsDiscretizer, and this approach does not seem very elegant anyway.

columnNames = np.append(pt_columns, pipe_poisson_reg['cf_trans'].transformers_[3][1].get_feature_names(cat_columns))

Is there a simple (maybe even built-in) way to create a DataFrame that contains both the feature names and the feature importances?

And since we are already here (this might be a bit off topic): is there a simple way to create a custom transformer that adds the new column 'RBVAge', which I currently add manually?

Thanks in advance

Solution

This will be a whole lot easier in the upcoming v1.0 of scikit-learn (or with the current GitHub code), which includes PR18444. That PR adds a get_feature_names_out method to KBinsDiscretizer as well as to Pipeline (and others).

If you are stuck on your version of sklearn, you can look at the source code of that method and patch together the same thing yourself. It does not appear to do very much: it just delegates to the internal _encoder object, and OneHotEncoder has had a get_feature_names method for a while (soon to be replaced by get_feature_names_out).
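For older versions, one way to approximate this without touching private attributes is to build the names from the fitted n_bins_ attribute, which holds the number of bins per input column. This is only a sketch: the helper kbins_feature_names is hypothetical (it is not part of scikit-learn), the data is random, and the "column_bin" naming scheme is my own choice, so the labels may differ slightly from what the built-in method produces.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Random stand-in data with one column per discretized feature (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
kbins_columns = ['VehAge', 'VehPower', 'DrivAge']

# Default settings: 5 quantile bins per column, one-hot encoded output
kbins = KBinsDiscretizer().fit(X)


def kbins_feature_names(fitted_kbins, columns):
    """Hypothetical helper: one '<column>_<bin index>' name per one-hot output column."""
    return [f'{col}_{b}'
            for col, n_bins in zip(columns, fitted_kbins.n_bins_)
            for b in range(n_bins)]


names = kbins_feature_names(kbins, kbins_columns)
print(names)
```

The resulting list can then be concatenated with the passthrough, log, and one-hot names in the same order in which the ColumnTransformer lists its transformers.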

Answered By - Ben Reiniger

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
