Issue
I am working on the Kaggle insurance pricing competition (https://www.kaggle.com/floser/french-motor-claims-datasets-fremtpl2freq). The data looks like this:
IDpol ClaimNb Exposure Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas Density Region
0 1.0 1 0.10 D 5 0 55 50 B12 Regular 1217 R82
1 3.0 1 0.77 D 5 0 55 50 B12 Regular 1217 R82
2 5.0 1 0.75 B 6 2 52 50 B12 Diesel 54 R22
3 10.0 1 0.09 B 7 0 46 50 B12 Diesel 76 R72
4 11.0 1 0.84 B 7 0 46 50 B12 Diesel 76 R72
For preprocessing etc. I am using a Pipeline including a ColumnTransformer:
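import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, OneHotEncoder, StandardScaler

# Column groups, one per preprocessing step
pt_columns = ['BonusMalus', 'RBVAge']
log_columns = ['Density']
kbins_columns = ['VehAge', 'VehPower', 'DrivAge']
cat_columns = ['Area', 'Region', 'VehBrand', 'VehGas']
# Manually added indicator: new (VehAge == 0) B12 vehicles running on regular fuel
X_train['RBVAge'] = 0
X_train.loc[(X_train['VehGas'] == 'Regular') & (X_train['VehBrand'] == 'B12') & (X_train['VehAge'] == 0), 'RBVAge'] = 1
ct = ColumnTransformer([('pt', 'passthrough', pt_columns),
                        ('log', FunctionTransformer(np.log1p, validate=False), log_columns),
                        ('kbins', KBinsDiscretizer(), kbins_columns),
                        ('ohe', OneHotEncoder(), cat_columns)])
pipe_poisson_reg = Pipeline([('cf_trans', ct),
                             ('ssc', StandardScaler(with_mean=False)),
                             ('poisson_regressor', PoissonRegressor())])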
Once the model is fitted I'd like to display the feature importances and names of the features like this:
name | feature_importances_ |
---|---|
Area_A | 0.25 |
Area_B | 0.10 |
VehAge | 0.30 |
... | ... |
The problem I am facing is to get the feature names while using a ColumnTransformer. Especially getting the feature names from the KBinsDiscretizer, which uses OneHot-encoding as well, does not work easily. What I have tried so far is creating a numpy array with the feature names manually, but as I have said I cannot manage to get the feature names from the KBinsDiscretizer and this solution does not seem very elegant.
columnNames = np.append(pt_columns, pipe_poisson_reg['cf_trans'].named_transformers_['ohe'].get_feature_names(cat_columns))
Is there a simple (maybe even built-in) way to create a DataFrame including both the feature names and feature importances?
Well and since we are already here (this might be a bit off topic): Is there a simple way to create a custom ColumnTransformer which adds the new column 'RBVAge', which I currently add manually?
Thanks in advance
Solution
This will be a whole lot easier in the upcoming v1.0 (or you can get the current github code), which includes PR18444, adding a get_feature_names_out to KBinsDiscretizer as well as Pipeline (and others).
If you are stuck on your version of sklearn, you can have a look at the source code that does that to help you patch together the same; it appears to not do very much itself, just calling to the internal _encoder object to do the work, and OneHotEncoder has had a get_feature_names for a while (soon to be replaced by get_feature_names_out).
Answered By - Ben Reiniger
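On the side question about RBVAge, which the answer above does not cover: one possible approach is to wrap the column logic in a FunctionTransformer and put it in front of the ColumnTransformer. This is only a sketch; the function name add_rbvage and the three-row demo frame are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_rbvage(X):
    """Return a copy of X with the indicator column 'RBVAge' added."""
    X = X.copy()
    mask = ((X['VehGas'] == 'Regular')
            & (X['VehBrand'] == 'B12')
            & (X['VehAge'] == 0))
    X['RBVAge'] = mask.astype(int)
    return X

# validate defaults to False, so the DataFrame reaches the function unchanged
rbvage_adder = FunctionTransformer(add_rbvage)

demo = pd.DataFrame({'VehGas': ['Regular', 'Diesel', 'Regular'],
                     'VehBrand': ['B12', 'B12', 'B1'],
                     'VehAge': [0, 0, 0]})
out = rbvage_adder.fit_transform(demo)
print(out)
```

Placed as the first pipeline step, e.g. Pipeline([('add_rbvage', rbvage_adder), ('cf_trans', ct), ...]), the column would then be created at both fit and predict time instead of being added to X_train by hand.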