Issue
I have defined the following pipelines using scikit-learn:
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])
Then I used cross-validation to evaluate the performance of each model:
cv_results_lg = cross_validate(model_lg, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_dt = cross_validate(model_dt, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_gb = cross_validate(model_gb, data, target, cv=5, return_train_score=True, return_estimator=True)
When I try to inspect the feature importance for each model via the coef_
attribute, I get an AttributeError:
model_lg.steps[1][1].coef_
AttributeError: 'LogisticRegression' object has no attribute 'coef_'
model_dt.steps[1][1].coef_
AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'
model_gb.steps[1][1].coef_
AttributeError: 'HistGradientBoostingClassifier' object has no attribute 'coef_'
I was wondering how I can fix this error, or whether there is another approach to inspect the feature importance of each model?
Solution
Imo, the point here is the following. On the one hand, the pipeline instances model_lg, model_dt etc. are never explicitly fitted (you're not calling the .fit() method on them directly), and this is what prevents you from accessing the coef_ attribute on the instances themselves.
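As a quick sanity check, explicitly fitting a pipeline does make the attribute accessible (a minimal sketch on the iris data, just for illustration):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_lg.fit(X, y)  # fit the pipeline explicitly
model_lg.named_steps["classifier"].coef_  # now available on the fitted instance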
On the other hand, by calling cross_validate() with the parameter return_estimator=True (which, among the cross-validation utilities, is only available with cross_validate()), you do get back the fitted estimators for each cv split, but you should access them via your result dictionaries cv_results_lg, cv_results_dt etc. (under the 'estimator' key). Here's the reference in the code and here's an example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
X, y = load_iris(return_X_y=True)
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_train_score=True, return_estimator=True)
These would be, for instance, the coefficients of the classifier fitted on the first cv split:
cv_results_lg['estimator'][0].named_steps['classifier'].coef_
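The same pattern should carry over to the other pipelines, with the caveat that the tree-based models don't expose coef_: DecisionTreeClassifier stores its impurity-based importances in feature_importances_, while HistGradientBoostingClassifier exposes neither attribute, so permutation_importance is the usual fallback. A sketch continuing from the snippet above (the extra imports are the only new ingredients):
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
cv_results_dt = cross_validate(model_dt, X, y, cv=5, return_train_score=True, return_estimator=True)
# impurity-based importances of the tree fitted on the first split
cv_results_dt['estimator'][0].named_steps['classifier'].feature_importances_
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])
cv_results_gb = cross_validate(model_gb, X, y, cv=5, return_train_score=True, return_estimator=True)
# no feature_importances_ here; fall back to permutation importance on the fitted pipeline
perm = permutation_importance(cv_results_gb['estimator'][0], X, y, n_repeats=5, random_state=0)
perm.importances_mean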
Useful insights on related topics might be found in:
- How to get feature importances of a multi-label classification problem?
- Get support and ranking attributes for RFE using Pipeline in Python 3
Answered By - amiola