Issue
I am using the H2O AutoML library from Python with its scikit-learn wrapper to build a pipeline for training my model. I am following this example, recommended by the official documentation:
from sklearn import datasets
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from h2o.sklearn import H2OAutoMLClassifier
# Example classification dataset (as pandas objects) so the snippet runs end to end
X_classes, y_classes = datasets.load_breast_cancer(return_X_y=True, as_frame=True)

X_classes_train, X_classes_test, y_classes_train, y_classes_test = train_test_split(
    X_classes, y_classes, test_size=0.33)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('featselect', SelectKBest(f_classif, k=5)),
    ('classifier', H2OAutoMLClassifier(max_models=10, seed=2022, sort_metric='logloss')),
])

pipeline.fit(X_classes_train, y_classes_train)
preds = pipeline.predict(X_classes_test)
So, I've trained my pipeline/model. Now I want to get the H2OAutoML object out of the H2OAutoMLClassifier wrapper so I can call its .explain() method and get some insight into the features and models. How do I do that?
Solution
There's no easy way to use .explain() on scikit-learn's pipeline. What you can do is extract the H2OAutoML leader model (the best model trained during the AutoML run) and call .explain() on that.
For .explain() to work you'll need an H2OFrame with the same features that were used to train the model, and that's the problem for both interpretability and ease of use: you will need to create that dataset by applying the first two steps of the pipeline (in your example, polyfeat and featselect). This alone makes the result hard to interpret, because the columns end up with generic names like C1, C2, ...
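As a side note, you can recover readable names for those columns from the fitted transformers themselves. A minimal sketch, assuming scikit-learn >= 1.0 (for get_feature_names_out) and that X_classes_train is a pandas DataFrame:

# Sketch: map the generic C1, C2, ... columns back to readable feature names
polyfeat = pipeline.named_steps['polyfeat']
featselect = pipeline.named_steps['featselect']

poly_names = polyfeat.get_feature_names_out(X_classes_train.columns)
selected_names = featselect.get_feature_names_out(poly_names)
print(selected_names)  # names of the 5 features that actually reach the classifier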
You can do the things I described using the following code:
import h2o

# Transform the test data with every step except the final classifier
transformed_df = X_classes_test
for _, step in pipeline.steps[:-1]:
    transformed_df = step.transform(transformed_df)

# The fitted H2OAutoML object lives inside the last pipeline step
automl = pipeline.steps[-1][1].estimator
leader = automl.leader
response_column = leader.actual_params["response_column"]

# Create the H2OFrame and give it the column names the leader model was trained on
h2o_frame = h2o.H2OFrame(transformed_df)
h2o_frame.columns = [c for c in leader._model_json["output"]["names"] if c != response_column]

# Add the response column
h2o_frame = h2o_frame.cbind(h2o.H2OFrame(y_classes_test.to_frame()))
h2o_frame.set_name(h2o_frame.shape[1] - 1, response_column)

# Run .explain() on the leader model
leader.explain(h2o_frame)
However, if you need interpretability and do not need to cross-validate the whole pipeline, I'd recommend another approach: use the first N-1 steps of the pipeline to create a data frame, give that data frame meaningful column names, and then run H2O AutoML directly through the h2o API. That makes .explain() and the other interpretability-related methods easier to use, and the column names will carry actual meaning rather than just reflecting column order.
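A rough sketch of that approach, under the same assumptions as above (scikit-learn >= 1.0, pandas inputs); the 'target' column name and the space-to-underscore renaming are my own choices for the example, not anything the pipeline requires:

import h2o
import pandas as pd
from h2o.automl import H2OAutoML

h2o.init()

# Preprocess with the first two steps only and keep readable feature names
polyfeat = pipeline.named_steps['polyfeat']
featselect = pipeline.named_steps['featselect']
X_train_t = featselect.transform(polyfeat.transform(X_classes_train))
feature_names = [name.replace(' ', '_')  # names like 'x0 x1' become 'x0_x1'
                 for name in featselect.get_feature_names_out(
                     polyfeat.get_feature_names_out(X_classes_train.columns))]

# Build a pandas frame with those names, then an H2OFrame
train_df = pd.DataFrame(X_train_t, columns=feature_names, index=X_classes_train.index)
train_df['target'] = y_classes_train
train_hf = h2o.H2OFrame(train_df)
train_hf['target'] = train_hf['target'].asfactor()  # classification response

# Train AutoML through the native h2o API
aml = H2OAutoML(max_models=10, seed=2022, sort_metric='logloss')
aml.train(x=feature_names, y='target', training_frame=train_hf)

# .explain() now works directly, with meaningful column names
aml.explain(train_hf)  # or build a test frame the same way and pass that instead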
Answered By - Tomáš Frýda