Issue
I'm trying to create a pipeline that combines:
- Pipeline for all kinds of features, no matter the type (cleaning incorrect data by feature)
- Pipeline for categorical features (categorical imputer)
- Pipeline for numerical features (numerical imputer)
in a sklearn.compose.ColumnTransformer.
Here is a piece of code showing what I'm trying to do:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

alltypes = Pipeline([
    ('column_name_normalizer', ColumnNameNormalizer()),
    ('column_incorrect_data_cleaner', ColumnIncorrectDataCleaner(some_parameter)),
])
num_pipeline = Pipeline([
    ('imputer', CustomNumImputer(some_parameter)),  # fill in missing values
])
cat_pipeline = Pipeline([
    ("cat", CustomCatImputer(some_parameter))
])
full_pipeline = ColumnTransformer([
    ("alltypes", alltypes, allcolumns),
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat)
])
try:
    # a sparse result has .toarray(); a dense one raises AttributeError
    X = pd.DataFrame(full_pipeline.fit_transform(X).toarray())
except AttributeError:
    X = pd.DataFrame(full_pipeline.fit_transform(X))
However, I end up with a dataframe that has more features than I started with, because the outputs of all the pipelines are concatenated instead of a UNION operation being performed on them.
For instance, I want to apply some transformations to all features, then some to the categorical features and some to the numerical features, but the output dataframe should always keep the same size.
Do you know how I can fix this?
Solution
You need to combine the sequential power of Pipeline with ColumnTransformer, which applies its transformers in parallel to subsets of columns, e.g.
cat_num_split = ColumnTransformer([
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])
full_pipeline = Pipeline([
    ("alltypes", alltypes),
    ("cat_num", cat_num_split),
])
There is a catch here: the alltypes transformer will result in a numpy array without information about which columns are which; your cat_num_split feature lists numfeat and catfeat will rely on your knowledge of which columns are which, and cannot use the column names.
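One way to live with that catch is to select columns by integer position in the second stage. The following is only a minimal sketch under assumptions: the toy dataframe, the to_array helper and the num_idx / cat_idx lists are hypothetical, and SimpleImputer stands in for the custom imputers from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: columns 0 and 1 are numeric, column 2 is categorical.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [10.0, 20.0, np.nan],
    "city": ["a", np.nan, "a"],
})

def to_array(df):
    # Stand-in for the common cleaning step; returning an array drops the column names.
    return df.to_numpy()

alltypes = FunctionTransformer(to_array)

# The names are gone after the first step, so columns are addressed by position.
num_idx = [0, 1]
cat_idx = [2]

cat_num_split = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_idx),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_idx),
])

full_pipeline = Pipeline([
    ("alltypes", alltypes),
    ("cat_num", cat_num_split),
])

out = full_pipeline.fit_transform(X)
print(out.shape)  # same number of columns as the input
```

The positions have to match the column order produced by the first step, so they need updating whenever that step adds, drops or reorders columns.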
An alternative that doesn't run into the feature name issue is to switch the order here:
num_full_pipe = Pipeline([
    ("common", alltypes),
    ("num", num_pipeline),
])
cat_full_pipe = Pipeline([
    ("common", alltypes),
    ("cat", cat_pipeline),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])
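To check that this layout keeps the column count, here is a small runnable sketch. It rests on assumptions: the toy dataframe and the clean function are hypothetical placeholders for the real data and the common cleaning step, and SimpleImputer stands in for the custom imputers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data with missing values in every column.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "city": ["a", np.nan, "a"],
    "income": [10.0, 20.0, np.nan],
})
numfeat = ["age", "income"]
catfeat = ["city"]

def clean(df):
    # Placeholder for the common cleaning step; real cleaning logic would go here.
    return df.copy()

alltypes = FunctionTransformer(clean)

num_full_pipe = Pipeline([
    ("common", alltypes),
    ("num", SimpleImputer(strategy="mean")),
])
cat_full_pipe = Pipeline([
    ("common", alltypes),
    ("cat", SimpleImputer(strategy="most_frequent")),
])

# Column names still work here, because ColumnTransformer sees the original dataframe.
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])

out = full_pipeline.fit_transform(X)
# Output columns follow the transformer order: numfeat first, then catfeat.
result = pd.DataFrame(out, columns=numfeat + catfeat)
print(result.shape)
```

Note that the common step now runs once per branch and only sees that branch's columns, so this layout fits cleaning that works feature by feature, as in the question.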
See also Consistent ColumnTransformer for intersecting lists of columns.
Answered By - Ben Reiniger