Issue
I created a custom Pipeline that adds one column 'Message Length', encodes categorical & boolean columns, and drops selected columns.
def custom_pipeline(to_drop: list = [], features_out: bool = False) -> Pipeline:
    # Add 'Message Length' attribute based on the 'Raw Message' column
    attrib_adder = AttributeAdder(attribs_in=['Raw Message'], attribs_out=['Message Length'], func=get_message_length)
    # Define the column transformer
    preprocessor = ColumnTransformer(transformers=[
        ('virus_scanned', enumerate_virus_scanned, ['X-Virus-Scanned']),
        ('priority', enumerate_priority, ['X-Priority']),
        ('encoding', enumerate_encoding, ['Encoding']),
        ('flags', enumerate_bool, ['Is HTML', 'Is JavaScript', 'Is CSS']),
        ('select', 'passthrough', ['Attachments', 'URLs', 'IPs', 'Images', 'Message Length']),
        ('drop_out', 'drop', to_drop)  # --> This does not work
    ])
    # Define pipeline
    pipe = Pipeline(steps=[
        ('attrib_adder', attrib_adder),
        ('preprocessor', preprocessor),
        ('scaler', MinMaxScaler())
    ])
    # Get features out
    if features_out:
        features = [col for col in chain(*[cols for _, _, cols in preprocessor.transformers[:-1]]) if col not in to_drop]
        # Return pipeline and features
        return pipe, features
    # Return pipeline
    return pipe
Unfortunately, the last 'drop_out' transformer does not drop columns.
For example, even if I pass
to_drop = ['Attachments', 'Message Length']
it still preserves them in the output.
What might be the possible solution?
Solution
The transformers are applied completely separately in parallel. So in these two lines:
    ('select', 'passthrough', ['Attachments', 'URLs', 'IPs', 'Images', 'Message Length']),
    ('drop_out', 'drop', to_drop)  # --> This does not work
you tell the ColumnTransformer both to pass 'Attachments' through untouched and to drop 'Attachments'. The outputs of the individual transformers are concatenated, and the 'drop' entry contributes nothing, so the result still contains a single copy of 'Attachments'.
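To make the parallel behaviour concrete, here is a minimal sketch on a toy DataFrame. The data and the reduced column set are made up for illustration and are not from the original post; only the 'select' and 'drop_out' entries mirror the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Toy data (hypothetical values, only for illustration)
df = pd.DataFrame({
    'Attachments': [0, 2, 1],
    'URLs': [3, 0, 5],
    'Images': [1, 1, 0],
})

ct = ColumnTransformer(transformers=[
    # Passes 'Attachments' and 'URLs' through unchanged
    ('select', 'passthrough', ['Attachments', 'URLs']),
    # Produces no output for 'Attachments'; it does not undo 'select'
    ('drop_out', 'drop', ['Attachments']),
])

out = ct.fit_transform(df)
print(out.shape)  # (3, 2): 'Attachments' is still there
```

'Images' disappears only because it is not listed in any transformer and remainder is left at its default, not because of the 'drop_out' entry.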
Using 'drop' as a transformer is unlikely to be useful except when exploring different options, where you might take some column and switch its transformer from 'passthrough' to 'drop' to StandardScaler, etc.
If all you want to do is remove 'Attachments' (and others), simply remove them from the list of features in the 'select' transformer; since you have left the remainder parameter at its default of 'drop', any columns not listed for any of the transformers will be dropped.
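One way to apply this is to filter the column list before building the transformer. The sketch below assumes a hypothetical helper, make_preprocessor, and toy data; neither is part of the original post, and the custom encoders from the question are left out to keep it self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical helper (not from the original post): keep only the
# passthrough columns that are not listed in to_drop.
def make_preprocessor(passthrough_cols, to_drop=()):
    kept = [col for col in passthrough_cols if col not in to_drop]
    preprocessor = ColumnTransformer(transformers=[
        ('select', 'passthrough', kept),
    ])  # remainder defaults to 'drop', so unlisted columns are removed
    return preprocessor, kept

# Toy data (hypothetical values, only for illustration)
df = pd.DataFrame({
    'Attachments': [0, 2, 1],
    'URLs': [3, 0, 5],
    'IPs': [1, 1, 0],
    'Message Length': [120, 45, 300],
})

preprocessor, kept = make_preprocessor(
    ['Attachments', 'URLs', 'IPs', 'Message Length'],
    to_drop=['Attachments', 'Message Length'],
)
out = preprocessor.fit_transform(df)
print(kept)       # ['URLs', 'IPs']
print(out.shape)  # (3, 2)
```

In the question's custom_pipeline, the same filter would presumably be applied to each transformer's column list, after which the 'drop_out' entry is no longer needed.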
Answered By - Ben Reiniger