Issue
I have a few columns (possibly just one) that need imputation; I don't want to impute all the columns.
Once imputation is done, I would like to apply StandardScaler to a larger set of columns, including the ones that were just imputed. I also have other transformations besides these and would like to preserve the order of the columns.
Is there a way to achieve this using ColumnTransformer and Pipeline?
I see that within a ColumnTransformer you can use a Pipeline for the imputation and the scaling (the pipeline imputes and then scales). But then both the imputation and the scaling are performed on all the columns given to that pipeline.
I need each of them to be performed on its own subset of columns. Is there a way to achieve this?
Update:
Here is an example of the issue
I want to perform target encoding on some columns (say c1, c2, c3) and imputation on another column (c4). Once the target encoding and imputation are done, I want to apply StandardScaler to all of these and more columns (c1, c2, c3, c4, c5, c6, c7, c8).
Note that I also have more columns that need other transformations, say OHE, but those columns are disjoint from c1...c8.
I am not sure how to use ColumnTransformer and Pipeline when different transformations share some of the same columns, as in the example above. Is there a way to do this as part of the pipeline? Thank you.
Solution
TL;DR: for your use case, jump to the last section (Hybrid).
One ColumnTransformer, many Pipelines
This is what I suggested in the comments and have advocated elsewhere on this site. We use a single ColumnTransformer, each of whose transformers is a Pipeline. There is one pipeline for each combination of preprocessing steps you would like to perform. The advantage is that you can specify columns by name for each transformer. The downsides: there are many separate copies of the scaler, so tuning a hyperparameter of one transformer means changing it in several places; and if you have many distinct combinations of preprocessing steps, this takes a lot of code to specify, although there are ways to partially mitigate that.
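from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Assumption: TargetEncoder from scikit-learn >= 1.3 (category_encoders.TargetEncoder is another option)
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder

# One pipeline per distinct combination of preprocessing steps
pipe_target_encode = Pipeline([
    ("te", TargetEncoder()),
    ("sc", StandardScaler()),
])
pipe_impute = Pipeline([
    ("imp", SimpleImputer()),
    ("sc", StandardScaler()),
])
preproc = ColumnTransformer([
    ("target_enc", pipe_target_encode, ["c1", "c2", "c3"]),
    ("impute", pipe_impute, ["c4"]),
    ("scale", StandardScaler(), ["c5", "c6", "c7", "c8"]),
    ("ohe", OneHotEncoder(), ["c9", "c10"]),
])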
One Pipeline, many ColumnTransformers
This one becomes easier with pandas output (available through set_output(transform="pandas") since scikit-learn 1.2), but if you keep track of the column ordering it can be done with positional indices on any version.
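from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Assumption: TargetEncoder from scikit-learn >= 1.3
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder

# Input columns c1, ..., c10 sit at positions 0-9.
# Each step outputs its transformed columns first, then the passed-through remainder.
impute = ColumnTransformer(
    [("impute", SimpleImputer(), [3])],  # c4
    remainder="passthrough",
)
target_enc = ColumnTransformer(
    [("target_enc", TargetEncoder(), [1, 2, 3])],  # c4 is now first; then c1, c2, c3
    remainder="passthrough",
)
scale = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1, 2, 3, 4, 5, 6, 7])],  # c1-3 are first, then c4, then c5-8
    remainder="passthrough",
)
ohe = ColumnTransformer(
    [("ohe", OneHotEncoder(), [8, 9])],  # c9, c10
    remainder="passthrough",
)
# Step order matters: the indices above assume impute runs first, then target_enc
pipe = Pipeline([
    ("impute", impute),
    ("target_enc", target_enc),
    ("scale", scale),
    ("ohe", ohe),
])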
The columns will be output in the order: c9 dummies, c10 dummies, then c1 through c8. You could move the steps around to try to get the columns into a better order, but since OHE produces a number of columns that is not known a priori, it might be tough to get the column indices right.
Hybrid
This is the best option for this particular case: it is the cleanest and the most semantically correct.
Because your transformers operate in a hierarchical way, we can get away with all column specifications being strings. If we had to specify columns in a ColumnTransformer that comes after another sklearn step, its input would be an array and we would have to resort to index specification as above (unless pandas output is enabled).
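from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Assumption: TargetEncoder from scikit-learn >= 1.3
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder

# Inner step: encode c1-c3, impute c4, pass c5-c8 through untouched
step1 = ColumnTransformer(
    [
        ("target_enc", TargetEncoder(), ["c1", "c2", "c3"]),
        ("impute", SimpleImputer(), ["c4"]),
    ],
    remainder="passthrough",
)
# Then scale everything that came out of the inner step
num_pipe = Pipeline([
    ("prep", step1),
    ("scale", StandardScaler()),
])
preproc = ColumnTransformer([
    ("num", num_pipe, ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]),
    ("ohe", OneHotEncoder(), ["c9", "c10"]),
])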
The Hybrid approach's diagram has the text representation below; see the original Stack Overflow answer for the interactive version and for diagrams of the other two approaches.
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('prep',
                                                  ColumnTransformer(remainder='passthrough',
                                                                    transformers=[('target_enc',
                                                                                   TargetEncoder(),
                                                                                   ['c1',
                                                                                    'c2',
                                                                                    'c3']),
                                                                                  ('impute',
                                                                                   SimpleImputer(),
                                                                                   ['c4'])])),
                                                 ('scale', StandardScaler())]),
                                 ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7',
                                  'c8']),
                                ('ohe', OneHotEncoder(), ['c9', 'c10'])])
Answered By - Ben Reiniger