Issue
I'm trying to preprocess data with ColumnTransformer but with no Pipeline. Here's the code:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from category_encoders import OneHotEncoder, OrdinalEncoder

object_pre = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ordencoder", OrdinalEncoder(mapping=mapping)),
    ("onehot", OneHotEncoder(cols=ord_cols))
])
num_pre = SimpleImputer()
object_cols = [col for col in X.columns if X[col].dtype == "object"]
num_cols = list(set(X.columns) - set(object_cols))
preprocessor = ColumnTransformer(transformers=[
    ("num_pre", num_pre, num_cols),
    ("object_pre", object_pre, object_cols)
])
X_temp = pd.DataFrame(preprocessor.fit_transform(X))
However, when I run it, I get the following error on the last line:
KeyError: 'Education'
I am convinced that the problem has to do with the mapping variable, which is created with the following code:
mapping = [
    {
        "col": "Education",
        "mapping": {
            "Not Graduate": 0,
            "Graduate": 1
        }
    },
    {
        "col": "Dependents",
        "mapping": {
            "0": 0,
            "1": 1,
            "2": 2,
            "3+": 3,
        }
    },
]
ord_cols = ["Gender"]
for i in list(set(X.columns) - set(ord_cols)):
    mapping.append({
        "col": i,
        "mapping": {
            "No": 0,
            "Yes": 1
        }
    })
Can you please tell me what I'm doing wrong? Thanks in advance ;)
Solution
The first step of your pipeline, SimpleImputer, transforms the data into a numpy array, so column names aren't available to the mapping in the second step, the OrdinalEncoder from the category_encoders package. That OrdinalEncoder has a handle_missing parameter that accepts return_nan as one of its options, so I think you can swap the order of the first two steps and have the same effect.
(The sklearn version of OrdinalEncoder passes missing values along, starting in v1.0, so you could maybe revert to that, but then you'd have the array categories instead of the dict mapping, so you'd lose feature name capabilities again.)
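To see the cause in isolation, here is a minimal sketch using only pandas and sklearn. The toy rows are made up for illustration; only the column names Education and Dependents come from the question. It shows that SimpleImputer, with sklearn's default output settings, hands back a bare numpy array, and that the labels can be re-attached by hand if a later step needs them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only the column names are taken from the question.
df = pd.DataFrame(
    {
        "Education": ["Graduate", np.nan, "Not Graduate", "Graduate"],
        "Dependents": ["0", "3+", np.nan, "0"],
    },
    dtype=object,
)

imputed = SimpleImputer(strategy="most_frequent").fit_transform(df)

# With default settings the result is a plain numpy array,
# so there is no "Education" label left to look up.
print(type(imputed).__name__, hasattr(imputed, "columns"))

# Re-attaching the labels by hand gives a DataFrame again.
restored = pd.DataFrame(imputed, columns=df.columns, index=df.index)
print(list(restored.columns))
```

Any later pipeline step that looks up columns by name receives the unlabeled array rather than the DataFrame, which is consistent with the KeyError: 'Education' in the question.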
Answered By - Ben Reiniger