Issue
def preprocessing(X_train):
    cat_cols = []
    num_cols = []
    for cols in X_train.columns:
        if X_train[cols].nunique() < 10 and X_train[cols].dtype == "object":
            cat_cols.append(cols)
        elif X_train[cols].dtype in ["int64", "float64"]:
            num_cols.append(cols)
    full_cols = cat_cols + num_cols
    num_transformer = SimpleImputer(strategy="constant")
    cat_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols)
        ])
    return preprocessor.fit_transform(X_train)
The code creates a preprocessing function for transforming the data. When I trained the model I got 226 features, but when I tried to transform the test dataset for prediction, I only got 217.
The error message: Feature shape mismatch, expected: 226, got 217
The dataset I am using: https://www.kaggle.com/competitions/home-data-for-ml-course
I want to know why this happens and how to solve it.
Solution
You should fit the preprocessor on the training data and only transform the test data. Your preprocessing() function calls fit_transform every time, so running it on the test set refits everything from scratch. If you refit on the test data, the preprocessor will learn a different mapping. The shape mismatch is actually a lucky error: otherwise the code would run silently while the model received a completely scrambled representation.
This is especially important with things like one-hot encoding. Imagine that in your training data feature 1 takes the values ["cat", "dog", "duck"], so cat=>(1,0,0), dog=>(0,1,0), duck=>(0,0,1). If the test data only contains ["cat", "duck"], refitting gives cat=>(1,0), duck=>(0,1), so you get a shape mismatch and a duck becomes sort of a dog!
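One way to apply this is to build the preprocessor from the training data, keep the fitted object, and call only transform on the test data. The following is a minimal sketch, not the asker's exact pipeline: the helper name build_preprocessor and the toy frames are hypothetical stand-ins for the real Kaggle data, and the column-selection rule is copied from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(X_train):
    """Pick columns using the training data only; return an UNFITTED transformer."""
    cat_cols = [c for c in X_train.columns
                if X_train[c].dtype == "object" and X_train[c].nunique() < 10]
    num_cols = [c for c in X_train.columns
                if X_train[c].dtype in ["int64", "float64"]]
    cat_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), num_cols),
        ("cat", cat_transformer, cat_cols),
    ])


# Toy data (hypothetical): the test set is missing the "dog" category.
X_train = pd.DataFrame({
    "animal": pd.Series(["cat", "dog", "duck"], dtype="object"),
    "weight": [4.0, 20.0, 1.5],
})
X_test = pd.DataFrame({
    "animal": pd.Series(["cat", "duck"], dtype="object"),
    "weight": [5.0, 1.0],
})

preprocessor = build_preprocessor(X_train)
X_train_t = preprocessor.fit_transform(X_train)  # learn the mapping once, on train
X_test_t = preprocessor.transform(X_test)        # reuse the same mapping, no refit

print(X_train_t.shape, X_test_t.shape)
```

Both outputs have the same number of columns because the one-hot categories and the column lists come from the training data alone. With the real dataset, the same fitted preprocessor object would be reused for every later call to transform.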
Answered By - lejlot