Issue
I'm working on the Titanic dataset as my first Kaggle project and I ran into this error. I kept searching for a solution here on Stack Overflow, but I still can't figure it out.
I made two pipelines to preprocess the numerical and categorical features:
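from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Numeric columns: fill NaN with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
# Categorical columns: fill NaN with the string 'missing', then one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())])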
and then I joined them into a ColumnTransformer:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numeric_features),
        ('cat', cat_pipeline, categorical_features)])
numeric_features and categorical_features are the lists of numerical and categorical feature names:
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']
Finally, in my final pipeline I add a classifier:
from sklearn.neighbors import KNeighborsClassifier
knn = Pipeline([
    ('Preprocessor', preprocessor),
    ('Classifier', KNeighborsClassifier())
])
knn.fit(X_train, y_train)
This is where I get "ValueError: Input contains NaN".
Solution
import pandas as pd
train = pd.read_csv('train.csv')
train.isna().sum()
Output:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
The columns Age, Cabin and Embarked contain NaN values. However, you do not include the Cabin column in numeric_features or categorical_features, so its values do not get imputed. This is why you get the error.
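As a minimal sketch of how to check this, you can list the columns that contain NaN and compare them against the two feature lists. The small DataFrame below is made up for illustration and only stands in for train.csv; nan_columns, covered and not_imputed are names introduced here, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (values are invented for illustration)
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Shortened feature lists, matching the toy columns above
numeric_features = ["Age", "Fare"]
categorical_features = ["Sex", "Embarked"]

# Columns that contain at least one NaN
nan_columns = train.columns[train.isna().any()].tolist()

# Columns with NaN that neither imputer is asked to handle
covered = set(numeric_features) | set(categorical_features)
not_imputed = [c for c in nan_columns if c not in covered]

print(nan_columns)
print(not_imputed)
```

Which columns actually reach the classifier also depends on the ColumnTransformer's remainder setting (the default, 'drop', discards columns that are not listed), so it is worth running the same isna() check on y_train as well.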
Answered By - Akash Haridas