Issue
My training and test data end up with different features, and I do not know how to handle these differences and the missing values when making predictions. Here is my code:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
train_size=0.8, test_size=0.2,
random_state=0)
### One-Hot Encoding
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
from sklearn.preprocessing import OneHotEncoder
# Make copy to avoid changing original data (when imputing)
X_train_new = X_train.copy()
X_valid_new = X_valid.copy()
# Apply one-hot encoder to low cardinality cols
# (note: in scikit-learn >= 1.2 the argument is named sparse_output instead of sparse)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train_new[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid_new[low_cardinality_cols]))
# put back the index lost during One-hot encoding
OH_cols_train.index = X_train_new.index
OH_cols_valid.index = X_valid_new.index
# Remove all categorical columns, which we will replace with the one-hot encoded ones
# (we drop object_cols rather than just low_cardinality_cols because we also
# want to remove the high-cardinality columns)
num_X_train = X_train_new.drop(object_cols, axis=1)
num_X_valid = X_valid_new.drop(object_cols, axis=1)
# Add one-hot encoded cols to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
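For reference, handle_unknown='ignore' is why the validation transform above succeeds even when the validation split contains categories the encoder never saw during fit: unseen categories are encoded as all-zero rows instead of raising an error. A tiny sketch with made-up category names (not the competition's):

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two hypothetical categories
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['Gable'], ['Hip']])

# A category seen during fit gets a normal one-hot row
known = enc.transform([['Hip']]).toarray()
# An unseen category is silently encoded as all zeros instead of raising
unknown = enc.transform([['Mansard']]).toarray()
```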
Up to this point everything works. But when I try to make predictions on the test data, it fails. Here is my code:
# Make copy to avoid changing original data (when imputing)
X_train_new = X_train.copy()
X_valid_new = X_valid.copy()
X_test_new = X_test.copy()
# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X_test_new.columns if X_test_new[col].isnull().any()]
X_test_new.drop(cols_with_missing, axis=1, inplace=True)
# Categorical columns in the test data
new_object_cols = [col for col in X_test_new.columns if X_test_new[col].dtype == "object"]
# Columns that will be one-hot encoded
new_low_cardinality_cols = [col for col in new_object_cols if X_test_new[col].nunique() < 10]
# Columns that will be dropped from the dataset
new_high_cardinality_cols = list(set(new_object_cols)-set(new_low_cardinality_cols))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test_new[new_low_cardinality_cols]))
Here is my error:
ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 24 features and the X has 19 features.
Solution
One cause of the shape mismatch between your train and test data is that you're deriving the column lists (columns with missing values, low-cardinality categorical columns) after you do the train/test split, and then again separately for the test set. Most likely some columns and categories end up in only one of the splits, so the feature counts no longer match.
I'd move this train/test split line:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
to the end of your pre-processing steps to avoid this issue, so that the train and test data are pre-processed in exactly the same way.
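Concretely, "the exact same way" means the column lists and the fitted encoder must come from the training data only and then be reused on the test set, rather than recomputed from the test set. A minimal sketch with toy data and made-up column names (not the competition's):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames standing in for X_train / X_test
X_train = pd.DataFrame({'Roof': ['Gable', 'Hip', 'Gable'], 'Area': [100, 120, 90]})
X_test = pd.DataFrame({'Roof': ['Hip', 'Mansard'], 'Area': [110, 95]})

# Derive the categorical column list ONCE, from the training data only
object_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

# Fit the encoder on the training data only
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[object_cols])

# Reuse the SAME column list for the test set; feature counts now match,
# and any category unseen in training is encoded as all zeros
OH_train = pd.DataFrame(enc.transform(X_train[object_cols]).toarray(), index=X_train.index)
OH_test = pd.DataFrame(enc.transform(X_test[object_cols]).toarray(), index=X_test.index)
```

Because both transforms use the training-derived object_cols and the same fitted encoder, OH_train and OH_test always have the same number of columns, which is what the ValueError was complaining about.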
Answered By - TC Arlen