Issue
I am trying to train this random forest classifier to check that my preprocessing works. I think I made a mistake separating my training data and labels, since the error message mentions 'price', but I do not know exactly what is wrong.
Code:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
def diamond_preprocess(data_dir):
    data = pd.read_csv(data_dir)
    cleaned_data = data.drop(['id', 'depth_percent'], axis=1)  # Features I don't want
    x = cleaned_data.drop(['price'], axis=1)  # Train data
    y = cleaned_data['price']  # Label data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    numerical_features = cleaned_data.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = cleaned_data.select_dtypes(include=['object']).columns
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),  # Fill in missing data with median
        ('scaler', StandardScaler())  # Scale data
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Fill in missing data with 'missing'
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One hot encode categorical data
    ])
    preprocessor_pipeline = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    rf = Pipeline(steps=[('preprocessor', preprocessor_pipeline),
                         ('classifier', RandomForestClassifier())])
    rf.fit(x_train, y_train)
cleaned_data.columns: Index(['carat', 'cut', 'color', 'clarity', 'table', 'price', 'length', 'width', 'depth'], dtype='object')
Error:
File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'price'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\utils\__init__.py", line 396, in _get_column_indices
col_idx = all_columns.get_loc(col)
File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'price'
The above exception was the direct cause of the following exception:
ValueError: A given column is not a column of the dataframe
It seems to be complaining that I am feeding x_train (which excludes 'price', since that is my label) into a preprocessing pipeline that includes the 'price' feature. This shouldn't be a problem because my labels are all 'price' integers and need to be preprocessed, right? Do I need a separate transformer just for labels?
Solution
You are building your ColumnTransformer from the columns of the cleaned_data DataFrame instead of the columns of x_train. cleaned_data still contains 'price', so the transformer looks for a column that x_train does not have.
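To see the mismatch directly, here is a minimal sketch on a made-up three-row frame. The values are hypothetical and only the column names follow the question; it lists the columns that the numeric selection asks for but that x does not contain.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror cleaned_data in the question.
cleaned_data = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1],
    "cut": ["Ideal", "Good", "Premium"],
    "price": pd.Series([500, 2100, 5400], dtype="int64"),
})
x = cleaned_data.drop(["price"], axis=1)

# Selecting numeric columns from cleaned_data still picks up the label column.
numerical_features = cleaned_data.select_dtypes(include=["int64", "float64"]).columns

# Columns the transformer would request that x does not have.
missing = [col for col in numerical_features if col not in x.columns]
print(missing)
```

Any name that shows up in this list is a column the ColumnTransformer will fail to find when it is fitted on x_train.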
You can either compute your categorical and numerical features from x_train as follows:
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = x_train.select_dtypes(include=['object']).columns
Or, even better, use sklearn.compose.make_column_selector to perform the selection as follows:
from sklearn.compose import make_column_selector
preprocessor_pipeline = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, make_column_selector(dtype_exclude=object)),
        ('cat', categorical_transformer, make_column_selector(dtype_include=object))
    ])
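For reference, here is a self-contained sketch of the whole pipeline using make_column_selector. The toy data is invented for illustration (it is not the diamonds dataset), and n_estimators and random_state are set only to keep the run small and repeatable. Because the selectors are evaluated against whatever frame is passed to fit, they only ever see the columns of x_train.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the values are made up for illustration only.
data = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1, 0.5, 0.9, 1.3, 0.4, 0.8, 1.0, 0.6],
    # Explicit object dtype so the column matches dtype_include=object.
    "cut": pd.Series(["Ideal", "Good", "Premium", "Ideal", "Good",
                      "Premium", "Ideal", "Good", "Premium", "Ideal"], dtype=object),
    "table": [55.0, 57.0, 59.0, 56.0, 58.0, 60.0, 55.0, 57.0, 59.0, 56.0],
    "price": [500, 2100, 5400, 500, 2100, 5400, 500, 2100, 5400, 500],
})

x = data.drop(["price"], axis=1)  # features only
y = data["price"]                 # labels
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The selectors are resolved when fit is called, against the columns of x_train.
preprocessor_pipeline = ColumnTransformer(transformers=[
    ("num", numerical_transformer, make_column_selector(dtype_exclude=object)),
    ("cat", categorical_transformer, make_column_selector(dtype_include=object)),
])

rf = Pipeline(steps=[
    ("preprocessor", preprocessor_pipeline),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
rf.fit(x_train, y_train)
predictions = rf.predict(x_test)
print(len(predictions))
```

The label column never enters the ColumnTransformer here: y_train is passed straight to the classifier by the outer Pipeline.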
Answered By - Antoine Dubuis