Issue
Python beginner here...
Trying to understand how to use OneHotEncoder from the sklearn.preprocessing library. I feel pretty confident in using it in combination with fit_transform so that the results can also be fit to the test dataframe. Where I get confused is what to do with the resulting encoded array. Do you then convert the ohe results back to a dataframe and append it to the existing train/test dataframe?
The ohe method seems a lot more cumbersome than the pd.get_dummies method, but from my understanding using ohe with fit_transform makes it easier to apply the same transformation to the test data.
Searched for hours and having a lot of trouble trying to find a good answer for this.
Example with the widely used Titanic dataset:
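from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
imp = SimpleImputer()
ct = make_column_transformer(
    (imp, ['Age']),
    (ohe, ['Sex', 'Embarked']),
    remainder='passthrough')
ct.fit_transform(train)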
Result:
array([[22. , 0. , 1. , ..., 1. ,
0. , 7.25 ],
[38. , 1. , 0. , ..., 1. ,
0. , 71.2833 ],
[26. , 1. , 0. , ..., 0. ,
0. , 7.925 ],
...,
[29.69911765, 1. , 0. , ..., 1. ,
2. , 23.45 ],
[26. , 0. , 1. , ..., 0. ,
0. , 30. ],
[32. , 0. , 1. , ..., 0. ,
0. , 7.75 ]])
Do you pass the resulting array directly into a variable, for example X and y for train_test_split to run the final models off of? Or is there a way to convert the result back to a dataframe with column labels for further EDA?
Solution
Your intuition is correct: pandas.get_dummies() is a lot easier to use, but the advantage of using OHE is that it will always apply the same transformation to unseen data. You can also export the fitted instance using pickle or joblib and load it in other scripts.
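As a rough sketch of the export step, the snippet below round-trips a fitted encoder through pickle and reuses it without refitting. The toy DataFrame and its values are made up for illustration, and handle_unknown="ignore" is my own addition so that unseen categories do not raise an error. joblib.dump and joblib.load can be used in much the same way when writing to a file.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data standing in for the Titanic columns (illustrative values only)
train = pd.DataFrame({"Sex": ["male", "female", "female"],
                      "Embarked": ["S", "C", "Q"]})

# Fit once on the training data
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Serialize the fitted encoder to bytes (write these to a file to share across scripts)
blob = pickle.dumps(ohe)

# Later, e.g. in another script: restore the encoder and transform new rows without refitting
restored = pickle.loads(blob)
new_rows = pd.DataFrame({"Sex": ["male"], "Embarked": ["Q"]})
encoded = restored.transform(new_rows).toarray()
```

The restored encoder carries the categories learned from the training data, so the new row is encoded into the same five columns.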
There may be a way to directly reattach the encoded columns back to the original pandas.DataFrame. Personally, I go about it the long way. That is, I fit the encoder, transform the data, attach the output back to the DataFrame and drop the original columns.
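from sklearn.preprocessing import OneHotEncoder

# Columns to encode
cols = ['Sex', 'Embarked']
# Initialize encoder
ohe = OneHotEncoder()
# Fit to data
ohe.fit(df[cols])
# Declare encoded data as new columns in `df`.
# get_feature_names_out is a method and must be called (scikit-learn >= 1.0);
# transform returns a sparse matrix by default, so convert it to a dense array.
new_cols = list(ohe.get_feature_names_out(cols))
df[new_cols] = ohe.transform(df[cols]).toarray()
# Drop unencoded columns
df.drop(cols, axis=1, inplace=True)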
Finally, I noticed you said:
I feel pretty confident in using it in combination with fit_transform so that the results can also be fit to the test dataframe.
I would like to point out that you should not fit the encoder again! Rather, you should use ohe.transform(X_test[cols]) when dealing with new data. Do not use fit_transform() again or the results may vary from one dataset to another.
Answered By - Arturo Sbr