Issue
I have a data frame that looks similar to:
   salary  job title  Raiting  Company_Name  Location  Seniority
0  100     SE         5        apple         sf        vp
1  120     DS         4        Samsung       la        Jr
2  230     QA         5        google        sd        Sr
(My df has more categorical features than this.)
Usually, when predicting from a model it goes something like:
In [1]: model_name.predict(category_1, category_2, ...)
Out [1]: predicted_var
Whereas after you use pd.get_dummies, you end up with far more columns, depending on how many categorical features you have, which makes the method above impractical when trying to predict on new data. How do you reference the multiple columns instead of manually filling in 0s?
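For concreteness, here is a minimal sketch of how pd.get_dummies widens the frame (the column names come from the example above; the values are just the toy rows shown there):

import pandas as pd

# Toy frame mirroring the example above
df = pd.DataFrame({
    'salary': [100, 120, 230],
    'job title': ['SE', 'DS', 'QA'],
    'Raiting': [5, 4, 5],
    'Company_Name': ['apple', 'Samsung', 'google'],
    'Location': ['sf', 'la', 'sd'],
    'Seniority': ['vp', 'Jr', 'Sr'],
})

# One dummy column is created per category level, so the width grows quickly
dummies = pd.get_dummies(df, columns=['job title', 'Company_Name', 'Location', 'Seniority'])
print(dummies.shape)  # (3, 14) for this toy frame; far wider on real data

Filling those dummy columns in by hand for a single prediction quickly becomes unmanageable.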
Solution
Instead of using pd.get_dummies, I would recommend using sklearn's OneHotEncoder. Check this link for details on how to replace pd.get_dummies with proper data encoding methods.
This lets you call .fit_transform on your training data to get its one-hot encoded representation, and then call .transform on your test data to get the same encoding when predicting.
from sklearn.preprocessing import OneHotEncoder

# Fit learns the category levels present in each column of the training data
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

enc.categories_
# [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

# Encode new rows with the same fitted encoder; the unseen value 4 maps to
# all zeros because handle_unknown='ignore'
enc.transform([['Female', 1], ['Male', 4]]).toarray()
# array([[1., 0., 1., 0., 0.],
#        [0., 1., 0., 0., 0.]])
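To tie this back to the question's data frame, here is a minimal sketch of wrapping OneHotEncoder in a ColumnTransformer and Pipeline so that .predict accepts the raw categorical columns directly (the column names and the salary target come from the example above; the LinearRegression model and the toy rows are just assumptions for illustration):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training frame mirroring the question; real data would be larger
df = pd.DataFrame({
    'salary': [100, 120, 230],
    'job title': ['SE', 'DS', 'QA'],
    'Raiting': [5, 4, 5],
    'Company_Name': ['apple', 'Samsung', 'google'],
    'Location': ['sf', 'la', 'sd'],
    'Seniority': ['vp', 'Jr', 'Sr'],
})
X = df.drop(columns='salary')
y = df['salary']

categorical_cols = ['job title', 'Company_Name', 'Location', 'Seniority']

# The encoder, not you, keeps track of which dummy column each category maps to;
# 'Raiting' is passed through unchanged
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough',
)

model = Pipeline([('preprocess', preprocess), ('regressor', LinearRegression())])
model.fit(X, y)

# Predict straight from raw categories -- no manual 0/1 columns needed
new_row = pd.DataFrame([{'job title': 'SE', 'Raiting': 5, 'Company_Name': 'google',
                         'Location': 'sf', 'Seniority': 'Sr'}])
model.predict(new_row)

Because the fitted pipeline stores the encoding learned from the training data, a single .predict call on the raw columns replaces the Model_name.predict(category_1, category_2, ...) pattern from the question.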
Answered By - Akshay Sehgal