Issue
I just got started on Kaggle, and for my first project I am working on the Titanic dataset.
I ran the following code:
ndf = pd.concat([pd.get_dummies(df[["Pclass", "SibSp", "Parch", "Sex"]]), (df[["Age", "Fare"]])],axis=1)
However, the output I get is:
     Pclass  SibSp  Parch  Sex_female  Sex_male   Age     Fare
0         3      1      0           0         1  22.0   7.2500
1         1      1      0           1         0  38.0  71.2833
2         3      0      0           1         0  26.0   7.9250
3         1      1      0           1         0  35.0  53.1000
4         3      0      0           0         1  35.0   8.0500
..      ...    ...    ...         ...       ...   ...      ...
886       2      0      0           0         1  27.0  13.0000
887       1      0      0           1         0  19.0  30.0000
888       3      1      2           1         0   NaN  23.4500
889       1      0      0           0         1  26.0  30.0000
890       3      0      0           0         1  32.0   7.7500
The Pclass, SibSp and Parch variables were not converted to one-hot encoded vectors, although the Sex attribute was.
I don't understand why, because when I run pd.get_dummies() on the Pclass variable alone, the result is fine:
     1  2  3
0    0  0  1
1    1  0  0
2    0  0  1
3    1  0  0
4    0  0  1
..  .. .. ..
886  0  1  0
887  1  0  0
888  0  0  1
889  1  0  0
890  0  0  1
The column names have become just "1", "2" and "3", though, which is not ideal either.
But how can I fix the problem? I want all the features to be converted to one-hot encoded vectors.
Solution
Use OneHotEncoder from sklearn:
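import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Small example frame; every column is treated as categorical.
df = pd.DataFrame({'Pclass': [0, 1, 2], 'SibSp': [3, 1, 0],
                   'Parch': [0, 2, 2], 'Sex': [0, 1, 1]})
ohe = OneHotEncoder()
# fit_transform returns a sparse matrix with one column per (feature, value) pair.
data = ohe.fit_transform(df[['Pclass', 'SibSp', 'Parch', 'Sex']])
# Densify and label the columns with the names generated by the encoder.
df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)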
Output:
>>> df
   Pclass  SibSp  Parch  Sex
0       0      3      0    0
1       1      1      2    1
2       2      0      2    1
>>> df1
   Pclass_0  Pclass_1  Pclass_2  SibSp_0  SibSp_1  SibSp_3  Parch_0  Parch_2  Sex_0  Sex_1
0         1         0         0        0        0        1        1        0      1      0
1         0         1         0        0        1        0        0        1      0      1
2         0         0         1        1        0        0        0        1      0      1
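If you would rather stay with pandas, here is a minimal sketch along the lines of the code in the question. When pd.get_dummies is given a DataFrame, by default it only encodes columns with object, string, or category dtype. That is why the integer columns Pclass, SibSp and Parch passed through unchanged while Sex was encoded. Listing the columns explicitly in the columns= argument should encode them as well, and it prefixes each new column with the original name. The toy data below is made up for illustration and only stands in for the real Titanic file.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns from the question.
df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

cat_cols = ["Pclass", "SibSp", "Parch", "Sex"]

# Default behaviour: only the non-numeric column (Sex) is encoded.
default = pd.get_dummies(df[cat_cols])

# With columns=, every listed column is encoded, numeric or not.
encoded = pd.get_dummies(df[cat_cols], columns=cat_cols, dtype=int)

# Re-attach the numeric features that should stay as they are.
ndf = pd.concat([encoded, df[["Age", "Fare"]]], axis=1)
print(ndf.columns.tolist())
```

One difference worth keeping in mind: OneHotEncoder remembers the categories it saw during fit, so it can be applied consistently to a separate test set, whereas get_dummies encodes whatever values happen to be in the frame it is given.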
Answered By - Corralien