Issue
I want to run lasso regression from sklearn on my data. All of the attributes in my dataframe are numeric (by numeric, I mean they are all integers), but some of them should clearly be categorical. For example, the 'race' attribute in my dataframe takes three values, 1, 2, and 3, where each value represents one race.
What I did was first cast those columns to string with astype('str'), and then convert them to the categorical dtype with astype('category'). Finally, I used sklearn.linear_model.Lasso on those transformed features.
My question is: can sklearn.linear_model.Lasso recognize that those variables are categorical? Or is one-hot encoding the only way to handle this kind of categorical data?
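A minimal sketch of the workflow described above (the 'age' column, the toy values, and the alpha value are made up purely for illustration):

```python
import pandas as pd
from sklearn.linear_model import Lasso

# Toy dataframe standing in for the real data: 'race' holds the integer
# codes 1/2/3, 'age' is a made-up numeric column.
df = pd.DataFrame({"race": [1, 2, 3, 1, 2], "age": [34, 51, 28, 45, 39]})
y = [1.0, 2.5, 0.8, 2.1, 1.7]

# The conversion described in the question: cast to string, then to the
# pandas categorical dtype.
df["race"] = df["race"].astype(str).astype("category")

# Lasso only sees numeric values; the categorical dtype carries no special
# meaning for it, and fitting on string categories would raise an error.
lasso = Lasso(alpha=0.1)
# lasso.fit(df, y)
```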
Solution
sklearn.linear_model.Lasso does not recognize the pandas categorical dtype; it treats every feature as a plain number. I would use one-hot encoding to split each categorical variable into separate columns. However, make sure you avoid the "dummy variable trap" by dropping the first column of each new set of dummy variables.
You want one-hot encoding so that the model does not treat any category of a categorical variable as numerically greater than the others. For example, say you have a column where 0 = Spain, 1 = Germany, and 2 = France. Left as integers, the model would treat the geography column as a continuous variable in which 2 (France) is greater than 1 (Germany). Instead, one-hot encoding turns it into separate columns of 0s and 1s.
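Here is a short sketch of that approach (column names other than 'race', the toy values, and the alpha are assumptions for illustration; OneHotEncoder's drop='first' removes one dummy column per category to avoid the dummy variable trap):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'race' is an integer-coded categorical, 'income' a true numeric.
df = pd.DataFrame({
    "race":   [1, 2, 3, 1, 2, 3, 1, 2],
    "income": [40, 55, 30, 52, 61, 33, 47, 58],
})
y = [1.0, 2.0, 0.5, 1.8, 2.4, 0.7, 1.5, 2.2]

# One-hot encode 'race'; drop='first' removes one dummy column per category.
# All other columns pass through to the model unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(drop="first"), ["race"])],
    remainder="passthrough",
)

model = Pipeline([("onehot", preprocess), ("lasso", Lasso(alpha=0.1))])
model.fit(df, y)
print(model.named_steps["lasso"].coef_)
```

Wrapping the encoder and the Lasso estimator in a Pipeline keeps the encoding consistent between fitting and later predictions.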
I hope that answers your question.
Answered By - Apie