Issue
I'm experimenting with machine learning regressors, using the train.csv dataset from the following webpage: https://www.kaggle.com/c/rossmann-store-sales/data?select=train.csv
I was trying to train an SVR, but it was taking a long time to fit, and I realized the problem is probably that I haven't normalized the data.
I know it is common practice to normalize the columns, but I'm not sure which ones I should apply it to. Some variables are binary and some are continuous, and I feel it would be weird to normalize the binary ones. Is this correct?
The table columns are the following:
Open, Promo and SchoolHoliday are binary. StateHoliday can take values from 0 to 4. The others are integers (except Date, obviously).
Solution
Store, DayOfWeek, Open, Promo, StateHoliday, SchoolHoliday are categorical features. They can be encoded as one-hot vectors using OneHotEncoder.
Sales, Customers are numerical features and can be scaled, for example with StandardScaler, RobustScaler, etc.
See the scikit-learn preprocessing documentation for additional transformations.
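As a rough sketch of how this could be wired together (not necessarily how the answer's author would do it), the snippet below one-hot encodes the categorical columns and scales Customers with a ColumnTransformer, then fits an SVR. Several things here are assumptions: the data is a small synthetic stand-in that only reuses the column names from train.csv, Sales is treated as the regression target and scaled through TransformedTargetRegressor, the Date column is left out, and SVR runs with default hyperparameters.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for train.csv: made-up values, same column names (assumption).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Store": rng.integers(1, 6, n),
    "DayOfWeek": rng.integers(1, 8, n),
    "Open": rng.integers(0, 2, n),
    "Promo": rng.integers(0, 2, n),
    "StateHoliday": rng.integers(0, 5, n),
    "SchoolHoliday": rng.integers(0, 2, n),
    "Customers": rng.integers(0, 1500, n),
})
# Made-up relationship so the model has something to learn.
df["Sales"] = df["Customers"] * 8 + df["Promo"] * 500 + rng.normal(0, 100, n)

categorical = ["Store", "DayOfWeek", "Open", "Promo", "StateHoliday", "SchoolHoliday"]
numerical = ["Customers"]

# One-hot encode the categorical columns, standardize the numerical one.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numerical),
])

# Scale the target (Sales) as well; predictions are mapped back to the original units.
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, SVR()),
    transformer=StandardScaler(),
)

X = df[categorical + numerical]
y = df["Sales"]
model.fit(X, y)
predictions = model.predict(X.head())
```

The binary columns (Open, Promo, SchoolHoliday) go through OneHotEncoder here only to follow the answer; leaving them as 0/1 without scaling would be another reasonable choice.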
Answered By - Antoine Dubuis