Issue
I'm developing an SVR model for ~100 continuous features and a continuous label.
For scaling the data, I wrote:
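import pandas as pd
from sklearn.preprocessing import StandardScaler
# Read in
df = pd.read_csv(data_path, sep='\t')
features = df.iloc[:, 1:-1]  # 100 features
target = df.iloc[:, -1]  # The label
names = df.iloc[:, 0]  # First column (row identifiers)
# Scale features
scaler = StandardScaler()
scaled_df = scaler.fit_transform(features)
# Restore the column names (fit_transform returns a NumPy array)
scaled_df = pd.DataFrame(scaled_df, columns=features.columns, index=features.index)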
So now I have a scaled data frame, and my next step was to split into train and test, and then develop a model (SVR):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
model = SVR()
...and then I fit the model to the data.
But I noticed that other people don't fit StandardScaler() to the whole data frame; instead, they split the data frame into train and test first, and then apply StandardScaler() to each separately.
Is there a difference between applying StandardScaler to the whole data frame and applying it to train and test separately?
Solution
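Yes, there is a difference. Split the data first, fit StandardScaler on the training set only, and then use that same fitted scaler to transform both the training set and the test set. This prevents the distribution of the test set from leaking into the model. If you fit the scaler on the full dataset before splitting, the mean and standard deviation it learns include the test rows, so test set information shapes the scaled training data that the model is trained on, and the test score becomes an overly optimistic estimate of performance on unseen data. Note that the test set should not get its own separately fitted scaler: it is transformed with the statistics learned from the training set.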
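As a rough illustration of the split-then-scale order, here is a minimal sketch. The data is synthetic (the array shapes, `random_state` values, and the variable names `test_score` and `pipe_score` are assumptions for the example, not part of the original question), and the `make_pipeline` variant is shown only as one common way to get the same behavior with less room for mistakes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption): 200 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(size=200)

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse the same fitted mean/std to transform the test rows.
X_test_scaled = scaler.transform(X_test)

model = SVR()
model.fit(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)  # R^2 on held-out data

# Same steps as a pipeline: fit() fits the scaler on X_train only,
# and score() applies that fitted scaler to X_test before predicting.
pipe = make_pipeline(StandardScaler(), SVR())
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)
```

The pipeline form is also convenient with cross-validation, since the scaler is then refit on the training portion of each fold rather than on all rows.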
Answered By - Danylo Baibak