Issue
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
What I know is that the fit() method calculates the mean and standard deviation of each feature, and the transform() method then uses them to turn the feature into a new scaled feature. fit_transform() is nothing but calling fit() and transform() in a single line.
But why are we calling fit() only on the training data and not on the testing data?
Does that mean we are using the mean and standard deviation of the training data to transform our testing data?
Solution
fit computes the mean and stdev to be used for later scaling. Note that it is just a computation; no scaling is done.
transform uses the previously computed mean and stdev to scale the data (it subtracts the mean from every value and then divides by the stdev).
fit_transform does both at the same time, so you can do it with just one line of code.
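As a minimal sketch of the three methods described above (the tiny array X is made up purely for illustration), the following shows that fit only stores statistics, that transform applies them, and that fit_transform gives the same result in one call. Note that StandardScaler uses the population standard deviation (ddof=0), which it exposes as scale_.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, four samples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

sc = StandardScaler()
sc.fit(X)             # only computes and stores statistics; X is not changed
mean = sc.mean_       # per-feature mean learned by fit
scale = sc.scale_     # per-feature standard deviation (ddof=0) learned by fit

X_scaled = sc.transform(X)       # applies (X - mean) / scale
manual = (X - mean) / scale      # the same computation done by hand

# fit followed by transform, in a single call on a fresh scaler
X_one_step = StandardScaler().fit_transform(X)
```

Here X_scaled, manual, and X_one_step should all hold the same values.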
For the X_train dataset, we call fit_transform because we need to compute the mean and stdev and then use them to scale X_train. For the X_test dataset, since we already have the mean and stdev, we only do the transformation part.
Edit: X_test should be treated as totally unseen and unknown data (i.e., no info is extracted from it), so we can only derive info from X_train. The reason we apply the mean and stdev derived from X_train to X_test as well is to put both sets on the same scale, so that y_test and y_pred can be compared "apple-to-apple".
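To make this concrete, here is a small sketch with made-up numbers (the arrays are assumptions for illustration, not data from the question). The test set is scaled with the training mean and stdev, so the same raw value maps to the same scaled value in both sets, and the scaled test set is not forced to have mean 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the raw value 20.0 appears in both sets.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0], [40.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean/stdev from X_train only
X_test_scaled = sc.transform(X_test)        # reuses the X_train statistics

# What transform does to X_test, written out by hand with X_train's statistics
expected_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

# The raw value 20.0 gets the same scaled value in both sets
same_value_train = X_train_scaled[1, 0]
same_value_test = X_test_scaled[0, 0]
```

If the scaler were refit on X_test instead, the value 20.0 would be scaled differently in the two sets, and the model would see inconsistent inputs.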
By the way, if the train/test data is split properly without bias and the data is sufficiently large, both datasets will have roughly the same mean and stdev, each close to the population values.
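A rough way to check this claim is sketched below, assuming synthetic normally distributed data and a random 80/20 split (the distribution parameters, sample size, and seeds are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10,000 samples of one feature, drawn from N(50, 10^2).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(10_000, 1))

# Random, unbiased 80/20 split
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Compare the statistics of the two parts
mean_gap = abs(X_train.mean() - X_test.mean())
std_gap = abs(X_train.std() - X_test.std())
print(f"train mean={X_train.mean():.2f}, test mean={X_test.mean():.2f}")
print(f"train std={X_train.std():.2f}, test std={X_test.std():.2f}")
```

With a split like this the gaps should be small relative to the stdev of the data, but they are never exactly zero, which is one more reason to reuse the statistics fitted on X_train rather than refit on X_test.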
Answered By - perpetual student