Issue
We wrote the code below during my machine learning training.
My question is: why do we only transform X_test, without fitting on it, while we fit the scaler on X_train at the bottom of the code?
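import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

hit = pd.read_csv("./xxx/xxx.csv")
df = hit.copy()
df = df.dropna()
y = df["Salary"]
X_ = df.drop(["Salary", "League", "Division", "NewLeague"], axis=1).astype("float64")
# One-hot encode the categorical columns and keep one dummy column per feature
dms = pd.get_dummies(df[["League", "Division", "NewLeague"]])
X = pd.concat([X_, dms[["League_N", "Division_W", "NewLeague_N"]]], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)  # learn each column's mean and std from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics, no refit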
Solution
Because X_train is the reference for your training. If you also fit the scaler on the test data, information about the test set leaks into how your training data is transformed.
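To make the difference between fitting and transforming concrete, here is a minimal plain-Python sketch of what a standard scaler does. The helper names fit and transform and the numbers are made up for illustration; this is not scikit-learn's actual implementation, only the same arithmetic (mean and population standard deviation).

```python
from statistics import fmean, pstdev


def fit(values):
    """Learn the scaling parameters (mean, population std) from the data."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]


train = [10.0, 20.0, 30.0, 40.0]  # made-up training feature
test = [25.0, 50.0]               # made-up test feature

mean, std = fit(train)            # statistics come from the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # test data never influences mean/std
```

The test values are only passed through transform, so they are expressed relative to the training mean and spread without ever changing them.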
I like to think that I should never use the test data in any way except at the very end, to evaluate the trained model. So the test data shouldn't be involved in any fitting, whether of the scaler or of the model.
But don't worry: with a random split, X_train and X_test should follow roughly the same distribution, so scaling the test set with the statistics learned from the training set will work.
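One way to make this rule hard to break is to put the scaler and the model in a single scikit-learn Pipeline, so that both are fitted on the training data in one call. The sketch below is only an illustration: the synthetic data and the choice of LinearRegression are assumptions, since the question does not say which model is trained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption, not the CSV above)
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The pipeline fits the scaler and then the model, both on X_train only
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)

# At evaluation time the test data is only transformed, then predicted on
r2_test = pipe.score(X_test, y_test)
```

After fitting, pipe.named_steps["scaler"].mean_ holds the column means of X_train, which confirms the test set played no part in the scaling parameters.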
Answered By - SidoShiro92