Issue
The title is clear; below is code that fully reproduces my example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
np.random.seed(1)
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = pd.read_csv(url, na_values='?').dropna()
X,y = data['horsepower'], data.mpg
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)
lm = LinearRegression()
for i in range(1,6):
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train.values.reshape(-1,1))
    model = lm.fit(X_train_poly, y_train)
    scores_poly = cross_val_score(model, X_test.values.reshape(-1,1), y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)
    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')
Output:
Degree-1 polynomial, MSE: 25.214173196098535
Degree-2 polynomial, MSE: 25.214173196098535
Degree-3 polynomial, MSE: 25.214173196098535
Degree-4 polynomial, MSE: 25.214173196098535
Degree-5 polynomial, MSE: 25.214173196098535
I would expect to get a different MSE for each polynomial degree.
----------EDIT-----------
I added the line X_test_poly = poly.fit_transform(X_test) to the above for loop and changed the scores_poly line to use X_test_poly instead of X_test. Now I get different values for the MSE.
(Below is the reformulated, fully reproducible code.)
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
np.random.seed(1)
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = pd.read_csv(url, na_values='?').dropna()
X,y = data['horsepower'].values.reshape(-1,1), data.mpg.values.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)
lm = LinearRegression()
for i in range(1,6):
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.fit_transform(X_test)
    model = lm.fit(X_train_poly, y_train)
    scores_poly = cross_val_score(model, X_test_poly, y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)
    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')
Output:
Degree-1 polynomial, MSE: 25.214173196098535
Degree-2 polynomial, MSE: 18.678068281482577
Degree-3 polynomial, MSE: 18.85993623145088
Degree-4 polynomial, MSE: 19.085567664927655
Degree-5 polynomial, MSE: 18.79973878463155
I guess this is the correct approach, but I wonder why. Could anyone clarify?
Solution
Every time cross_val_score runs a fold, it clones the estimator you pass in and refits the clone on the data you provided, so the earlier fit model = lm.fit(X_train_poly, y_train) is discarded. In your first code, even though you fitted the model on the polynomial features, every iteration inside cross_val_score refits a plain linear model on the raw X_test. Because X_test contains only the degree-one feature, every iteration scores the same degree-one linear model, no matter the value of i.
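You can verify that the pre-fitted model is never reused: cross_val_score clones the estimator before each fit, and the clone starts out unfitted. A minimal sketch (the toy arrays are hypothetical):
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

X_demo = np.arange(10, dtype=float).reshape(-1, 1)  # hypothetical data
y_demo = 3 * X_demo.ravel() + 1

fitted = LinearRegression().fit(X_demo, y_demo)
fresh = clone(fitted)  # what cross_val_score does internally before refitting

print(hasattr(fitted, 'coef_'))  # True  -- the original keeps its coefficients
print(hasattr(fresh, 'coef_'))   # False -- the clone must be fitted from scratch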
You can actually see what PolynomialFeatures does (see also the documentation):
X,y = data[['horsepower']], data['mpg']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)
poly = PolynomialFeatures(degree=3)
poly.fit_transform(X_train)[:10]
array([[1.000000e+00, 1.700000e+02, 2.890000e+04, 4.913000e+06],
[1.000000e+00, 5.400000e+01, 2.916000e+03, 1.574640e+05],
[1.000000e+00, 6.500000e+01, 4.225000e+03, 2.746250e+05],
[1.000000e+00, 1.100000e+02, 1.210000e+04, 1.331000e+06],
[1.000000e+00, 8.800000e+01, 7.744000e+03, 6.814720e+05],
[1.000000e+00, 1.500000e+02, 2.250000e+04, 3.375000e+06],
[1.000000e+00, 1.980000e+02, 3.920400e+04, 7.762392e+06],
[1.000000e+00, 8.800000e+01, 7.744000e+03, 6.814720e+05],
[1.000000e+00, 1.400000e+02, 1.960000e+04, 2.744000e+06],
[1.000000e+00, 1.600000e+02, 2.560000e+04, 4.096000e+06]])
The first column is the intercept term (all ones), the second column is X_train itself (horsepower), followed by the square and then the cube of horsepower. LinearRegression() fits this matrix directly; it does not do any transformation of the input for you.
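For example, on two hypothetical inputs the column structure [1, x, x**2, x**3] is easy to verify by eye:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])  # hypothetical values
print(PolynomialFeatures(degree=3).fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]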
So note that you most likely want to fit the linear model with LinearRegression(fit_intercept=False), since the intercept column is already included in the polynomial transformation.
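There are two equivalent ways to avoid the duplicated intercept; here is a sketch on hypothetical toy data generated from y = 2*x**2 + 5:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(1, 10, 20).reshape(-1, 1)  # hypothetical data
y = 2 * x.ravel()**2 + 5

# Option 1: keep the bias column, disable the model's own intercept
m1 = LinearRegression(fit_intercept=False).fit(
    PolynomialFeatures(degree=2).fit_transform(x), y)

# Option 2: drop the bias column, let the model fit the intercept itself
m2 = LinearRegression().fit(
    PolynomialFeatures(degree=2, include_bias=False).fit_transform(x), y)

print(m1.coef_)                 # approximately [5., 0., 2.]
print(m2.intercept_, m2.coef_)  # approximately 5.0 and [0., 2.]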
What you should consider doing is using a Pipeline, as shown in the sklearn documentation:
from sklearn.pipeline import Pipeline
for i in range(1,6):
    model = Pipeline([('poly', PolynomialFeatures(degree=i)),
                      ('linear', LinearRegression(fit_intercept=False))])
    scores_poly = cross_val_score(model, X_test, y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)
    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')
This gives:
Degree-1 polynomial, MSE: 22.2130277556231
Degree-2 polynomial, MSE: 17.01069056186582
Degree-3 polynomial, MSE: 17.283994438261868
Degree-4 polynomial, MSE: 17.401530563158406
Degree-5 polynomial, MSE: 15.934755858114803
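Because the Pipeline bundles the transformation with the estimator, each cross-validation fold expands the features internally, so the polynomial degree always matches the fitted model and you never transform X by hand. A hypothetical usage sketch, reusing the X_train/X_test split defined above:
model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X_train, y_train)    # expands X_train to [1, x, x**2], then fits
preds = model.predict(X_test)  # applies the same expansion to X_test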
Answered By - StupidWolf