Issue
The title is clear; below is code that fully reproduces my example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
np.random.seed(1)
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = pd.read_csv(url, na_values='?').dropna()
X,y = data['horsepower'], data.mpg
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)
lm = LinearRegression()
for i in range(1,6):
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train.values.reshape(-1,1))
    model = lm.fit(X_train_poly, y_train)
    scores_poly = cross_val_score(model, X_test.values.reshape(-1,1), y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)
    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')
Output:
Degree-1 polynomial, MSE: 25.214173196098535
Degree-2 polynomial, MSE: 25.214173196098535
Degree-3 polynomial, MSE: 25.214173196098535
Degree-4 polynomial, MSE: 25.214173196098535
Degree-5 polynomial, MSE: 25.214173196098535
I would expect to get a different MSE for each polynomial degree.
----------EDIT-----------
I added the line X_test_poly = poly.fit_transform(X_test) to the above for loop and changed the scores_poly line to use X_test_poly instead of X_test. Now I get different values for the MSE.
(Below is the reformulated, fully reproducible code.)
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
np.random.seed(1)
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = pd.read_csv(url, na_values='?').dropna()
X,y = data['horsepower'].values.reshape(-1,1), data.mpg.values.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)
lm = LinearRegression()
for i in range(1,6):
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.fit_transform(X_test)
    model = lm.fit(X_train_poly, y_train)
    scores_poly = cross_val_score(model, X_test_poly, y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)
    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')
Output:
Degree-1 polynomial, MSE: 25.214173196098535
Degree-2 polynomial, MSE: 18.678068281482577
Degree-3 polynomial, MSE: 18.85993623145088
Degree-4 polynomial, MSE: 19.085567664927655
Degree-5 polynomial, MSE: 18.79973878463155
I guess this is the correct approach, but I wonder why. Could anyone clarify?
Solution
Every time cross_val_score runs a fold, it clones the estimator you pass in and refits the clone on the data you provided, so the earlier fit model = lm.fit(X_train_poly, y_train) is discarded. In your first code, even though you fitted the model on the polynomial features, every iteration inside cross_val_score refits a plain linear model on the raw X_test. Because X_test contains only the degree-one feature, every iteration scores the same degree-one linear model, no matter the value of i.
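You can verify that the pre-fitted model is never reused: cross_val_score clones the estimator before each fit, and the clone starts out unfitted. A minimal sketch (the toy arrays are hypothetical):
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

X_demo = np.arange(10, dtype=float).reshape(-1, 1)  # hypothetical data
y_demo = 3 * X_demo.ravel() + 1

fitted = LinearRegression().fit(X_demo, y_demo)
fresh = clone(fitted)  # what cross_val_score does internally before refitting

print(hasattr(fitted, 'coef_'))  # True  -- the original keeps its coefficients
print(hasattr(fresh, 'coef_'))   # False -- the clone must be fitted from scratch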
You can actually see what PolynomialFeatures does (see also the documentation):
X,y = data[['horsepower']], data['mpg']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)
poly = PolynomialFeatures(degree=3)
poly.fit_transform(X_train)[:10]
array([[1.000000e+00, 1.700000e+02, 2.890000e+04, 4.913000e+06],
[1.000000e+00, 5.400000e+01, 2.916000e+03, 1.574640e+05],
[1.000000e+00, 6.500000e+01, 4.225000e+03, 2.746250e+05],
[1.000000e+00, 1.100000e+02, 1.210000e+04, 1.331000e+06],
[1.000000e+00, 8.800000e+01, 7.744000e+03, 6.814720e+05],
[1.000000e+00, 1.500000e+02, 2.250000e+04, 3.375000e+06],
[1.000000e+00, 1.980000e+02, 3.920400e+04, 7.762392e+06],
[1.000000e+00, 8.800000e+01, 7.744000e+03, 6.814720e+05],
[1.000000e+00, 1.400000e+02, 1.960000e+04, 2.744000e+06],
[1.000000e+00, 1.600000e+02, 2.560000e+04, 4.096000e+06]])
The first column is the intercept term (all ones), the second column is X_train itself (horsepower), followed by the square and then the cube of horsepower. LinearRegression() fits this matrix directly; it does not do any transformation of the input for you.
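For example, on two hypothetical inputs the column structure [1, x, x**2, x**3] is easy to verify by eye:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])  # hypothetical values
print(PolynomialFeatures(degree=3).fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]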
So note that you most likely want to fit the linear model with LinearRegression(fit_intercept=False), since the intercept column is already included in the polynomial transformation.
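There are two equivalent ways to avoid the duplicated intercept; here is a sketch on hypothetical toy data generated from y = 2*x**2 + 5:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(1, 10, 20).reshape(-1, 1)  # hypothetical data
y = 2 * x.ravel()**2 + 5

# Option 1: keep the bias column, disable the model's own intercept
m1 = LinearRegression(fit_intercept=False).fit(
    PolynomialFeatures(degree=2).fit_transform(x), y)

# Option 2: drop the bias column, let the model fit the intercept itself
m2 = LinearRegression().fit(
    PolynomialFeatures(degree=2, include_bias=False).fit_transform(x), y)

print(m1.coef_)                 # approximately [5., 0., 2.]
print(m2.intercept_, m2.coef_)  # approximately 5.0 and [0., 2.]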
What you should consider doing is using a Pipeline, as shown in the sklearn documentation:
from sklearn.pipeline import Pipeline
for i in range(1,6):
    model = Pipeline([('poly', PolynomialFeatures(degree=i)),
                      ('linear', LinearRegression(fit_intercept=False))])
    scores_poly = cross_val_score(model, X_test, y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)
    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')
This gives:
Degree-1 polynomial, MSE: 22.2130277556231
Degree-2 polynomial, MSE: 17.01069056186582
Degree-3 polynomial, MSE: 17.283994438261868
Degree-4 polynomial, MSE: 17.401530563158406
Degree-5 polynomial, MSE: 15.934755858114803
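Because the Pipeline bundles the transformation with the estimator, each cross-validation fold expands the features internally, so the polynomial degree always matches the fitted model and you never transform X by hand. A hypothetical usage sketch, reusing the X_train/X_test split defined above:
model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X_train, y_train)    # expands X_train to [1, x, x**2], then fits
preds = model.predict(X_test)  # applies the same expansion to X_test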
Answered By - StupidWolf