Issue
Long time listener, first time caller...
I know a similar question has been answered in the past (see here for other thread I have referenced), but I am still having difficulties. How can I get my regression to fit? My code is below:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#data
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
#regression fitting
X_predict_input = np.linspace(0,10,100).reshape(-1,1)
y_train = y_train.reshape((-1,1))
X_train = X_train.reshape((-1,1))
#looping through different degree values
for i, degree in enumerate([1,3,6,9]):
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(X_train_poly, y_train)
    result[i,:] = linreg.predict(X_predict_input)
I tried to fix the shaping issues with X_train and y_train, but after looking into each shape, I am thinking that the X_train_poly is what is driving this error...
X_train shape: (11, 1)
y_train shape: (11, 1)
X_train_poly shape: (11, 10)
Respective error message:
ValueError: shapes (100,1) and (2,1) not aligned: 1 (dim 1) != 2 (dim 0)
When I try to address the shape inconsistencies in X_train_poly by the following...
X_train_poly = poly.fit_transform(X_train).reshape((-1,1))
...I receive this error:
ValueError: Found input variables with inconsistent numbers of samples: [22, 11]
I have spent an embarrassing amount of time on this, so any insight at all would be greatly appreciated!
Thank you in advance :)
Solution
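I think the problem is quite simple. You're using the PolynomialFeatures transform to generate features for the training data, but when it comes to prediction you're not applying the same transform to the input data. The model is fitted on the expanded polynomial columns, so whatever you pass to predict needs the same number of columns.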
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# data
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x.reshape((-1, 1)),
                                                    y.reshape((-1, 1)),
                                                    random_state=0)
# Check data matrices are in columns
assert(X_train.shape == (11, 1))
assert(y_train.shape == (11, 1))
# Build library of polynomial features
degree = 3
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
assert(X_train_poly.shape == (11, 4))
# Fit model
linreg = LinearRegression().fit(X_train_poly, y_train)
# Make prediction
X_predict = np.linspace(0, 10, 100).reshape(-1, 1)
X_predict_poly = poly.transform(X_predict)  # reuse the transform fitted on the training data
y_predict = linreg.predict(X_predict_poly)
assert(y_predict.shape == X_predict.shape)
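If it helps to see where the two errors in the question come from, here is a minimal sketch that only looks at shapes. It assumes degree 1 (the first pass of the loop) and uses an evenly spaced stand-in for the 11 training rows, since only the shapes matter here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Stand-in for the 11 training rows (an assumption; any (11, 1) array works)
X_train = np.linspace(0, 10, 11).reshape(-1, 1)
X_predict = np.linspace(0, 10, 100).reshape(-1, 1)

poly = PolynomialFeatures(degree=1)
X_train_poly = poly.fit_transform(X_train)

# Degree 1 gives two columns: the bias term and x itself
print(X_train_poly.shape)                    # (11, 2)

# Flattening those two columns into one doubles the number of rows,
# which is where "inconsistent numbers of samples: [22, 11]" comes from
print(X_train_poly.reshape(-1, 1).shape)     # (22, 1)

# The raw prediction grid has 1 column, but the model was fitted on 2,
# hence the "shapes (100,1) and (2,1) not aligned" error
print(X_predict.shape)                       # (100, 1)
print(poly.transform(X_predict).shape)       # (100, 2)
```

So X_train_poly having more than one column is expected rather than something to reshape away. The fix is on the prediction side.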
Update:
To avoid the inconvenience of having to apply the transform every time you make a prediction, you might want to check out sklearn.pipeline.Pipeline:
# Using a pipeline to automate the input transformation
from sklearn.pipeline import Pipeline
poly = PolynomialFeatures(degree)
model = LinearRegression()
pipeline = Pipeline(steps=[('t', poly), ('m', model)])
linreg = pipeline.fit(X_train, y_train)
y_predict2 = linreg.predict(X_predict)
assert(np.allclose(y_predict, y_predict2))  # tolerant comparison is safer for floats
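Coming back to the loop over degrees in the question, a pipeline keeps it short. This is only a sketch under a couple of assumptions: the question never defines result, so I'm guessing it is meant to be a (4, 100) array with one row of predictions per degree, and I've left y one-dimensional so that predict returns a flat array that fits into a row. make_pipeline is just a shorthand for building a Pipeline without naming the steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data as in the question
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10

# X must be a column; y stays 1-D so predictions come back with shape (100,)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, random_state=0)

X_predict = np.linspace(0, 10, 100).reshape(-1, 1)
degrees = [1, 3, 6, 9]

# Assumed layout: one row of 100 predictions per degree
result = np.zeros((len(degrees), len(X_predict)))

for i, degree in enumerate(degrees):
    # The pipeline expands the features in both fit and predict
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    result[i, :] = model.predict(X_predict)

print(result.shape)
```

Because the transform lives inside the pipeline, the raw X_predict can be passed straight to predict on every pass.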
Answered By - Bill