Issue
I am training a linear model on the same data with two different packages.
However, the two fits produce hugely different variable coefficients.
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import statsmodels.api as sm

def test(x, y, model):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
    # scikit-learn linear regression
    regr = linear_model.LinearRegression()
    regr.fit(x_train, y_train)
    # statsmodels OLS on the same training split
    lr = sm.OLS(y_train, x_train).fit()
    print(lr.params)
    print(regr.coef_)
Above is the code I used. Surprisingly, the coefficients differ so much that the two models give completely different predictions.
Both models list the variables in the same order, so I am really confused now. Any idea what is going wrong? Thank you!
Solution
It seems like the issue here is how the intercept is handled by the two packages.
With statsmodels.regression.linear_model.OLS, an intercept is not included by default and needs to be added manually using sm.add_constant(X).
In contrast, sklearn.linear_model.LinearRegression includes an intercept by default (fit_intercept=True). In sklearn the intercept is stored separately from the coefficients and can be viewed with .intercept_.
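As a concrete illustration, here is a minimal sketch (assuming x_train and y_train from the question's code, with x_train as a pandas DataFrame so the parameter labels show up) of where each package puts the intercept:

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# statsmodels: the intercept only appears if you add a constant column yourself
lr = sm.OLS(y_train, sm.add_constant(x_train)).fit()
print(lr.params)        # now includes a 'const' entry for the intercept

# scikit-learn: the intercept is fitted by default but stored separately
regr = LinearRegression().fit(x_train, y_train)
print(regr.intercept_)  # intercept term
print(regr.coef_)       # slope coefficients only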
To ensure a fair comparison, both models should be configured to handle the intercept term in the same way. You can either add a constant to your statsmodels model:

X = sm.add_constant(X)

Or disable the intercept in your sklearn model:

LinearRegression(fit_intercept=False)
Make sure to adjust both models accordingly and then compare the coefficients again.
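For instance, a self-contained check along these lines (a sketch using synthetic data; the variable names and true coefficients are purely illustrative) should show the two packages agreeing once the intercept is handled identically:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add a constant column so statsmodels also fits an intercept
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.params)              # [intercept, coef_1, coef_2]

sk = LinearRegression().fit(X, y)
print(sk.intercept_, sk.coef_) # same values, stored separately

Both fits should recover approximately the same intercept (about 3.0) and coefficients (about 1.5 and -2.0).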
Answered By - DataJanitor