Issue
When running a logistic regression, the coefficients I get using statsmodels are correct (I verified them against some course material). However, I am unable to get the same coefficients with sklearn. I've tried preprocessing the data to no avail. This is my code:
Statsmodels:
import statsmodels.api as sm
X_const = sm.add_constant(X)  # statsmodels does not add an intercept column by itself
model = sm.Logit(y, X_const)
results = model.fit()
print(results.summary())
The relevant output is:
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.2382 3.983 -0.060 0.952 -8.045 7.569
a 2.0349 0.837 2.430 0.015 0.393 3.676
b 0.8077 0.823 0.981 0.327 -0.806 2.421
c 1.4572 0.768 1.897 0.058 -0.049 2.963
d -0.0522 0.063 -0.828 0.407 -0.176 0.071
e_2 0.9157 1.082 0.846 0.397 -1.205 3.037
e_3 2.0080 1.052 1.909 0.056 -0.054 4.070
Scikit-learn (no preprocessing):
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()  # fit_intercept=True by default, so no constant column is needed
results = model.fit(X, y)
print(results.coef_)
print(results.intercept_)
The coefficients given are:
array([[ 1.29779008, 0.56524976, 0.97268593, -0.03762884, 0.33646097,
0.98020901]])
And the intercept/constant given is:
array([ 0.0949539])
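For reference, scikit-learn reports the coefficients in the column order of X, so, assuming X is a pandas DataFrame, they can be paired with their variable names like this:
# sketch: map each sklearn coefficient to its column name
for name, coef in zip(X.columns, results.coef_[0]):  # coef_ has shape (1, n_features) for binary y
    print(name, coef)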
As you can see, regardless of which coefficient corresponds to which variable, the numbers given by sklearn don't match the correct ones from statsmodels. What am I missing? Thanks in advance!
Solution
Thanks to a kind soul on Reddit, this was solved. To get the same coefficients, one has to effectively switch off the L2 regularisation that sklearn applies to logistic regression by default, by setting the inverse regularisation strength C to a very large value:
model = LogisticRegression(C=1e8)
Where C, according to the documentation, is:
C : float, default: 1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
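A note beyond the original answer: on newer scikit-learn versions the regularisation can be switched off explicitly instead of being approximated with a huge C. A minimal sketch, assuming scikit-learn >= 1.2 and the same X and y as above:
from sklearn.linear_model import LogisticRegression
# penalty=None disables regularisation entirely (versions 0.21-1.1 spelled this penalty='none')
model = LogisticRegression(penalty=None)
results = model.fit(X, y)
print(results.coef_)       # should now closely match the statsmodels coefficients
print(results.intercept_)  # and the statsmodels const term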
Answered By - lfo