Issue
I am trying to compare logistic regression between R's glm (stats package) and Python's scikit-learn. Here is my dataset.
Here is the Python code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("dataset.csv")
# one-hot encode var2, dropping the first level to match R's treatment coding
df = df.join(pd.get_dummies(df['var2'], prefix='var2', drop_first=True))
df.drop(columns=['var2'], inplace=True)
X = df.loc[:, df.columns != 'y']
y = df.y
# penalty='none' disables regularization (newer scikit-learn versions use penalty=None)
model = LogisticRegression(fit_intercept=True, penalty='none')
model.fit(X, y)
prob = model.predict_proba(X)
model.coef_
Here are the coefficients:
var1     -1.833653e-07
var3      2.823982e-12
var4      2.568188e-12
var2_B   -4.116901e-13
var2_C    5.514602e-14
And here is the corresponding R code:
library(readr)
df <- read_csv(file = "dataset.csv")
glm_fit <- glm(y ~ ., data = df, family = binomial(link = 'logit'))
summary(glm_fit)
Here are the coefficients:
(Intercept) -6.459e-01
var1 -1.042e-07
var2B -7.731e-01
var2C 1.880e+00
var3 -1.124e-04
var4 2.994e-03
It is easy to check that the matrix that goes into the solver is the same in both cases. As you can see, the coefficients are drastically different. Also, the ROC AUC in R comes out much better than in Python. I understand that different solvers are used, but the difference in the solutions seems too big. Is there a way to troubleshoot it?
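For reference, a minimal sketch of how the ROC AUC can be computed on the Python side, using the prob array from the code above (roc_auc_score from sklearn.metrics is assumed here, it is not part of my original code):
from sklearn.metrics import roc_auc_score
# AUC based on the predicted probability of the positive class
roc_auc_score(y, prob[:, 1])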
Solution
Indeed it seems to be a matter of the lbfgs solver (the default used by sklearn) failing to work well on unscaled input data. Scaling the inputs first and modifying the coefficients accordingly, I recover basically the same coefficients you reported from glm:
from sklearn.preprocessing import StandardScaler
# standardize the features so lbfgs can converge, then map the coefficients back
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
model.fit(X_sc, y)
model.coef_ / scaler.scale_  # coefficients on the original (unscaled) feature scale
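For completeness, the intercept can be mapped back to the original scale the same way; a small sketch that just undoes the standardization algebra (numpy is assumed for the sum):
import numpy as np
# slope on the original scale: beta_j = beta'_j / scale_j
coef_orig = model.coef_.ravel() / scaler.scale_
# intercept on the original scale: b = b' - sum(beta'_j * mean_j / scale_j)
intercept_orig = model.intercept_ - np.sum(model.coef_.ravel() * scaler.mean_ / scaler.scale_)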
The sag and saga solvers suffer the same fate, while newton-cg actually gets close and throws convergence warnings. Increasing the number of iterations just adds a warning about rounding errors preventing better convergence.
Answered By - Ben Reiniger