Issue
I am trying to compare logistic regression between R's glm (stats package) and Python's scikit-learn. Here is my dataset.
Here is the Python code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("dataset.csv")
# one-hot encode var2, dropping the first level to match R's treatment coding
df = df.join(pd.get_dummies(df['var2'], prefix='var2', drop_first=True))
df.drop(columns=['var2'], inplace=True)
X = df.loc[:, df.columns != 'y']
y = df.y
# penalty='none' disables regularization (newer scikit-learn versions use penalty=None)
model = LogisticRegression(fit_intercept=True, penalty='none')
model.fit(X, y)
prob = model.predict_proba(X)
model.coef_
Here are the coefficients:
var1     -1.833653e-07
var3      2.823982e-12
var4      2.568188e-12
var2_B   -4.116901e-13
var2_C    5.514602e-14
And here is the corresponding R code:
library(readr)
df <- read_csv(file = "dataset.csv")
glm_fit <- glm(y ~ ., data = df, family = binomial(link = 'logit'))
summary(glm_fit)
Here are the coefficients:
(Intercept) -6.459e-01
var1 -1.042e-07
var2B -7.731e-01
var2C 1.880e+00
var3 -1.124e-04
var4 2.994e-03
It is easy to check that the matrix that goes into the solver is the same in both cases. As you can see, the coefficients are drastically different. Also, the ROC AUC in R comes out much better than in Python. I understand that different solvers are used, but the difference in the solutions seems too big. Is there a way to troubleshoot it?
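For reference, a minimal sketch of how the ROC AUC can be computed on the Python side, using the prob array from the code above (roc_auc_score from sklearn.metrics is assumed here, it is not part of my original code):
from sklearn.metrics import roc_auc_score
# AUC based on the predicted probability of the positive class
roc_auc_score(y, prob[:, 1])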
Solution
Indeed it seems to be a matter of the lbfgs solver (the default used by sklearn) failing to work well on unscaled input data. Scaling the inputs first and modifying the coefficients accordingly, I recover basically the same coefficients you reported from glm:
from sklearn.preprocessing import StandardScaler
# standardize the features so lbfgs can converge, then map the coefficients back
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
model.fit(X_sc, y)
model.coef_ / scaler.scale_  # coefficients on the original (unscaled) feature scale
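For completeness, the intercept can be mapped back to the original scale the same way; a small sketch that just undoes the standardization algebra (numpy is assumed for the sum):
import numpy as np
# slope on the original scale: beta_j = beta'_j / scale_j
coef_orig = model.coef_.ravel() / scaler.scale_
# intercept on the original scale: b = b' - sum(beta'_j * mean_j / scale_j)
intercept_orig = model.intercept_ - np.sum(model.coef_.ravel() * scaler.mean_ / scaler.scale_)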
The sag and saga solvers suffer the same fate, while newton-cg actually gets close and throws convergence warnings. Increasing the number of iterations just adds a warning about rounding errors preventing better convergence.
Answered By - Ben Reiniger