Issue
I have a dataframe df with categorical variables F1, F2, and F3 (3, 3, and 6 levels respectively), and run the following code:
from patsy.contrasts import Sum
import statsmodels.formula.api as smf

formula = 'SEL ~ C(F1, Sum) + C(F2, Sum) + C(F3, Sum)'
model = smf.logit(formula, data=df)
model_fit = model.fit()
1) What would be the equivalent of the above using sklearn?
2) What would be the equivalent using sklearn if "Sum" is dropped from the formula (i.e., patsy's default Treatment coding is used)?
Solution
You can start from a one-hot encoding, set the rows that correspond to the last level to -1, and drop the last level's column:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

def contrSum(DF, column):
    # Sum (deviation) coding: one-hot encode the column, mark the rows
    # belonging to the last level with -1, then drop the last level's column
    DF[column] = DF[column].astype('category')
    nlevels = len(DF[column].unique())
    dm = pd.get_dummies(DF[column], prefix=column, dtype=np.int64)
    dm.loc[dm[dm.columns[nlevels - 1]] == 1, dm.columns[:(nlevels - 1)]] = -1
    return dm.iloc[:, :(nlevels - 1)]
contrSum(df,'F1')
F1_1 F1_2
0 1 0
1 0 1
2 1 0
3 0 1
4 0 1
... ... ...
95 1 0
96 -1 -1
97 1 0
98 0 1
99 -1 -1
Now we apply this function to each of the three factor columns, concatenate the resulting matrices, and fit:
dmat = pd.concat([contrSum(df,'F1'),contrSum(df,'F2'),contrSum(df,'F3')],axis=1)
clf = LogisticRegression(fit_intercept=True).fit(dmat,df['SEL'])
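As an aside, since the question already uses patsy, you can also let patsy build the same sum-coded design matrix and pass it straight to sklearn (a minimal sketch, assuming the same df; the matrix includes an Intercept column, so sklearn's own intercept is turned off):
from patsy import dmatrix
from patsy.contrasts import Sum

# patsy builds the sum-coded design matrix, Intercept column included
X = dmatrix('C(F1, Sum) + C(F2, Sum) + C(F3, Sum)', data=df,
            return_type='dataframe')
clf_patsy = LogisticRegression(fit_intercept=False).fit(X, df['SEL'])
The plots and comparisons below continue with the hand-built dmat.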
Let's plot sklearn's fitted log-odds against the statsmodels linear predictor:
prob = clf.predict_proba(dmat)[:, 1]  # P(SEL = 1) from sklearn
plt.scatter(x=model_fit.fittedvalues, y=np.log(prob / (1 - prob)))
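The points should fall close to the identity line, but not exactly on it: sklearn's LogisticRegression applies L2 regularization by default (C=1.0), whereas statsmodels' logit is an unpenalized fit. A minimal sketch of effectively disabling the penalty by making C very large, which should bring the two models much closer:
# weak regularization (large C) approximates the unpenalized MLE
clf_unpen = LogisticRegression(fit_intercept=True, C=1e9).fit(dmat, df['SEL'])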
Now compare the coefficients:
pd.DataFrame({'sk_coef':clf.coef_[0],'smf_coef':model_fit.params[1:]})
sk_coef smf_coef
C(F1, Sum)[S.1] 0.007327 0.023707
C(F1, Sum)[S.2] -0.337868 -0.375865
C(F2, Sum)[S.1] -0.174720 -0.192799
C(F2, Sum)[S.2] 0.018365 0.031589
C(F3, Sum)[S.1] 0.197189 0.251827
C(F3, Sum)[S.2] 0.058658 0.045554
C(F3, Sum)[S.3] -0.103133 -0.148508
C(F3, Sum)[S.4] -0.209002 -0.265786
C(F3, Sum)[S.5] 0.238623 0.303353
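As for the second question: dropping Sum makes patsy fall back to its default Treatment (dummy) coding, which drops the first level of each factor. On the sklearn side that is plain one-hot encoding with the first level dropped, for example via pd.get_dummies (a sketch under that assumption; the variable names are illustrative):
# Treatment coding: one-hot encode and drop each factor's first level
dmat_treat = pd.concat(
    [pd.get_dummies(df[c], prefix=c, drop_first=True, dtype=np.int64)
     for c in ['F1', 'F2', 'F3']],
    axis=1)
clf_treat = LogisticRegression(fit_intercept=True).fit(dmat_treat, df['SEL'])
sklearn.preprocessing.OneHotEncoder(drop='first') would do the same job inside a Pipeline.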
Answered By - StupidWolf