Issue
I have a dataframe df with categorical variables F1, F2, and F3 (3, 3, and 6 levels respectively), and run the following code:
from patsy.contrasts import Sum
import statsmodels.formula.api as smf

formula = 'SEL ~ C(F1, Sum) + C(F2, Sum) + C(F3, Sum)'
model = smf.logit(formula, data=df)
model_fit = model.fit()
1) What would be the equivalent of the above using sklearn?
2) What would be the equivalent using sklearn if "Sum" is dropped from the formula (i.e., patsy's default Treatment coding is used)?
Solution
You can start from a one-hot encoding, set the rows that correspond to the last level to -1, and drop the last level's column:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

def contrSum(DF, column):
    # Sum (deviation) coding: one-hot encode the column, mark the rows
    # belonging to the last level with -1, then drop the last level's column
    DF[column] = DF[column].astype('category')
    nlevels = len(DF[column].unique())
    dm = pd.get_dummies(DF[column], prefix=column, dtype=np.int64)
    dm.loc[dm[dm.columns[nlevels - 1]] == 1, dm.columns[:(nlevels - 1)]] = -1
    return dm.iloc[:, :(nlevels - 1)]
contrSum(df,'F1')
F1_1 F1_2
0 1 0
1 0 1
2 1 0
3 0 1
4 0 1
... ... ...
95 1 0
96 -1 -1
97 1 0
98 0 1
99 -1 -1
Now we apply this function to each of the three factor columns, concatenate the resulting matrices, and fit:
dmat = pd.concat([contrSum(df,'F1'),contrSum(df,'F2'),contrSum(df,'F3')],axis=1)
clf = LogisticRegression(fit_intercept=True).fit(dmat,df['SEL'])
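As an aside, since the question already uses patsy, you can also let patsy build the same sum-coded design matrix and pass it straight to sklearn (a minimal sketch, assuming the same df; the matrix includes an Intercept column, so sklearn's own intercept is turned off):
from patsy import dmatrix
from patsy.contrasts import Sum

# patsy builds the sum-coded design matrix, Intercept column included
X = dmatrix('C(F1, Sum) + C(F2, Sum) + C(F3, Sum)', data=df,
            return_type='dataframe')
clf_patsy = LogisticRegression(fit_intercept=False).fit(X, df['SEL'])
The plots and comparisons below continue with the hand-built dmat.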
Let's plot sklearn's fitted log-odds against the statsmodels linear predictor:
prob = clf.predict_proba(dmat)[:, 1]  # P(SEL = 1) from sklearn
plt.scatter(x=model_fit.fittedvalues, y=np.log(prob / (1 - prob)))
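The points should fall close to the identity line, but not exactly on it: sklearn's LogisticRegression applies L2 regularization by default (C=1.0), whereas statsmodels' logit is an unpenalized fit. A minimal sketch of effectively disabling the penalty by making C very large, which should bring the two models much closer:
# weak regularization (large C) approximates the unpenalized MLE
clf_unpen = LogisticRegression(fit_intercept=True, C=1e9).fit(dmat, df['SEL'])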
Now compare the coefficients:
pd.DataFrame({'sk_coef':clf.coef_[0],'smf_coef':model_fit.params[1:]})
sk_coef smf_coef
C(F1, Sum)[S.1] 0.007327 0.023707
C(F1, Sum)[S.2] -0.337868 -0.375865
C(F2, Sum)[S.1] -0.174720 -0.192799
C(F2, Sum)[S.2] 0.018365 0.031589
C(F3, Sum)[S.1] 0.197189 0.251827
C(F3, Sum)[S.2] 0.058658 0.045554
C(F3, Sum)[S.3] -0.103133 -0.148508
C(F3, Sum)[S.4] -0.209002 -0.265786
C(F3, Sum)[S.5] 0.238623 0.303353
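As for the second question: dropping Sum makes patsy fall back to its default Treatment (dummy) coding, which drops the first level of each factor. On the sklearn side that is plain one-hot encoding with the first level dropped, for example via pd.get_dummies (a sketch under that assumption; the variable names are illustrative):
# Treatment coding: one-hot encode and drop each factor's first level
dmat_treat = pd.concat(
    [pd.get_dummies(df[c], prefix=c, drop_first=True, dtype=np.int64)
     for c in ['F1', 'F2', 'F3']],
    axis=1)
clf_treat = LogisticRegression(fit_intercept=True).fit(dmat_treat, df['SEL'])
sklearn.preprocessing.OneHotEncoder(drop='first') would do the same job inside a Pipeline.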
Answered By - StupidWolf