Issue
I am not an expert on logistic regression, but I thought that when solving it with lbfgs the solver performs an optimization, finding a local minimum of the objective function. Yet every time I run it using scikit-learn, it returns the same results, even when I feed it a different random state.
Below is code that reproduces my issue.
First, set up the problem by generating data:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import datasets
# generate data
X, y = datasets.make_classification(n_samples=1000,
                                    n_features=10,
                                    n_redundant=4,
                                    n_clusters_per_class=1,
                                    random_state=42)
# Set up the test/training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Second, train the model and inspect the results:
# Set up a different random state each time
rand_state = np.random.randint(1000)
print(rand_state)
model = LogisticRegression(max_iter=1000,
                           solver='lbfgs',
                           random_state=rand_state)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
conf_mat = metrics.confusion_matrix(y_test, y_pred)
print(y_pred[:20], "\n", conf_mat)
I get the same y_pred (and obviously the same confusion matrix) every time I run this, even though I'm using the lbfgs solver with a different random state each run. I'm confused, as I thought this was a stochastic solver traveling down a gradient into a local minimum.
Maybe I'm not properly randomizing the initial state? I haven't been able to figure it out from the documentation.
Discussion of Related Question
There is a related question, which I didn't find during my research:
Does logistic regression always find global optimum, assuming that the optimisation converges?
The answer there is that the cost function is convex, so if the numerical solution is well-behaved, it will find a global minimum. That is, there aren't a bunch of local minima that your optimization algorithm will get stuck in: it will reach the same (global) minimum each time (perhaps depending on the solver you choose?).
However, in the comments someone pointed out that, depending on which solver you choose, there are cases where you will not reach the same solution, and that this depends on the random_state parameter. At the very least, I think it would be helpful to resolve this.
Solution
First, let me put into the answer what got this closed as a duplicate earlier: a logistic regression problem (without perfect separation) has a global optimum, so there are no local optima to get stuck in with different random seeds. If the solver converges satisfactorily, it will converge to that global optimum. So the only time random_state can have any effect is when the solver fails to converge.
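To see this concretely, here is a minimal sketch (not from the original answer), reusing the make_classification data from the question. It uses saga, a solver that does consume random_state, and adds a StandardScaler purely so the solver converges comfortably within the iteration budget; once converged, the fitted coefficients agree across seeds up to the solver tolerance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_redundant=4,
                           n_clusters_per_class=1, random_state=42)
X = StandardScaler().fit_transform(X)  # scaling just helps saga converge quickly

# saga *does* use random_state for shuffling, but once it converges it
# lands on the same (global) optimum regardless of the seed.
coef_a = LogisticRegression(solver='saga', max_iter=5000, random_state=0).fit(X, y).coef_
coef_b = LogisticRegression(solver='saga', max_iter=5000, random_state=999).fit(X, y).coef_
print(np.max(np.abs(coef_a - coef_b)))  # tiny; on the order of the solver tolerance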
Now, the documentation for LogisticRegression's parameter random_state states:
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. [...]
So for your code, with solver='lbfgs', there is indeed no expected effect.
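A quick check of that claim (a small sketch on the same generated data as in the question; the seed values are arbitrary): because lbfgs never touches random_state, the fit is deterministic given the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_redundant=4,
                           n_clusters_per_class=1, random_state=42)

# lbfgs ignores random_state entirely, so the fitted coefficients are
# identical no matter which seed is passed.
coefs = [LogisticRegression(solver='lbfgs', max_iter=1000, random_state=rs).fit(X, y).coef_
         for rs in (0, 42, 999)]
print(all(np.array_equal(coefs[0], c) for c in coefs[1:]))  # True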
It's not too hard to make sag and saga fail to converge and, with different random_state values, end at different solutions; to make it easier, set max_iter=1. liblinear apparently does not use the random_state unless it is solving the dual problem, so also setting dual=True admits different solutions (a sketch of both cases follows). I found that thanks to this comment on a GitHub issue (the rest of the issue may be worth reading for more background).
Answered By - Ben Reiniger