Issue
I am not an expert on logistic regression, but I thought that when solving it with lbfgs the solver performs an optimization, finding a local minimum of the objective function. Yet every time I run it using scikit-learn, it returns the same results, even when I feed it a different random state.
Below is code that reproduces my issue.
First, set up the problem by generating data:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import datasets
# generate data
X, y = datasets.make_classification(n_samples=1000,
                                    n_features=10,
                                    n_redundant=4,
                                    n_clusters_per_class=1,
                                    random_state=42)
# Set up the test/training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Second, train the model and inspect the results:
# Set up a different random state each time
rand_state = np.random.randint(1000)
print(rand_state)
model = LogisticRegression(max_iter=1000,
                           solver='lbfgs',
                           random_state=rand_state)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
conf_mat = metrics.confusion_matrix(y_test, y_pred)
print(y_pred[:20], "\n", conf_mat)
I get the same y_pred (and obviously the same confusion matrix) every time I run this, even though I'm using the lbfgs solver with a different random state each run. I'm confused, as I thought this was a stochastic solver traveling down a gradient into a local minimum.
Maybe I'm not properly randomizing the initial state? I haven't been able to figure it out from the documentation.
Discussion of Related Question
There is a related question, which I didn't find during my research:
Does logistic regression always find global optimum, assuming that the optimisation converges?
The answer there is that the cost function is convex, so if the numerical solution is well-behaved, it will find a global minimum. That is, there aren't a bunch of local minima that your optimization algorithm will get stuck in: it will reach the same (global) minimum each time (perhaps depending on the solver you choose?).
However, in the comments someone pointed out that, depending on which solver you choose, there are cases where you will not reach the same solution, and that this depends on the random_state parameter. At the very least, I think it would be helpful to resolve this.
Solution
First, let me put into the answer what got this closed as a duplicate earlier: a logistic regression problem (without perfect separation) has a global optimum, so there are no local optima to get stuck in with different random seeds. If the solver converges satisfactorily, it will converge to that global optimum. So the only time random_state can have any effect is when the solver fails to converge.
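To see this concretely, here is a minimal sketch (not from the original answer), reusing the make_classification data from the question. It uses saga, a solver that does consume random_state, and adds a StandardScaler purely so the solver converges comfortably within the iteration budget; once converged, the fitted coefficients agree across seeds up to the solver tolerance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_redundant=4,
                           n_clusters_per_class=1, random_state=42)
X = StandardScaler().fit_transform(X)  # scaling just helps saga converge quickly

# saga *does* use random_state for shuffling, but once it converges it
# lands on the same (global) optimum regardless of the seed.
coef_a = LogisticRegression(solver='saga', max_iter=5000, random_state=0).fit(X, y).coef_
coef_b = LogisticRegression(solver='saga', max_iter=5000, random_state=999).fit(X, y).coef_
print(np.max(np.abs(coef_a - coef_b)))  # tiny; on the order of the solver tolerance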
Now, the documentation for LogisticRegression's parameter random_state states:
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. [...]
So for your code, with solver='lbfgs', there is indeed no expected effect.
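A quick check of that claim (a small sketch on the same generated data as in the question; the seed values are arbitrary): because lbfgs never touches random_state, the fit is deterministic given the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_redundant=4,
                           n_clusters_per_class=1, random_state=42)

# lbfgs ignores random_state entirely, so the fitted coefficients are
# identical no matter which seed is passed.
coefs = [LogisticRegression(solver='lbfgs', max_iter=1000, random_state=rs).fit(X, y).coef_
         for rs in (0, 42, 999)]
print(all(np.array_equal(coefs[0], c) for c in coefs[1:]))  # True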
It's not too hard to make sag and saga fail to converge and, with different random_state values, end at different solutions; to make it easier, set max_iter=1. liblinear apparently does not use the random_state unless it is solving the dual problem, so also setting dual=True admits different solutions (a sketch of both cases follows). I found that thanks to this comment on a GitHub issue (the rest of the issue may be worth reading for more background).
Answered By - Ben Reiniger