Issue
I updated the SciPy version in a project from 1.8.1 to 1.9.0 while keeping the NumPy version at 1.24.4. However, this change broke the project. The code snippet below reproduces the issue, which involves fitting a beta-binomial distribution from SciPy.
from collections import Counter
import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution
def func_nll(free_params, *args):
"""Negative log likelihood used in the optimizer to fit the best probability distribution available.
More information on the optimizer used in
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html
Arguments:
free_params: probability distribution parameters under tuning. Depending on the probability
distribution to fit, there can be one or more
*args: fixed arguments: the distribution object, the data, and the shift (loc) parameter
Returns:
Negative log likelihood value used in the optimizer
"""
dist, x, shift = args
# negative log-likelihood function
negativeLogLikelihood = -dist.logpmf(x, loc=shift, *free_params).sum()
if np.isnan(negativeLogLikelihood): # occurs when x is outside of support
negativeLogLikelihood = np.inf # we don't want that
return negativeLogLikelihood
def fit(x, shift):
bounds = [(0, n+1), (1, 700), (1, 700)]
dist = stats.betabinom
opt_res = differential_evolution(func_nll, bounds, maxiter=100, tol=0.001, seed=45,
args=(dist, x, shift))
return opt_res
x = [28, 67, 69, 69, 71, 69, 55, 65, 68, 67, 67, 68, 69, 60, 40]
n = max(x)
mode_value = Counter(x).most_common(1)[0][1]
n_zero = 4288 # high number of zeroes
is_zero_inflated = (n_zero >= mode_value)
shift_loc = 0
if is_zero_inflated:
shift_loc = 1
b_param = fit(x, shift=shift_loc)
print(b_param)
With scipy==1.8.1, the above code snippet prints:
fun: 49.59657327021416
message: 'Optimization terminated successfully.'
nfev: 2802
nit: 61
success: True
x: array([70.00205357, 6.28931939, 1.0005225 ])
With scipy==1.9.0, the same snippet prints:
fun: inf
message: 'Maximum number of iterations has been exceeded.'
nfev: 9129
nit: 100
success: False
x: array([ 45.83055307, 3.17350763, 596.7724981 ])
I'm wondering if there was a breaking change in the implementation of the beta-binomial distribution in Scipy 1.9.x that could explain this issue.
Solution
I'm wondering if there was a breaking change in the implementation of the beta-binomial distribution in Scipy 1.9.x that could explain this issue.
Perhaps. Check the following in SciPy 1.8.x:
stats.betabinom(10.5, 2, 2).pmf([1, 2])
I get NaNs in recent versions, which is a bug fix. SciPy's beta-binomial distribution is defined only for integer n. If it produced real values for non-integer n before, that was probably an unintentional (and not mathematically sound) interpolation of some sort by the underlying library. I believe something like that was fixed in the timeframe of interest.
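To make that check concrete, here is a small runnable sketch; the NaN behaviour for non-integer n is what recent releases produce, and the exact version where it changed is an assumption on my part:

```python
import numpy as np
from scipy import stats

# Integer n: the pmf is well-defined and returns finite probabilities.
ok = stats.betabinom(10, 2, 2).pmf([1, 2])
print(np.isfinite(ok).all())

# Non-integer n: recent SciPy releases return NaN, since the
# beta-binomial distribution is only defined for integer n.
bad = stats.betabinom(10.5, 2, 2).pmf([1, 2])
print(np.isnan(bad).all())
```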
In any case, it could have been a change in betabinom, differential_evolution, or even NumPy. For instance, NumPy doesn't guarantee bitwise reproducibility of its random number streams between major versions (e.g. 1.25 to 1.26), so SciPy can't guarantee exact reproducibility of stochastic algorithms like differential_evolution between major versions. (That said, SciPy also doesn't guarantee exact reproducibility in light of bug fixes and enhancements. There are guards against major regressions and backward-incompatible API changes, but no guarantee that a complex algorithm like differential_evolution will always return precisely the same solution or even succeed in exactly the same space of problems.)
But scipy.stats.fit (new in 1.9.0) solves your problem. It uses differential_evolution to numerically minimize the negative log-likelihood, like you are, but it only passes integer n as a parameter, and it uses a penalized LLF that is robust when NaNs or infs arise.
from collections import Counter
import numpy as np
from scipy import stats
x = [28, 67, 69, 69, 71, 69, 55, 65, 68, 67, 67, 68, 69, 60, 40]
n = max(x)
mode_value = Counter(x).most_common(1)[0][1]
n_zero = 4288 # high number of zeroes
is_zero_inflated = (n_zero >= mode_value)
shift_loc = 0
if is_zero_inflated:
shift_loc = 1
# new code begins here
bounds = dict(n=(0, n+1), a=(1, 700), b=(1, 700), loc=(shift_loc, shift_loc))
res = stats.fit(stats.betabinom, x, bounds=bounds)
# params: FitParams(n=70.0, a=6.282699443065473, b=1.0, loc=1.0)
# success: True
# message: 'Optimization terminated successfully.'
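The returned FitResult exposes the fitted parameters as a FitParams namedtuple, which unpacks straight into a frozen distribution. A minimal sketch (the data and bounds repeat the snippet above; exact fitted values may vary slightly between runs):

```python
import numpy as np
from scipy import stats

x = [28, 67, 69, 69, 71, 69, 55, 65, 68, 67, 67, 68, 69, 60, 40]
bounds = dict(n=(0, max(x) + 1), a=(1, 700), b=(1, 700), loc=(1, 1))
res = stats.fit(stats.betabinom, x, bounds=bounds)

# res.params is a FitParams namedtuple (n, a, b, loc);
# unpack it positionally into a frozen distribution.
fitted = stats.betabinom(*res.params)
print(res.params)
print(fitted.pmf(69))  # mass the fitted model assigns to an observed value
```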
Update based on request in comments: or, to minimize changes to the original code snippet, just call differential_evolution with the integrality parameter.
opt_res = differential_evolution(func_nll, bounds, integrality=[True, False, False], args=(dist, x, shift))
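Putting that together, a self-contained sketch of the original fit with integrality (keeping the original seed; exact numbers may still differ across SciPy/NumPy versions):

```python
import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution

def func_nll(free_params, *args):
    """Negative log-likelihood; inf when a sample falls outside the support."""
    dist, x, shift = args
    nll = -dist.logpmf(x, *free_params, loc=shift).sum()
    return np.inf if np.isnan(nll) else nll

x = [28, 67, 69, 69, 71, 69, 55, 65, 68, 67, 67, 68, 69, 60, 40]
n = max(x)
bounds = [(0, n + 1), (1, 700), (1, 700)]
# integrality[0]=True restricts the first parameter (n) to integer values,
# so betabinom.logpmf never sees a non-integer n.
opt_res = differential_evolution(func_nll, bounds, maxiter=100, tol=0.001,
                                 seed=45, integrality=[True, False, False],
                                 args=(stats.betabinom, x, 1))
print(opt_res.x, opt_res.fun)
```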
Answered By - Matt Haberland