Issue
I am trying to fit a curve to a histogram, but the resulting curve is flat even though the histogram was not. How can I fit the curve correctly?
My current code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import pandas as pd
import scipy.optimize as optimization

x = np.random.normal(1e-10, 1e-7, size=10000)

def func(x, a, b, c):
    return a * (np.exp(-b*(x-c)**2))

bins = 25
logbins = np.logspace(np.log10(1.0E-10), np.log10(1E-07), bins)
bin_heights, bin_borders, _ = plt.hist(x, bins=logbins, edgecolor='black', color='b')
bin_centers = bin_borders[:-1] + np.diff(bin_borders) / 2

x0 = np.array([0.0, 0.0, 0.0])
popt, cov = optimization.curve_fit(func, bin_centers, bin_heights, x0)
a, b, c = popt
popt, pcov = curve_fit(func, bin_centers, bin_heights, p0=[a, b, c])

x_interval_for_fit = np.linspace(bin_borders[0], bin_borders[-1], 1000)
plt.plot(x_interval_for_fit, func(x_interval_for_fit, *popt), label='Fit', color='r')
plt.xscale('log')
plt.show()
The result:
Solution
You are getting a bad result because the function you are fitting doesn't match the shape of the histogram at all. Using a simple second-order polynomial instead, the results are a lot better (though you might say still not ideal):
def func(x, a, b, c):
    return a * x**2 + b * x + c  # Simple 2nd-order polynomial
Using it with your code (you can remove the two optimisation steps and call curve_fit only once), I got the following result:
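A minimal sketch of that single fitting pass, reusing x, bin_centers, bin_heights and bin_borders from your code (with func now being the polynomial above), could look like this:
# One curve_fit call is enough for the polynomial; p0 defaults to all ones
popt, pcov = curve_fit(func, bin_centers, bin_heights)
x_fit = np.linspace(bin_borders[0], bin_borders[-1], 1000)
plt.plot(x_fit, func(x_fit, *popt), color='r', label='Polynomial fit')
plt.xscale('log')
plt.legend()
plt.show()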
One thing that may have been unintentional in your code: although you created a normal distribution, you binned it in a surprising way. You only keep one side of the distribution (the bins start at 1e-10, which is where your distribution is centred), and because the bin size grows logarithmically to the right, the larger bins collect more points. You are also ignoring more than half of your points, namely those smaller than 1e-10 (check hist's documentation). You can verify this with np.sum(bin_heights): the total count is less than half of len(x).
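A quick sketch to check this, reusing x and the log-binned bin_heights from your code:
# Total number of samples that actually landed in the log-spaced bins
print(np.sum(bin_heights), "of", len(x), "samples fall inside the bins")
# Samples below the first bin edge (1e-10) are silently dropped by hist
print(np.sum(x < 1e-10), "samples are smaller than the first bin edge")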
If all of the above was unintentional and you really wanted to recover the gaussian that generated the random numbers (but with counts instead of density), you should fit it directly to your unmodified histogram. One problem, though: because the standard deviation is very small, the fit is badly conditioned (the Jacobian has widely varying magnitudes depending on the parameter direction). You can get a feeling for how hard it is for the optimiser to see the influence of each parameter by computing the error manually and checking how it changes (or doesn't) as you vary the parameters. Since you are doing a least-squares optimisation, your cost function looks like this:
f = lambda a, b, c: sum((bin_heights - func(bin_centers, a, b, c))**2)
If you play with this function a bit around the zero initial guess, to see what the optimiser sees, you'll notice why convergence is difficult:
f(0, 0, 0) # 8_407_566
f(0, 0, 1) # 8_407_566, i.e. no change
f(0, 1, 0) # 8_407_566, i.e. no change
f(1, 0, 0) # 8_387_591, clearly an improvement
This is not a mathematically rigorous argument (that comes from the ill-conditioning of the Jacobian when a and b are close to 0), but it gives a good intuition of what is going wrong.
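To make that intuition slightly more concrete, here is a rough finite-difference sketch of the model's sensitivity to each parameter at the zero initial guess (it assumes the original gaussian func and the bin_centers of the unmodified histogram; the step size eps is an arbitrary choice):
eps = 1e-6
p0 = np.zeros(3)
for i, name in enumerate(['a', 'b', 'c']):
    p = p0.copy()
    p[i] += eps
    # Largest change in the model caused by perturbing this parameter
    sens = np.abs(func(bin_centers, *p) - func(bin_centers, *p0)).max() / eps
    print(name, sens)
# The sensitivities for b and c come out as exactly zero when a = 0,
# so the optimiser gets no signal from those parameters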
Thus, you need a good first guess. Of course, since you know you are trying to fit a gaussian, you could derive the initial parameters from the data itself: the maximum bin height for a, something like the inverse of the square of the standard deviation for b, and the mean for c. The code below uses a set of "less tuned" initial parameters that also work:
def func(x, a, b, c):  # Original function
    return a * np.exp(-b * (x - c)**2)

bins = 25  # Let hist do its magic
bin_heights, bin_borders, _ = plt.hist(x, bins=bins, edgecolor='black', color='b')
bin_centers = bin_borders[:-1] + np.diff(bin_borders) / 2

x0 = np.array([600.0, 1.0e7, 0.0])
popt, cov = curve_fit(func, bin_centers, bin_heights, x0)

x_interval_for_fit = np.linspace(bin_borders[0], bin_borders[-1], 1000)
plt.plot(x_interval_for_fit, func(x_interval_for_fit, *x0), label='Initial guess', color='y')
plt.plot(x_interval_for_fit, func(x_interval_for_fit, *popt), label='Fit', color='r')
plt.legend()
plt.show()
And the result, with the curve created by the initial guess (in yellow) and the fitted curve (in red):
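If you prefer the data-driven initial guess mentioned above, a sketch could look like the following (the exact choices, including the 1/(2*sigma**2) form for b, are my own illustration of that idea, not something tuned for this particular data set):
a0 = bin_heights.max()           # peak height of the histogram
b0 = 1.0 / (2.0 * x.std()**2)    # width of a*exp(-b*(x-c)**2) for a gaussian of std sigma
c0 = x.mean()                    # centre of the distribution
popt, pcov = curve_fit(func, bin_centers, bin_heights, p0=[a0, b0, c0])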
Answered By - nonDucor