Friday, February 11, 2022

[FIXED] How to fit a normal distribution for scatter plot data

Tags: curve-fitting, matplotlib, normal-distribution, pandas, python

Issue

I have a dataframe with x values (column x) and y values (column 1). In the code below I am getting the mean and stdev.

Next I plot the data and the fitted curve together on one chart, but the result looks very wrong. It is not just that the fitted curve is shifted; I am not sure what is wrong with it.

import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize
import numpy as np

data_sample = {'x': [0,1,2,3,4,5,6,7,8,9,10], '1': [0,1,2,3,4,5,4,3,2,1,0]}  
def test_func(x, a, b): 
    return stats.norm.pdf(x,a,b)

params, cov_params = optimize.curve_fit(test_func, data_sample['x'], data_sample['1'])

print(params)

plt.scatter(data_sample['x'], data_sample['1'], label='Data')
plt.plot(data_sample['x'] , test_func(data_sample['x'], params[0], params[1]), label='Fitted function')

plt.legend(loc='best')

plt.show()


Solution

The data needs to be normalized so that the area under the curve is 1, because a probability density function always integrates to 1. When all x-values are 1 apart, the area is simply the sum of the y-values. If the spacing between the x-values is larger or smaller than 1, the sum must also be multiplied by that spacing. Another way to calculate the area is np.trapz() (named np.trapezoid() in NumPy 2.0 and later).
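As a rough illustration of the two ways to get the area, the sketch below computes both for the sample data. The variable names are made up for this example, and the trapezoid rule is written out by hand so the snippet does not depend on which NumPy version provides np.trapz() or np.trapezoid().

```python
import numpy as np

# Sample data from the question
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0], dtype=float)

spacing = np.diff(x)  # distance between consecutive x-values (all 1.0 here)

# Option 1: sum of the y-values times the spacing (assumes even spacing)
area_sum = y.sum() * spacing[0]

# Option 2: trapezoid rule written out by hand (also works for uneven spacing)
area_trapezoid = np.sum((y[1:] + y[:-1]) / 2 * spacing)

print(area_sum, area_trapezoid)
```

For this data both approaches give the same area, because the first and last y-values are 0. With non-zero end points the two numbers would differ slightly.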

Divide the y-values by this normalization factor before doing the fit, and multiply the fitted curve by the same factor when drawing it together with the original data.

When you try to fit the Gaussian pdf function to non-normalized points, the "best" fit is a very narrow, very high peak. Because the area under a pdf is fixed at 1, the curve can only approach the y=5 value in the center by becoming extremely narrow.
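To make the trade-off concrete: the peak of a normal pdf is 1 / (sigma * sqrt(2 * pi)), so the height fixes the width. The sketch below (variable names are illustrative, not from the original code) computes the sigma a pdf would need to reach a height of 5, and how little of that curve is left one unit away from the center. It does not claim to reproduce the exact parameters the unnormalized fit ends up with.

```python
import numpy as np
from scipy import stats

target_height = 5.0  # highest y-value in the sample data

# Peak height of a normal pdf is 1 / (sigma * sqrt(2 * pi)); solve for sigma
sigma_needed = 1 / (target_height * np.sqrt(2 * np.pi))

# Evaluate such a pdf at its center (x=5) and at the neighboring data point (x=4)
peak_height = stats.norm.pdf(5, loc=5, scale=sigma_needed)
neighbor_height = stats.norm.pdf(4, loc=5, scale=sigma_needed)

print(sigma_needed, peak_height, neighbor_height)
```

A curve this narrow is practically zero at every data point except the central one, which is why it cannot follow the rest of the scatter plot.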

The example code below converts the lists to numpy arrays, so functions can be written more easily. Also, to draw a smooth curve, more detailed x-values are used.

import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize
import numpy as np

def test_func(x, a, b):
    return stats.norm.pdf(x, a, b)

data_sample = {'x': np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
               '1': np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0])}

# x_dist = (data_sample['x'].max() - data_sample['x'].min()) / (len(data_sample['x']) - 1)
# normalization_factor = sum(data_sample['1']) * x_dist
normalization_factor = np.trapz(data_sample['1'], data_sample['x'])  # area under the curve
params, pcov = optimize.curve_fit(test_func, data_sample['x'], data_sample['1'] / normalization_factor)

plt.scatter(data_sample['x'], data_sample['1'], clip_on=False, label='Data')
x_detailed = np.linspace(data_sample['x'].min() - 3, data_sample['x'].max() + 3, 200)
plt.plot(x_detailed, test_func(x_detailed, params[0], params[1]) * normalization_factor,
         color='crimson', label='Fitted function')

plt.legend(loc='best')
plt.margins(x=0)
plt.ylim(ymin=0)
plt.tight_layout()
plt.show()
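As an optional sanity check, the mean and standard deviation can also be estimated directly from the data by treating the y-values as weights, and those estimates can be passed to curve_fit as a starting point through its p0 argument. This is only a sketch under that assumption and is not part of the original answer; the names gaussian_pdf, mean_guess and std_guess are made up for the example.

```python
import numpy as np
from scipy import optimize, stats

def gaussian_pdf(x, mu, sigma):
    return stats.norm.pdf(x, mu, sigma)

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0], dtype=float)

# Area under the data (trapezoid rule), used as normalization factor
area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

# Weighted mean and standard deviation, using the y-values as weights
weights = y / y.sum()
mean_guess = np.sum(weights * x)
std_guess = np.sqrt(np.sum(weights * (x - mean_guess) ** 2))

# Fit the pdf to the normalized data, starting from the weighted estimates
params, pcov = optimize.curve_fit(gaussian_pdf, x, y / area,
                                  p0=[mean_guess, std_guess])

print(mean_guess, std_guess)
print(params)
```

If the fitted parameters end up far from the weighted estimates, that is a hint that something is off, for example a missing normalization step.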

PS: Using the original code (without the normalization), but with more detailed x values, the narrow curve would be more apparent:

x_detailed = np.linspace(min(data_sample['x']) - 1, max(data_sample['x']) + 1, 500)
plt.plot(x_detailed, test_func(x_detailed, params[0], params[1]), color='m', label='Fitted function')

Answered By - JohanC

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
