Wednesday, November 16, 2022

[FIXED] Area under the histogram is not 1 when using density in plt.hist

November 16, 2022 distribution, histogram, matplotlib, probability-density, python No comments

Issue

Consider the following dataset with random data:

test_dataset = np.array([ -2.09601881,  -4.26602684,   1.09105452,  -4.59559669,
         1.05865251,  -0.93076762, -14.70398945, -18.01937129,
         4.64126152, -10.34178822,  -9.46058493,  -5.66864965,
        -3.17562022,  15.7030379 ,  10.59675205,  -5.80882413,
       -24.00604149,  -4.81518663,  -1.94333927,   1.18142171,
        12.72030312,   3.84917581,  -0.4468796 ,  11.91828567,
       -17.99171774,   9.35108712,  -5.57233376,   5.77547128,
         5.49296099, -10.96132844, -18.75174336,   5.27843303,
        25.73548956, -21.58043021, -14.24734733,  12.57886018,
       -22.10002076,   1.72207555,  -6.0411867 ,  -3.63568527,
         7.26542117,  -0.21449529,  -6.64974714,  -0.94574606,
        -4.23339431,  16.76199734, -12.42195793,  18.965854  ,
       -23.85336123, -15.55104466,   6.17215868,   7.34993316,
         8.62461351, -16.30482638, -16.35601099,   1.96857833,
        18.74440399, -22.48374434, -10.895831  , -10.14393648,
       -17.62768751,   4.83388855,  20.1578181 ,   6.04299626,
         0.97198296,  -3.40889754, -10.62734293,   1.70240472,
        20.4203839 ,  10.26751364,  15.47859675, -10.97940064,
         1.82728251,   4.22894717,   8.31502887,  -5.48502811,
        -1.09244874, -11.32072796, -24.88520436,  -7.42108403,
        19.4200716 ,   4.82704045, -12.46290135, -15.18466755,
         6.37714692, -11.06825059,   5.10898588,  -9.07485484,
         1.63946084, -12.2270078 ,  12.63776832, -25.03916909,
         2.42972082, -14.22890171,  18.2199446 ,   6.9819771 ,
       -12.07795089,   2.59948596, -16.90206575,   6.35192719,
         7.33823106, -23.69653447, -11.66091871, -19.40251179,
       -12.64863792,  11.04004231,  13.7247356 , -16.36107329,
        20.43227515,  17.97334692,  16.92675175,  -5.62051239,
        -8.66304184,  -8.40848514, -23.20919855,   0.96808137,
        -5.03287253,  -3.13212582,  18.81155666,  -8.27988284,
         3.85708447,  12.43039322,  17.98003878,  18.11009997,
        -3.74294421, -16.62276121,   9.4446743 ,   2.2060981 ,
         8.34853736,  14.79144713,  -1.91113975,  -5.17061419,
         4.53451746,   8.19090358,   7.98343201,  11.44592322,
       -16.9132677 , -25.92554857,  10.10638432,  -8.09236786,
        20.8878207 ,  19.52368296,   0.85858125,   2.61760415,
         9.21360649,  -8.1192651 ,  -6.94829273,   2.73562447,
        13.40981323,  -9.05018331, -17.77563166, -21.03927199,
         4.10415845,  -1.31550732,   5.68284828,  15.08670773,
       -19.78675315,  12.94697869, -11.51797637,   1.91485992,
        16.69417993, -16.04271622,  -1.14028558,   9.79830109,
       -18.58386093,  -7.52963269, -10.10059878, -25.2194216 ,
        -0.10598426, -15.77641532, -14.15999125,  14.35011271,
        11.15178588, -14.43856266,  15.84015226,  -3.41221883,
        11.90724469,   0.57782081,  18.82127466,  -6.01068727,
       -19.83684476,   2.20091942,  -1.38707755,  -8.62821053,
       -11.89000913, -11.69539815,   5.70242019,  -3.83781841,
         5.35894135,  -0.30995954,  21.76661212,   8.52974329,
        -9.13065082, -11.06209   , -12.00654618,   2.769838  ,
       -12.21579496, -27.2686534 ,  -4.58538197,  -6.94388425])

I'd like to plot normalized histogram of it, so in the plt.hist options I choose density=True:

import numpy as np
import matplotlib.pyplot as plt

data1, bins, _ = plt.hist(test_dataset, density=True);
print(np.trapz(data1))
print(sum(data1))

which outputs the following histogram:

0.18206124014272715

0.18866449755723017

From matplotlib documentation:

The density parameter, which normalizes bin heights so that the integral of the histogram is 1. The resulting histogram is an approximation of the probability density function.

But from my example it is clearly seen that the integral of the histogram is NOT 1 and strongly depends on the number of bins: if I specify it for example to be 40 the sum will increase:

data1, bins, _ = plt.hist(test_dataset, density=True);
print(np.trapz(data1))
print(sum(data1))

0.7508847002777762

0.7546579902289207

Is it incorrect description in documentation or I misunderstand some issues here?

Solution

you do not calculate the area, area you should calculate as follow (in your example):

sum(data1 * np.diff(bins)) == 1

Answered By - amirhm

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, November 16, 2022

[FIXED] Area under the histogram is not 1 when using density in plt.hist

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels