Monday, November 27, 2023

[FIXED] Why are histograms incorrectly displayed when the distribution is tightly clustered?

Tags: antialiasing, dpi, histogram, matplotlib, python

Issue

I'm trying to display some data in a histogram, but when the data is too tightly clustered, I get either an empty graph, or one I believe to be inaccurate.

Consider the following code:

import numpy as np
from matplotlib import pyplot as plt

# Generate data
nums = np.random.rand(1000)+1000

# Make Histogram
plt.hist(nums, bins=1000, alpha=0.6, color='blue')  
plt.xlim([900,1100])
plt.yscale('linear')
plt.grid(True)
plt.show()

This gives the following graph:

However, if I change the xlim values to:

plt.xlim([990,1010])

I get:

If I change it, yet again, to

plt.xlim([999,1001])

I get

With each bin covering a smaller range of numbers, I would have expected the peaks of the bins to decrease rather than increase. Is there something I'm not understanding here, or is this a problem with matplotlib?

(Note: this seems very similar to "Empty histogram in matplotlib - data in small interval", but I think I've laid out the problem more explicitly, and I noticed an additional problem even when the resulting plots are not blank: the highest bin in my third plot, with its narrower bins, was taller than the highest bin in my second plot.)

Solution

When dealing with random data, it's always a good idea to set a seed so that you work with the same data on every run.

import numpy as np
from matplotlib import pyplot as plt

# Generate data
np.random.seed(1000)
nums = np.random.rand(1000)+1000

Even when working with exactly the same data, we hit the same problem you described. To illustrate that, here is my plot with four different x-axis limits:

fig, axs = plt.subplots(2, 2)

bins = 100

n0, bins0, patches0 = axs[0,0].hist(nums, bins=bins, color='blue')
axs[0,0].set_xlim([900,1100])
axs[0,0].grid()

n1, bins1, patches1 = axs[0,1].hist(nums, bins=bins, color='blue')
axs[0,1].set_xlim([990,1010])
axs[0,1].grid()

n2, bins2, patches2 = axs[1,0].hist(nums, bins=bins, color='blue')
axs[1,0].set_xlim([999,1002])
axs[1,0].grid()

n3, bins3, patches3 = axs[1,1].hist(nums, bins=bins, color='blue')
axs[1,1].set_xlim([999.9,1001.1])
axs[1,1].grid()

plt.show()

And yes, the problem appears even with 100 bins instead of 1000. But if you check the values in n and bins, they are all the same (as they should be, since the data and the hist call are exactly the same in every panel).

It's possible to check that by doing:

print((n0 == n1).all() and (n0 == n2).all() and (n0 == n3).all())
# True
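To make it concrete that the binning never looks at the axis limits, here is a minimal sketch that calls np.histogram directly on the same seeded data. It assumes, as is my understanding, that plt.hist computes its counts the same way; the variable names counts and edges are mine, not from the original code.

```python
import numpy as np

# Same seeded data as above
np.random.seed(1000)
nums = np.random.rand(1000) + 1000

# Bin the data without drawing anything: no axis limits are involved here
counts, edges = np.histogram(nums, bins=100)

print(counts.sum())          # total number of samples that landed in a bin
print(edges[0], edges[-1])   # the bin edges span only the range of the data
print(counts.max())          # height of the tallest bin, whatever xlim is used later
```

Since set_xlim only changes which part of the x-axis is visible, the tallest bar should have this same height in all four panels. Any difference you see on screen comes from how the bars are drawn.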

If the data are the same but the plots are not, this looks very much like a low-resolution problem. Here are two more pictures:

This one had a fig.set_size_inches(10, 5) before plt.show
And this one had a fig.set_size_inches(100, 50) before plt.show

You can try even higher resolutions on your local PC, but this already resolves the issue.
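For reference, the number of pixels a figure gets is its size in inches multiplied by its dpi, so raising the dpi is another way to get more pixels without changing the physical size. The sketch below assumes an off-screen Agg backend and a dpi of 200, both my choices for illustration and not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption) so this runs without a display
from matplotlib import pyplot as plt

fig, axs = plt.subplots(2, 2)

# Same physical size as the first picture above, but with more dots per inch
fig.set_size_inches(10, 5)
fig.set_dpi(200)

# Canvas size in pixels: inches * dpi in each direction
width_px, height_px = fig.canvas.get_width_height()
print(width_px, height_px)
```

When saving to a file, passing dpi to fig.savefig has a similar effect for the saved image only.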

With low-resolution figures there are more bins to plot than pixels being used, so each bar is narrower than a pixel and matplotlib must be approximating in some way when it draws them (the bin counts themselves are untouched). That's why the plot changes every time you zoom in or out.
If you set a larger size for your figure, then it will be able to show a larger number of bins, making your plot more trustworthy.
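One way to see how many pixels a single bar gets is to map two neighbouring bin edges from data coordinates to display coordinates. This is only a rough sketch: the helper bar_width_in_pixels is a hypothetical name of mine, it uses the off-screen Agg backend, and the default figure size and dpi are assumptions, not values from the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption) so this runs without a display
import numpy as np
from matplotlib import pyplot as plt


def bar_width_in_pixels(xlim, figsize=(6.4, 4.8), dpi=100, bins=100):
    """Hypothetical helper: width of one histogram bar in display pixels."""
    np.random.seed(1000)
    nums = np.random.rand(1000) + 1000

    fig, ax = plt.subplots(figsize=figsize, dpi=dpi)
    _, edges, _ = ax.hist(nums, bins=bins)
    ax.set_xlim(xlim)

    # transData maps data coordinates to display (pixel) coordinates
    left = ax.transData.transform((edges[0], 0))[0]
    right = ax.transData.transform((edges[1], 0))[0]

    plt.close(fig)
    return right - left


for xlim in [(900, 1100), (990, 1010), (999, 1002), (999.9, 1001.1)]:
    print(xlim, bar_width_in_pixels(xlim))
```

With the widest limits each bar should come out as a small fraction of a pixel, and the width grows as you zoom in or pass a larger figsize or dpi. Once the value is comfortably above one pixel, the drawn bars can follow the actual counts.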

Answered By - Ralubrusto

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
