Issue
I'm trying to display some data in a histogram, but when the data is too tightly clustered, I get either an empty graph, or one I believe to be inaccurate.
Consider the following code:
import numpy as np
from matplotlib import pyplot as plt
# Generate data
nums = np.random.rand(1000)+1000
# Make Histogram
plt.hist(nums, bins=1000, alpha=0.6, color='blue')
plt.xlim([900,1100])
plt.yscale('linear')
plt.grid(True)
plt.show()
This gives the following graph:
However, if I change the xlim values to:
plt.xlim([990,1010])
I get:
If I change it, yet again, to
plt.xlim([999,1001])
I get
With each bin covering a smaller range of numbers, I would've expected the peaks of the bins to decrease rather than increase. Is there something I'm not understanding here, or is this a problem with matplotlib? (Note: This seems very similar to "Empty histogram in matplotlib - data in small interval", but I think I've laid out the problem more explicitly and noticed an additional problem even when the resulting plots are not blank: the highest bin in my third plot, with its narrower bins, was taller than the highest bin in the second.)
Solution
When dealing with random data, it's always a good idea to set a seed so that you're working with the same data on every run.
import numpy as np
from matplotlib import pyplot as plt
# Generate data
np.random.seed(1000)
nums = np.random.rand(1000)+1000
Even with the exact same data, we run into the problem you described. To illustrate, here is the same histogram plotted with four different x-axis limits:
fig, axs = plt.subplots(2, 2)
bins = 100
n0, bins0, patches0 = axs[0,0].hist(nums, bins=bins, color='blue')
axs[0,0].set_xlim([900,1100])
axs[0,0].grid()
n1, bins1, patches1 = axs[0,1].hist(nums, bins=bins, color='blue')
axs[0,1].set_xlim([990,1010])
axs[0,1].grid()
n2, bins2, patches2 = axs[1,0].hist(nums, bins=bins, color='blue')
axs[1,0].set_xlim([999,1002])
axs[1,0].grid()
n3, bins3, patches3 = axs[1,1].hist(nums, bins=bins, color='blue')
axs[1,1].set_xlim([999.9,1001.1])
axs[1,1].grid()
plt.show()
And yes, the problem appears even with 100 bins instead of 1000. But if you check the values returned in n and bins, they are all the same (as they should be, since the data being plotted is exactly the same).
It's possible to check that by doing
print((n0 == n1).all() and (n0 == n2).all() and (n0 == n3).all())
# True
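The reason the counts cannot change is that the bins are computed from the data alone, before any axis limits are applied. As a minimal sketch (matplotlib's hist uses numpy.histogram to bin the data, so numpy is called directly here; the variable names counts and edges are only for illustration):

```python
import numpy as np

# Same seeded data as above
np.random.seed(1000)
nums = np.random.rand(1000) + 1000

# Bin the data the way hist does by default: 100 equal-width bins
# spanning the data's own minimum and maximum.
counts, edges = np.histogram(nums, bins=100)

# The bin edges run from min(nums) to max(nums); xlim plays no part in this.
print(np.isclose(edges[0], nums.min()), np.isclose(edges[-1], nums.max()))

# Every sample lands in exactly one bin.
print(counts.sum())

# Width of a single bin in data units (roughly 0.01 for this data).
print(edges[1] - edges[0])
```

So calling set_xlim only changes which part of the x-axis is visible. It does not make the bins narrower, which is why the bar heights should not depend on the zoom level at all.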
If the data are the same but the plots are not, it looks very much like a low-resolution problem. Here you can see two more pictures:
You can try this out with higher resolutions on your local PC, but this does solve the problem.
In low-resolution figures there are more bins to plot than pixels available, so each bar is narrower than a pixel and matplotlib must be approximating the bars in some way when it draws them. That's why the plot changes every time you zoom in or out.
If you set a larger size for your plot, then it will be able to show a larger number of bins, making your plot more trustworthy.
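To get a feel for how small the bars are, here is a rough sketch that estimates the on-screen width of one bar for a default-sized figure and for a larger one. The helper bar_width_in_pixels is hypothetical (it is not part of matplotlib), the figure sizes and dpi values are arbitrary examples, and the Agg backend is selected only so the sketch runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line to use plt.show()
from matplotlib import pyplot as plt

np.random.seed(1000)
nums = np.random.rand(1000) + 1000

def bar_width_in_pixels(ax, bin_edges):
    """Hypothetical helper: approximate width of one histogram bar in pixels."""
    fig = ax.figure
    # Axes width = fraction of the figure it occupies * figure width in pixels
    axes_width_px = ax.get_position().width * fig.get_figwidth() * fig.dpi
    x_min, x_max = ax.get_xlim()
    bin_width = bin_edges[1] - bin_edges[0]
    return bin_width / (x_max - x_min) * axes_width_px

# Compare a default-sized figure with a larger, higher-dpi one
for figsize, dpi in [((6.4, 4.8), 100), ((20, 8), 200)]:
    fig, ax = plt.subplots(figsize=figsize, dpi=dpi)
    n, edges, _ = ax.hist(nums, bins=100, color='blue')
    ax.set_xlim([990, 1010])
    print(figsize, dpi, bar_width_in_pixels(ax, edges))
    plt.close(fig)
```

With the smaller figure, each bar should come out at a fraction of a pixel for these limits, while the larger figure gives each bar more than one pixel. Passing figsize and dpi to plt.subplots (or dpi to savefig) is one way to try this on your own machine.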
Answered By - Ralubrusto