Sunday, January 30, 2022

[FIXED] Central Limit Theorem: Sample means do not follow a normal distribution

Tags: jupyter-notebook, matplotlib, simulation, statistics

Issue

The Problem

Good evening.

I am learning about the Central Limit Theorem. As practice, I ran simulations in an attempt to find the mean of a fair die (I know, a toy problem).

I took 4000 samples, and in each sample I rolled a die 50 times (the code is at the bottom). For each of these 4000 samples I computed the mean. Then I plotted these 4000 sample means in a histogram (with bin width 0.03) using matplotlib.

Here is the result:

Question

Why aren't the sample means normally distributed given that the conditions for CLT (sample size >= 30) were respected?

Specifically, why does the histogram look like two normal distributions superimposed on top of each other? More intriguingly, why does the "outer" distribution look "discrete" with empty spaces occurring at regular intervals?

It almost seems like the result is off in a systematic way.

All help is greatly appreciated. I am very lost.

Supplementary Code

This is the code I used to generate the 4000 sample means:

"""
Take multiple samples of dice rolls. For
each sample, compute the sample mean.

With the sample means, plot a histogram.
By the Central Limit Theorem, the sample
means should be normally distributed.

"""

import random

sample_means = []

num_samples = 4000

for i in range(num_samples):
    # Large enough for CLT to hold
    num_rolls = 50
    
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)

Solution

When num_rolls equals 50, each possible mean will be a fraction with denominator 50. So, in reality, you are looking at a discrete distribution.
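As a quick check of that claim, the sketch below enumerates every mean that 50 rolls can produce, using exact fractions so no rounding is involved. The variable names `possible_means` and `gaps` are illustrative only and do not appear in the original code.

```python
from fractions import Fraction

num_rolls = 50

# The sum of 50 dice is an integer between 50 and 300, so every
# attainable mean is that integer divided by 50.
possible_means = [Fraction(total, num_rolls)
                  for total in range(num_rolls, 6 * num_rolls + 1)]

# Distance between consecutive attainable means
gaps = {b - a for a, b in zip(possible_means, possible_means[1:])}

print(len(possible_means))  # 251 distinct values
print(gaps)                 # {Fraction(1, 50)}, i.e. a spacing of 0.02
```

So the sample means can only land on a grid with spacing 0.02, however many samples are drawn.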

To create a histogram of a discrete distribution, the bin boundaries are best placed midway between the possible values. The possible means are 0.02 apart, so with a step size of 0.03 the bins alternately contain one and two possible values, and every second bin collects roughly twice as many samples as its neighbor. In addition, some bin boundaries coincide with possible values, and due to subtle floating-point rounding the result can become unpredictable when values and boundaries coincide.
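The effect of the step size can be counted directly. The following is a minimal sketch, not part of the original answer: `values_per_bin` is a hypothetical helper that works in integer hundredths (3.01 becomes 301) to sidestep floating-point issues, and it assumes half-open bins `[lo, hi)` starting at 3.01.

```python
def values_per_bin(step, start=301, stop=400):
    """Count how many attainable means fall in each half-open bin [lo, hi).

    All quantities are integer hundredths (3.01 -> 301, step 0.03 -> 3),
    so no floating-point rounding is involved.
    """
    edges = list(range(start, stop, step))
    # Attainable means between 3.00 and 4.00 are multiples of 0.02
    values = range(300, 401, 2)
    return [sum(lo <= v < hi for v in values)
            for lo, hi in zip(edges, edges[1:])]

print(values_per_bin(3)[:6])   # step 0.03 -> [1, 2, 1, 2, 1, 2]
print(set(values_per_bin(2)))  # step 0.02 -> {1}
```

With a step of 0.03 the bins alternate between one and two attainable values, which would explain the appearance of two superimposed bell curves. With a step of 0.02 every bin holds exactly one value.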

Here is some code to illustrate what is going on:

from matplotlib import pyplot as plt
import numpy as np
import random

sample_means = []
num_samples = 4000

for i in range(num_samples):
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)

    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)

fig, axs = plt.subplots(2, 2, figsize=(14, 8))

random_y = np.random.rand(len(sample_means))
for (ax0, ax1), step in zip(axs, [0.03, 0.02]):
    bins = np.arange(3.01, 4, step)
    ax0.hist(sample_means, bins=bins)
    ax0.set_title(f'step={step}')
    ax0.vlines(bins, 0, ax0.get_ylim()[1], ls=':', color='r')  # show the bin boundaries in red
    ax1.scatter(sample_means, random_y, s=1)  # show the sample means with a random y
    ax1.vlines(bins, 0, 1, ls=':', color='r')  # show the bin boundaries in red
    ax1.set_xticks(np.arange(3, 4, 0.02))
    ax1.set_xlim(3.0, 3.3)  # zoom in on a region to better see the bins
    ax1.set_title('bin boundaries between values' if step == 0.02 else 'chaotic bin boundaries')
plt.show()

PS: Note that the code would run much faster if it worked entirely with NumPy arrays instead of Python lists.
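One possible vectorized version is sketched below. It assumes NumPy's `Generator` API (`np.random.default_rng`), and the seed is chosen arbitrarily so the run is reproducible. It draws all rolls at once as a 4000 x 50 array and averages along each row.

```python
import numpy as np

num_samples = 4000
num_rolls = 50

rng = np.random.default_rng(seed=0)  # seed is arbitrary, for reproducibility

# One row per sample, one column per roll; the upper bound 7 is exclusive,
# so the values are the integers 1..6
rolls = rng.integers(1, 7, size=(num_samples, num_rolls))

# Mean of each row gives the 4000 sample means
sample_means = rolls.mean(axis=1)
```

The resulting `sample_means` array can be passed to `ax0.hist` in place of the Python list.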

Answered By - JohanC

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
