Tuesday, August 9, 2022

[FIXED] How to calculate the probability between two numbers from a probability distribution in python

Labels: kernel-density, matplotlib, numpy, python, seaborn

Issue

I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:

from sklearn.neighbors import KernelDensity
import numpy as np

x = np.random.normal(loc=0.0, scale=1.0, size=1000000)

kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))

def get_probability(start_value, end_value, eval_points, kd):
    
    # Number of evaluation points 
    N = eval_points                                      
    step = (end_value - start_value) / (N - 1)  # Step size

    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)

get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)

0.6338

This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.

I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:

x = np.random.randint(100, size=(1000000))

# sns.kdeplot(x)  # this is how I'd generate a kdeplot of this data

kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))

def get_probability(start_value, end_value, eval_points, kd):
    
    # Number of evaluation points 
    N = eval_points                                      
    step = (end_value - start_value) / (N - 1)  # Step size

    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)

get_probability(np.median(x), x.max(), 100, kd)

0.4946

And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?

EDIT: Could you potentially calculate something by using the data generated from a kdeplot?

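import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.kdeplot(x, ax=ax)  # draw on the axes we read the line back from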
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()

And implement np.interp() somehow?

More edits:

Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:

from scipy.stats import norm
import numpy as np

np.random.seed(42)

x = np.random.normal(loc=0.0, scale=1.0, size=10000000)

norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())

However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's EPA per pass from last season:

import pandas as pd
import seaborn as sns
import random
import numpy as np

YEAR = 2021

data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
    + str(YEAR) + '.csv.gz?raw=True',compression='gzip', low_memory=False
    )

df = data.loc[data.passer == 'T.Brady','epa'].copy()

# tom brady's distribution
sns.kdeplot(df)

sample_mean = []

for i in range(50):
  y = np.random.choice(df, 500)
  avg = np.mean(y)
  sample_mean.append(avg)

# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)

Could we use sampling means or even just bootstrap resampling methods to:

1. Make a more "normal" distribution with sampling means in order to incorporate CDFs if the initial distribution doesn't quite appear normal? (This, though, would be a distribution of means rather than individual samples. Is this not encouraged?)

2. If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?

Solution

Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools, particularly since there are several statistical approaches to choose from.

1. Probability theory

Given two numbers, let's call them lower and upper, what probability is enclosed between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Equivalently, p is the area under the graph of the probability density function (PDF) f over the interval [lower, upper].
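As a quick sanity check of both statements, here is a minimal sketch for the standard normal distribution, where F is available in closed form through scipy.stats.norm. The helper name interval_probability is hypothetical and only serves the illustration.

```python
from scipy import stats
from scipy.integrate import quad


def interval_probability(dist, lower, upper):
    """P(lower <= X <= upper) for a frozen scipy.stats distribution."""
    return dist.cdf(upper) - dist.cdf(lower)


std_normal = stats.norm(loc=0.0, scale=1.0)

# Route 1: difference of CDF values, p = F(upper) - F(lower)
p_cdf = interval_probability(std_normal, -1.0, 1.0)

# Route 2: area under the PDF over the same interval
p_area, _abs_err = quad(std_normal.pdf, -1.0, 1.0)

print(round(p_cdf, 4), round(p_area, 4))  # 0.6827 0.6827
```

Both routes reproduce the 68% of the 68-95-99.7 rule, which is the value the question expected for one standard deviation around the mean.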

However, when the CDF/PDF is unknown, this becomes a statistical question. In a nutshell, estimating the PDF f and computing the area under its graph over the interval will do. But there are several paradigms and estimation procedures for obtaining that estimate.

1. Parametric estimation

One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or for convenience. Then one just needs to estimate its parameters mu (aka mean or location) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting: it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.

import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)

# probability 
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)

# plot 
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x  = np.arange(lower, upper, 0.01), 
                 y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat) ,
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()

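which yields a plot of the fitted normal density, with the area over [lower, upper] shaded in red and labelled with the value of p.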
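As a rough cross-check on the normality assumption, one could compare the parametric estimate with the plain share of observations that fall inside the interval. The sketch below reuses the same data and interval as above; the names p_parametric and p_empirical are illustrative and not part of the original answer.

```python
import numpy as np
from scipy import stats

lower, upper = 0.0, 2.0
x = np.array([-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231,
              3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814])

# Parametric estimate, as in the block above (assumes normality)
loc_hat, scale_hat = stats.norm.fit(x)
p_parametric = (stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat)
                - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat))

# Assumption-free cross-check: share of observations inside the interval
p_empirical = np.mean((x >= lower) & (x <= upper))

print(round(p_parametric, 3), round(p_empirical, 3))
```

With only 14 observations the two numbers can differ noticeably, so a gap between them is a hint rather than proof that the normal model fits poorly.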

2. Non-parametric estimation

In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular way to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area under the density's graph over the interval [lower, upper] directly.

In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:

import numpy as np
from matplotlib import pyplot as plt
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

lower, upper = 0.0, 2.0

x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))

def f_pred(x):
    '''wrapper that evaluates the estimated density at a single point x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]

# quad returns a tuple: (integral estimate, absolute error estimate)
p = quad(func=f_pred, a=lower, b=upper)

# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x  = np.arange(lower, upper, 0.01), 
                 y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()

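and yields a plot of the kernel density estimate, again with the area over [lower, upper] shaded in red and labelled with the value of p.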
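One point worth checking when a kernel density estimate disagrees with the 68-95-99.7 rule, as in the question, is the bandwidth. A Gaussian-kernel KDE is an equal-weight mixture of normals centred at the data points, so its area over an interval has a closed form, and for standard normal data it behaves roughly like a normal with variance 1 + bandwidth². The sketch below uses only NumPy and SciPy; the function name gaussian_kde_interval_probability is hypothetical, and the sample size and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats


def gaussian_kde_interval_probability(samples, bandwidth, lower, upper):
    """Exact area under a Gaussian-kernel KDE over [lower, upper].

    Each data point contributes a normal with standard deviation
    `bandwidth`, so the area is the mean of per-point CDF differences.
    """
    samples = np.asarray(samples, dtype=float)
    upper_cdf = stats.norm.cdf((upper - samples) / bandwidth)
    lower_cdf = stats.norm.cdf((lower - samples) / bandwidth)
    return float(np.mean(upper_cdf - lower_cdf))


rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=200_000)

# Bandwidth 0.5 (as in the question) versus a much smaller bandwidth
p_wide = gaussian_kde_interval_probability(samples, 0.5, -1.0, 1.0)
p_narrow = gaussian_kde_interval_probability(samples, 0.05, -1.0, 1.0)

# What a normal with variance 1 + 0.5**2 would give over the same interval
p_smoothed_theory = 2 * stats.norm.cdf(1 / np.sqrt(1 + 0.5**2)) - 1

print(round(p_wide, 3), round(p_narrow, 3), round(p_smoothed_theory, 3))
```

Under these assumptions the wide-bandwidth estimate should land near 0.63 and the narrow one near 0.68. That suggests the gap seen in the question comes largely from the smoothing introduced by bandwidth=0.5, not from the idea of integrating the density.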

Answered By - 7shoe

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
