Issue
The problem
I have a dataset with 4 numeric features and 1000 datapoints. The distribution of the values is unknown (numpy randint generates uniform ints, but this is just for the purpose of illustration). Given new datapoint (4 numbers) I want to find what is the cumulative probability (single number) of this specific datapoint.
import numpy as np
data = np.random.randint(1, 100, size=(1000, 4))
array([[28, 52, 91, 66],
[78, 94, 95, 12],
[60, 63, 43, 37],
...,
[81, 68, 45, 46],
[14, 38, 91, 46],
[37, 51, 68, 97]])
new_data = np.random.randint(1, 100, size=(1, 4))
array([[75, 24, 39, 94]])
I've tried:
Scipy
Can estimate pdf, do not know how to estimate cumulative probability. Possible ways are monte-carlo sim or integration (scipy.integrate.nquad) which is too slow for my case Integrate 2D kernel density estimate.
import scipy.stats
kde = scipy.stats.gaussian_kde(data.T)
kde.pdf(new_data)
Scikit-learn
Same as above, do not know how to estimate cumulative probability.
from sklearn.neighbors import KernelDensity
model = KernelDensity()
model.fit(data)
np.exp(model.score_samples(new_data))
Statsmodels
Can not archive anything as this only accept 1d data.
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(data[:, 0])
ecdf(new_data[0][0])
The question is, is there a fast and efficient way to estimate cumulative probability of a 4-dimentional datapoint having the provided scipy or sklearn (preferably) models?
Am I moving in the right direction or is there a completely different way to solve this? Maybe variational autoencoders is the way to go? Are there simple ways to solve this?
Solution
A multivariate ecdf at a point would just compute the fraction of observations with values smaller than the point.
Something like the following
np.random.seed(0)
data = np.random.randint(1, 100, size=(1000, 4))
new_data = np.random.randint(1, 100, size=(2, 4))
def ecdf_mv(new_data, data):
new_data = np.atleast_2d(new_data)
ecdf = []
for row in new_data:
ecdf.append((data <= row).all(1).mean())
return np.asarray(ecdf)
ecdf_mv(new_data, data)
array([0.039, 0.002])
some checks:
ecdf_mv(np.ones(4) * 100 / 2, data), 0.5**4
(array([0.067]), 0.0625)
marginal = 100 * np.ones((4, 4)) - 50 * np.eye(4)
ecdf_mv(marginal, data)
array([0.521, 0.515, 0.502, 0.54 ])
In the univariate case we can sort the data to get a fast algorithm to compute the ecdf at the original points.
I don't know if there is a data structure or algorithm that is computationally more efficient than the brute force comparison, if the ecdf has to be evaluated at many points.
Answered By - Josef
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.