Wednesday, November 29, 2023

[FIXED] Create large random boolean matrix with numpy

November 29, 2023 numpy, python, random No comments

Issue

I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

But sadly it does not seem to terminate for this big N. So I tried to split it up into the generation of the single rows by doing this:

N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range (N):
     mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])            
     if (i % 100 == 0):
          print(i)

Now, there happens something strange (at least on my device): The first ~1100 rows are very fastly generated - but after it, the code becomes horribly slow. Why is this happening? What do I miss here? Are there better ways to create a big matrix which has True entries with probability p and False entries with probability 1-p?

Edit: As many of you assumed that the RAM will be a problem: As the device which will run the code has almost 500GB RAM, this won't be a problem.

Solution

Really surprised no one has mentioned this solution yet..

This line

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

runs NXN Bernoulli Trials. (In your case, 900M of them!) A Bernoulli trial is just a random experiment with two possible outcomes, with probabilities p and 1-p.

The sum of N Bernoulli trials, each with probability p, can be modeled by the Binomial distribution.

We can leverage this fact to randomly simulate the total count of True elements. With NumPy,

import numpy as np

N = 30000
p = 0.1

# Build a random number generator
rng = np.random.default_rng(123)

# Randomly determine the total number of True values
Ntrue = rng.binomial(n=N*N, p=p, size=1)[0]  # 90016776

Now we can randomly determine the position of each True element by randomly choosing row and col indices without replacement.

# Randomly determine true position
position_ids = rng.choice(a=N*N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N,N))

And now we can populate a compressed sparse row (CSR) matrix.

from scipy import sparse

# Build a compressed sparse row matrix with the constructor:
# csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
result = sparse.csr_matrix((np.ones(shape=Ntrue), positions), shape=(N,N))

Notice this solution avoids storing and computing 900M boolean values.

Funny enough, I wrote about a nearly identical problem before stumbling upon this question.

Answered By - Ben

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, November 29, 2023

[FIXED] Create large random boolean matrix with numpy

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels