Saturday, October 29, 2022

[FIXED] Select values from list based on bools in numpy array


Issue

For each row (sublist) of array b, return the values from a at the positions where that row is True.

import pandas as pd
import numpy as np

a = pd.Series([1, 3, 5, 7, 9])  # values to choose from
b = np.array([[False, True, False, True, False],  # based on bools
              [False, False, False, False, False]])

out = []
for i, v in enumerate(b):
    out.append([])
    for j in range(len(v)):
        if v[j]:
            out[i].append(a[j])

out = np.array(out, dtype=object)  # ragged result: [[3, 7], []]

# In the first sublist, True is at indices 1 and 3, which correspond to values 3 and 7.
# In the second sublist, there is no True, hence it is empty.

The above seems too laborious and it is possibly not making use of numpy vectorization (it is slow on large data).

Solution

Your Series is 1d; b is a 2d array. The Series also has row indices, which a plain array does not.

In [70]: a.shape, b.shape
Out[70]: ((5,), (2, 5))

In [71]: a
Out[71]: 
0    1
1    3
2    5
3    7
4    9
dtype: int64

We can use a row of b, a 1d array of shape (5,), to select elements from a:

In [72]: a[b[0,:]]
Out[72]: 
1    3
3    7
dtype: int64

In [73]: a[b[1,:]]
Out[73]: Series([], dtype: int64)

Since the rows produce different length results, we can't do that selection in one step. a[b] gives an error, with the mismatch between (5,) and (2,).
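A minimal sketch of that failure, using the plain NumPy values (what `a.to_numpy()` returns) rather than the Series, since the exact exception raised by pandas may differ between versions. The name `values` is only for illustration. A 2d boolean mask cannot index a 1d array, so NumPy raises an IndexError:

```python
import numpy as np

values = np.array([1, 3, 5, 7, 9])           # same data as a.to_numpy()
b = np.array([[False, True, False, True, False],
              [False, False, False, False, False]])

try:
    values[b]                                 # 2d mask applied to a 1d array
    failed = False
except IndexError as exc:
    failed = True
    print("IndexError:", exc)
```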

It may be simpler to work with the array version of a, also 1d, but without row indices:

In [103]: A = a.to_numpy(); A
Out[103]: array([1, 3, 5, 7, 9], dtype=int64)

Applying a row of b to index that:

In [104]: A[b[0]]
Out[104]: array([3, 7], dtype=int64)

And iteratively doing that for all rows:

In [105]: [A[row] for row in b]
Out[105]: [array([3, 7], dtype=int64), array([], dtype=int64)]

We can make a (2,5) array from A, and apply the b boolean mask - but the result will be 1d, with no indication that the 2nd row did not select anything:

In [106]: np.vstack((A,A))
Out[106]: 
array([[1, 3, 5, 7, 9],
       [1, 3, 5, 7, 9]], dtype=int64)

In [107]: np.vstack((A,A))[b]
Out[107]: array([3, 7], dtype=int64)

Indexing with a row of b or b itself is what I was calling a 'whole-array' operation. But using the rows of b individually can't be done that way; it requires a Python level iteration.
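If the Python-level loop over rows becomes the bottleneck, one possible sketch is to do the selection itself as a single whole-array step and only split the flat result afterwards. This is not part of the original answer; it assumes `np.nonzero` and `np.split`, and `np.split` still builds a Python list of arrays, so the result is the same kind of ragged list as in [105], not a 2d array.

```python
import numpy as np

A = np.array([1, 3, 5, 7, 9])
b = np.array([[False, True, False, True, False],
              [False, False, False, False, False]])

# Row and column index of every True in b, in row-major order.
rows, cols = np.nonzero(b)

# One whole-array gather: the selected values, flattened across rows.
flat = A[cols]

# Number of True values per row tells us where each row's values end.
counts = b.sum(axis=1)
split_points = np.cumsum(counts)[:-1]

# Cut the flat result back into one array per row of b.
groups = np.split(flat, split_points)
print(groups)
```

Whether this is faster than the list comprehension depends on the shape of `b`; it tends to help most when there are many short rows.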

There are some other ways of working with A and b:

Multiplication works, where b is treated as an array of 0s and 1s:

In [111]: A*b
Out[111]: 
array([[0, 3, 0, 7, 0],
       [0, 0, 0, 0, 0]], dtype=int64)
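One caveat with the product: a 0 in the output is indistinguishable from a genuine 0 in A. A hedged alternative, assuming a sentinel such as -1 never occurs in the data, is `np.where`, which broadcasts A against b in the same way:

```python
import numpy as np

A = np.array([1, 3, 5, 7, 9])
b = np.array([[False, True, False, True, False],
              [False, False, False, False, False]])

# Keep A where b is True, otherwise use the sentinel -1 (an assumption:
# pick a value that cannot appear in the real data).
filled = np.where(b, A, -1)
print(filled)
```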

There is also a masked array subclass of arrays:

In [112]: np.ma.masked_array(np.vstack((A,A)),~b)
Out[112]: 
masked_array(
  data=[[--, 3, --, 7, --],
        [--, --, --, --, --]],
  mask=[[ True, False,  True, False,  True],
        [ True,  True,  True,  True,  True]],
  fill_value=999999,
  dtype=int64)
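A short sketch of what the masked array allows afterwards, assuming the same A and b: per-row counts and reductions that skip the masked entries, and `compressed()` to pull out the unmasked values of a single row.

```python
import numpy as np

A = np.array([1, 3, 5, 7, 9])
b = np.array([[False, True, False, True, False],
              [False, False, False, False, False]])

# Mask is True where a value should be hidden, hence ~b.
m = np.ma.masked_array(np.vstack((A, A)), mask=~b)

counts = m.count(axis=1)         # number of selected values in each row
row_sums = m.sum(axis=1)         # a fully masked row gives a masked result
first_row = m[0].compressed()    # plain 1d array of row 0's selected values

print(counts, row_sums, first_row)
```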

The [105] list of arrays can be turned into an object dtype array:

In [115]: np.array([A[row] for row in b],object)
Out[115]: array([array([3, 7], dtype=int64), array([], dtype=int64)], dtype=object)

This is 1d, with shape (2,). Sometimes it's useful, but performance-wise it is not an improvement over the list.

Answered By - hpaulj

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
