Tuesday, January 9, 2024

[FIXED] Efficiently convert numpy array of arrays to pandas series of arrays

January 09, 2024 arrays, numpy, pandas, python

Issue

How would I efficiently convert a numpy array of arrays into a list of arrays? Ultimately, I want to make a pandas Series of arrays to use as a column in a dataframe. If there is a better way to go directly to that, that would also be good.

The following reproducible code solves the issue with list() or .tolist(), but either is much too slow to implement on my actual data set. I am looking for something much faster.

import numpy as np 
import pandas as pd

a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])

s = pd.Series(a.tolist())

s = pd.Series(list(a))

This results in the shape going from a.shape = (2,4) to s.values.shape = (2,).

Solution

Your a:

In [2]: a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])
   ...:

a is a (2,4) numeric array; we could have just written a = np.array([[0,1,2,3],[4,5,6,7]]). Creating a (2,) array of arrays requires a different construction.

As others wrote, making a DataFrame from this is trivial:

In [3]: pd.DataFrame(a)     # dtypes int64
Out[3]: 
   0  1  2  3
0  0  1  2  3
1  4  5  6  7

But making a series from it raises an error:

In [4]: pd.Series(a)
---------------------------------------------------------------------------
...
Exception: Data must be 1-dimensional

Your question would have been clearer if it had shown this error and explained that this is why you tried the list inputs:

In [5]: pd.Series(a.tolist())
Out[5]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
dtype: object
In [6]: pd.Series(list(a))
Out[6]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
dtype: object

On the surface these are the same, but when we look at actual elements of the Series, we see that one contains lists, the other arrays. That's because tolist and list() create different lists from the array.

In [8]: Out[5][0]
Out[8]: [0, 1, 2, 3]
In [9]: Out[6][0]
Out[9]: array([0, 1, 2, 3])

My experience is that a.tolist() is quite fast. list(a) is equivalent to [i for i in a]; in effect it iterates on the first dimension of a, returning (in this case) a 1d array (row) each time.
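The difference between the two conversions can be checked directly. This is a minimal sketch using the same small array as above; the names as_lists and as_arrays are only illustrative.

```python
import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

as_lists = a.tolist()   # converts all the way down to Python lists and ints
as_arrays = list(a)     # iterates over the first dimension only

print(type(as_lists[0]))      # <class 'list'>
print(type(as_lists[0][0]))   # <class 'int'>
print(type(as_arrays[0]))     # <class 'numpy.ndarray'>

# Each row produced by list(a) is a view that shares memory with a.
print(np.shares_memory(as_arrays[0], a))   # True
```

Because the rows from list(a) are views, modifying one of them in place also modifies a.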

Let's change a so it is a 1d object dtype array:

In [14]: a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7]), np.array([1]), None], dtype=object)   # recent numpy versions require dtype=object for ragged input
In [15]: a
Out[15]: 
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([1]), None],
      dtype=object)

Now we can make a Series from it:

In [16]: pd.Series(a)
Out[16]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
2             [1]
3            None
dtype: object
In [17]: Out[16][0]
Out[17]: array([0, 1, 2, 3])

In fact we could make a series from a slice of a, the one containing just the original 2 rows:

In [18]: pd.Series(a[:2])
Out[18]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
dtype: object

The tricks for constructing 1d object dtype arrays have been discussed in depth in other SO questions.
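One commonly used trick, sketched here for rows of equal length, is to preallocate the object array and then assign into it. The variable names are illustrative only.

```python
import numpy as np

rows = [np.array([0, 1, 2, 3]), np.array([4, 5, 6, 7])]

# Passing dtype=object is not enough when the rows have equal length:
# numpy still builds a 2d array, just with object dtype.
two_d = np.array(rows, dtype=object)
print(two_d.shape)   # (2, 4)

# Preallocating fixes the shape, so each row is stored as one element.
one_d = np.empty(len(rows), dtype=object)
one_d[:] = rows
print(one_d.shape)   # (2,)
print(type(one_d[0]))   # <class 'numpy.ndarray'>
```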

Beware that a Series like this does not behave like a multicolumn DataFrame. I've seen attempts to write csv files, where elements like this get saved as quoted strings.
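The csv caveat can be reproduced with a small round trip. This sketch uses list elements (as in Out[5]) and an in-memory buffer rather than a file; the column name arr is an assumption made for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"arr": pd.Series([[0, 1, 2, 3], [4, 5, 6, 7]])})

# Each list is written as its string representation; since that text
# contains commas, the csv writer wraps it in quotes.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives strings, not lists or arrays.
restored = pd.read_csv(io.StringIO(csv_text))
print(type(restored["arr"][0]))   # <class 'str'>
```

If the data has to survive a csv round trip, a regular multicolumn DataFrame such as pd.DataFrame(a) is usually the safer layout.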

Let's compare some construction times.

Make larger arrays of the 2 types:

In [25]: a0 = np.ones([1000,4],int)
In [26]: a1 = np.empty(1000, object)
In [27]: a1[:] = [np.ones(4,int) for _ in range(1000)]
# a1[:] = list(a0)   # faster

First make a DataFrame:

In [28]: timeit pd.DataFrame(a0)
136 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is the same time as for Out[3]; apparently just the overhead of making a DataFrame with a 2d array (any size) as values.

Making a series as you did:

In [29]: timeit pd.Series(list(a0))
434 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [30]: timeit pd.Series(a0.tolist())
315 µs ± 5.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Both of these take longer than for the small a, reflecting the iterative nature of the creation.

And with the 1d object array:

In [31]: timeit pd.Series(a1)
103 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is the same as for the small 1d array. As with In[28] I think there's just the overhead of creating a Series object, and then assigning it an unchanged values array.
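The timings above come from IPython's timeit. Outside IPython, roughly the same comparison can be sketched with the standard library timeit module. The absolute numbers depend on the machine and on the numpy and pandas versions, so none are quoted here; the candidates and results dictionaries are just scaffolding for the example.

```python
import timeit

import numpy as np
import pandas as pd

a0 = np.ones([1000, 4], int)    # 2d numeric array
a1 = np.empty(1000, object)     # 1d object array holding the same rows
a1[:] = list(a0)

candidates = {
    "pd.DataFrame(a0)": lambda: pd.DataFrame(a0),
    "pd.Series(list(a0))": lambda: pd.Series(list(a0)),
    "pd.Series(a0.tolist())": lambda: pd.Series(a0.tolist()),
    "pd.Series(a1)": lambda: pd.Series(a1),
}

results = {}
for label, func in candidates.items():
    # Best of 3 repeats, 200 calls each, reported per call.
    per_call = min(timeit.repeat(func, number=200, repeat=3)) / 200
    results[label] = per_call
    print(f"{label:26s} {per_call * 1e6:8.1f} µs per call")
```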

Note, though, that constructing the a1 array in the first place is the slower step.

An object array like a1 is in many ways just like a list: it contains pointers to objects elsewhere in memory. It can be useful if the elements differ in type (e.g. include strings or None), but computationally it is not the equivalent of a 2d array.

In sum, if the source array really is a 1d object dtype array, you can quickly create a Series from it. If it is really a 2d array, you'll need, in some way or other, to convert it to a list or 1d object array first.
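Both cases can be wrapped in one small function. The helper below, to_series_of_arrays, is hypothetical (it is not part of numpy or pandas) and is only a sketch of the summary above, assuming the input is either 1d or 2d.

```python
import numpy as np
import pandas as pd


def to_series_of_arrays(arr):
    """Hypothetical helper: return a Series with one element per row."""
    arr = np.asarray(arr)
    if arr.ndim == 1:
        # Already 1d (e.g. an object dtype array): pass it straight through.
        return pd.Series(arr)
    if arr.ndim != 2:
        raise ValueError("expected a 1d or 2d array")
    # 2d case: box each row (a view of arr) into a 1d object array first.
    boxed = np.empty(arr.shape[0], dtype=object)
    boxed[:] = list(arr)
    return pd.Series(boxed)


a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
s = to_series_of_arrays(a)
print(s.values.shape)   # (2,)
print(type(s[0]))       # <class 'numpy.ndarray'>

# The Series can then be used as a column in an existing DataFrame.
df = pd.DataFrame({"id": [10, 11]})
df["arr"] = s
```

The conversion of the 2d array still iterates over its rows, so this does not remove that cost; it only keeps the Series construction itself cheap.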

Answered By - hpaulj

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
