Issue
How would I efficiently convert a numpy array of arrays into a list of arrays? Ultimately, I want to make a pandas Series of arrays to use as a column in a DataFrame. If there is a better way to get there directly, that would also be good.
The following reproducible code solves the issue with list() or .tolist(), but either is much too slow to implement on my actual data set. I am looking for something much faster.
import numpy as np
import pandas as pd
a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])
s = pd.Series(a.tolist())
s = pd.Series(list(a))
This results in the shape going from a.shape = (2,4) to s.values.shape = (2,).
Solution
Your a:
In [2]: a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])
...:
a is a (2,4) numeric array; we could have just written a = np.array([[0,1,2,3],[4,5,6,7]]). Creating a (2,) array of arrays requires a different construction.
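As a minimal sketch of that "different construction", the snippet below preallocates a 1d object array and fills it by slice assignment, the same pattern used for a1 in the timing section. The name a_obj is only illustrative.

```python
import numpy as np
import pandas as pd

# np.array() merges equal-length rows into one 2d numeric array.
a = np.array([np.array([0, 1, 2, 3]), np.array([4, 5, 6, 7])])
print(a.shape, a.dtype)

# Preallocate a 1d object array, then fill it with the row arrays.
# a_obj is a hypothetical name used only for this illustration.
a_obj = np.empty(len(a), dtype=object)
a_obj[:] = list(a)
print(a_obj.shape, a_obj.dtype)

# A 1d object array is accepted by pd.Series without conversion to a list.
s = pd.Series(a_obj)
print(type(s.iloc[0]))
```

Each element of a_obj is itself a 1d array, so the outer array has shape (2,) rather than (2,4).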
As others wrote, making a DataFrame from this is trivial:
In [3]: pd.DataFrame(a) # dtypes int64
Out[3]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
But making a series from it raises an error:
In [4]: pd.Series(a)
---------------------------------------------------------------------------
...
Exception: Data must be 1-dimensional
Your question would have been clearer if it had shown this error and explained that this is why you tried the list inputs:
In [5]: pd.Series(a.tolist())
Out[5]:
0 [0, 1, 2, 3]
1 [4, 5, 6, 7]
dtype: object
In [6]: pd.Series(list(a))
Out[6]:
0 [0, 1, 2, 3]
1 [4, 5, 6, 7]
dtype: object
On the surface these are the same, but when we look at actual elements of the Series, we see that one contains lists, the other arrays. That's because tolist() and list() create different lists from the array.
In [8]: Out[5][0]
Out[8]: [0, 1, 2, 3]
In [9]: Out[6][0]
Out[9]: array([0, 1, 2, 3])
My experience is that a.tolist() is quite fast. list(a) is equivalent to [i for i in a]; in effect it iterates on the first dimension of a, returning (in this case) a 1d array (row) each time.
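A short sketch of that difference, inspecting the element types each call produces. The names nested and rows are only illustrative.

```python
import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

# tolist() converts all the way down: a list of lists of Python ints.
nested = a.tolist()
print(type(nested[0]), type(nested[0][0]))

# list() only iterates over the first dimension: a list of 1d arrays.
rows = list(a)
print(type(rows[0]), type(rows[0][0]))

# Each row produced by iteration is a view that shares memory with a.
print(np.shares_memory(rows[0], a))
```

So a Series built from a.tolist() holds plain lists, while one built from list(a) holds numpy arrays.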
Let's change a so it is a 1d object dtype array:
In [14]: a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7]), np.array([1]), None], dtype=object)
In [15]: a
Out[15]:
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([1]), None],
dtype=object)
Now we can make a Series from it:
In [16]: pd.Series(a)
Out[16]:
0 [0, 1, 2, 3]
1 [4, 5, 6, 7]
2 [1]
3 None
dtype: object
In [17]: Out[16][0]
Out[17]: array([0, 1, 2, 3])
In fact we could make a series from a slice of a, the one containing just the original 2 rows:
In [18]: pd.Series(a[:2])
Out[18]:
0 [0, 1, 2, 3]
1 [4, 5, 6, 7]
dtype: object
The tricks for constructing 1d object dtype arrays have been discussed in depth in other SO questions.
Beware that a Series like this does not behave like a multicolumn DataFrame. I've seen attempts to write csv files, where elements like this get saved as quoted strings.
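To illustrate the CSV caveat, here is a small sketch that writes such a column to an in-memory buffer and reads it back. The column name arr is an assumption made for the example, and the exact text written may vary with library versions.

```python
import io

import numpy as np
import pandas as pd

# Build a one-column frame whose cells are numpy arrays.
obj = np.empty(2, dtype=object)
obj[:] = [np.array([0, 1, 2, 3]), np.array([4, 5, 6, 7])]
df = pd.DataFrame({"arr": pd.Series(obj)})

# Write to an in-memory CSV and look at the raw text.
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)

# Reading it back yields strings, not arrays.
round_trip = pd.read_csv(io.StringIO(csv_text))
first = round_trip["arr"].iloc[0]
print(type(first), repr(first))
```

The arrays are stored as their string representation, so recovering numeric data afterwards requires parsing each cell.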
Let's compare some construction times.
Make larger arrays of the 2 types:
In [25]: a0 = np.ones([1000,4],int)
In [26]: a1 = np.empty(1000, object)
In [27]: a1[:] = [np.ones(4,int) for _ in range(1000)]
# a1[:] = list(a0) # faster
First make a DataFrame:
In [28]: timeit pd.DataFrame(a0)
136 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is the same time as for Out[3]; apparently it is just the overhead of making a DataFrame with a 2d array (of any size) as values.
Making a series as you did:
In [29]: timeit pd.Series(list(a0))
434 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [30]: timeit pd.Series(a0.tolist())
315 µs ± 5.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both of these take longer than for the small a, reflecting the iterative nature of the creation.
And with the 1d object array:
In [31]: timeit pd.Series(a1)
103 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is the same as for the small 1d array. As with In[28], I think there's just the overhead of creating a Series object and then assigning it an unchanged values array.
However, constructing the a1 array in the first place is the slower step.
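To check this on your own machine, a rough sketch like the following times the whole path, including construction of the object array. The function names via_list and via_object_array are made up for this example, and the numbers will differ from the ones above depending on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

a0 = np.ones([1000, 4], int)

def via_list():
    # Series straight from a list of row arrays.
    return pd.Series(list(a0))

def via_object_array():
    # Include the cost of building the 1d object array.
    a1 = np.empty(len(a0), dtype=object)
    a1[:] = list(a0)
    return pd.Series(a1)

# Best of 3 repeats, 100 calls each, reported per call.
timings = {
    name: min(timeit.repeat(fn, number=100, repeat=3)) / 100
    for name, fn in [("list", via_list), ("object array", via_object_array)]
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```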
An object array like a1 is in many ways just like a list: it contains pointers to objects elsewhere in memory. It can be useful if the elements differ in type (e.g. include strings or None), but computationally it is not the equivalent of a 2d array.
In sum, if the source array really is a 1d object dtype array, you can quickly create a Series from it. If it is really a 2d array, you'll need, in some way or other, to convert it to a list or 1d object array first.
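Putting the two cases together, a helper along these lines could cover both. rows_to_series and the column names are hypothetical, and this is one possible sketch rather than the only way to do it.

```python
import numpy as np
import pandas as pd

def rows_to_series(arr):
    """Hypothetical helper: one Series element per entry of the first axis."""
    arr = np.asarray(arr)
    if arr.ndim == 1:
        # Already 1d (e.g. an object dtype array): pass it straight through.
        return pd.Series(arr)
    # 2d or higher: wrap each row in a 1d object array first.
    out = np.empty(arr.shape[0], dtype=object)
    out[:] = list(arr)
    return pd.Series(out)

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

# Use the Series of arrays as one column of a DataFrame.
df = pd.DataFrame({"id": [10, 20]})
df["arr"] = rows_to_series(a)
print(df)
print(type(df["arr"].iloc[0]))
```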
Answered By - hpaulj