Issue
I need to convert a huge list of tuples of bytes to a numeric numpy.ndarray in a data processing task. The list has more than 10 million entries, and each entry is a tuple of three 450-byte series, as in the example below:
[
(
b'\n\x0f\n\t\x0c\x00\x00\x01\x07\x06...', # 450 bytes series
b'\x00\x0e\x00\x06\x07\x0c\n\x0e\x07...', # also 450 bytes
b'\x05\x0e\x07\t\x04\x01\x05\x07\x08...',
), # one tuple of 3 byte series
(...), # more tuples like this
... # the number of tuples is up to 10M
]
What I hope to get is a numpy.uint8 array with the shape (10M, 3, 450), in which each uint8 element corresponds to one byte of a series (e.g. b'\n\x0f\n\t' becomes [10, 15, 10, 9]).
Put simply, I'm looking for the inverse of numpy.ndarray.tobytes, applied element-wise.
Of course this can be done with a simple for loop in plain Python, converting each byte series to a 1-dimensional array with numpy.fromiter one by one. But given the amount of data, I want NumPy to do as much of the work as possible. So what I am after is a single NumPy function, or a few NumPy calls combined, without any plain-Python for loop.
P.S. I've also tried combining numpy.fromiter with np.frompyfunc and applying it to an object array of bytes created with np.array(..., dtype=object), but it still doesn't seem fast enough.
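For reference, the plain-loop baseline described above might look roughly like the sketch below. The sample data, the 4-byte series length, and the helper name convert_with_loop are made up for illustration; the real series are 450 bytes long.

```python
import numpy as np

# Hypothetical sample: 2 tuples of 3 series, 4 bytes each (450 in the real data).
data = [
    (b'\n\x0f\n\t', b'\x00\x0e\x00\x06', b'\x05\x0e\x07\t'),
    (b'\x01\x02\x03\x04', b'\x05\x06\x07\x08', b'\t\n\x0b\x0c'),
]


def convert_with_loop(records, series_len):
    """Convert a list of 3-tuples of bytes to a uint8 array, one series at a time."""
    out = np.empty((len(records), 3, series_len), dtype=np.uint8)
    for i, record in enumerate(records):
        for j, series in enumerate(record):
            # Iterating over a bytes object yields integers in the range 0..255.
            out[i, j] = np.fromiter(series, dtype=np.uint8, count=series_len)
    return out


baseline = convert_with_loop(data, series_len=4)
```

Every series goes through the Python interpreter here, which is the per-element overhead I would like to avoid.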
Solution
I think np.frombuffer might be what you are looking for:
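import numpy as np

# Shortened stand-in sample: the real series are 450 bytes each.
data = [
    (
        b'\n\x0f\n\t\x0c\x00\x00\x01\x07\x06',  # 10 bytes
        b'\x00\x0e\x00\x06\x07\x0c\n\x0e\x07',  # 9 bytes, zero-padded to 10 below
        b'\x05\x0e\x07\t\x04\x01\x05\x07\x08',  # 9 bytes, zero-padded to 10 below
    ),  # one tuple of 3 byte series
    (
        b'\n\x0f\n\t\x0c\x00\x00\x01\x07\x06',
        b'\x00\x0e\x00\x06\x07\x0c\n\x0e\x07',
        b'\x05\x0e\x07\t\x04\x01\x05\x07\x08',
    ),
]

# Build a fixed-width bytes array (the width is the longest series, 10 here) and flatten it.
data_flat = np.array(data, dtype=np.bytes_).reshape(-1)
# Reinterpret the underlying buffer as uint8 and restore the (tuples, 3, length) shape.
# For the real data the shape would be (-1, 3, 450).
np.frombuffer(data_flat, dtype=np.uint8).reshape(2, 3, 10)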
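If every series really has the same length, a variant that might be worth timing is to concatenate all the bytes first and call np.frombuffer on the result. This is only a sketch under that assumption; the sample data and the 4-byte series_len are made up for illustration, and I have not benchmarked it against the version above.

```python
from itertools import chain

import numpy as np

# Hypothetical sample: 2 tuples of 3 series, 4 bytes each (450 in the real data).
data = [
    (b'\n\x0f\n\t', b'\x00\x0e\x00\x06', b'\x05\x0e\x07\t'),
    (b'\x01\x02\x03\x04', b'\x05\x06\x07\x08', b'\t\n\x0b\x0c'),
]
series_len = 4  # assumed to be identical for every series

# Concatenate every series, in order, into a single bytes object.
joined = b"".join(chain.from_iterable(data))

# Reinterpret the bytes as uint8; reshape raises if the total size does not fit.
arr = np.frombuffer(joined, dtype=np.uint8).reshape(len(data), 3, series_len)

# Round trip: tobytes on one row gives back the original series.
assert arr[0, 0].tobytes() == data[0][0]
```

The array returned here is a read-only view of the joined bytes object, so call arr.copy() if you need to modify it.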
I hope this helps!
Answered By - Axel Donath