Issue
I want to write a NumPy array to an Avro file. Here's a small example of the array:
import numpy as np
import random
np_array = np.zeros((4,3), dtype=np.float32)
for i in range(4):
    for j in range(3):
        np_array[i, j] = random.gauss(0, 1)
print(np_array)
Output:
[[ 0.6490377 0.29544145 -1.109375 ]
[ 1.0881975 -0.39123887 -0.36691198]
[-1.2226632 0.8332004 0.2686829 ]
[ 1.5417658 0.4520132 -0.03081623]]
In my use case the NumPy array has 5 million rows and 128 columns, so if possible I want to write it directly to Avro without spending extra memory on converting it to a dictionary and/or a pandas DataFrame first.
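To put that size in perspective, here is a rough back-of-the-envelope estimate (a sketch, not a measurement). The 4 bytes per value follow from the float32 dtype used above; the variable names are only illustrative.

```python
# Rough size of the raw array data for the use case described above.
rows, cols = 5_000_000, 128
bytes_per_float32 = 4  # np.float32 stores each value in 4 bytes

array_bytes = rows * cols * bytes_per_float32
print(f"{array_bytes / 1e9:.2f} GB")  # 2.56 GB
```

Any intermediate copy (a dictionary of Python floats, a DataFrame) comes on top of this, which is why writing straight from the array is attractive.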
Solution
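I answered my own question! This solution writes a 2D NumPy array to Avro without converting it to a dictionary or DataFrame first. The schema is an Avro array of floats, and each row of the NumPy array is written as one record.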
import numpy as np
import random
np_array = np.zeros((4,3), dtype=np.float32)
for i in range(4):
    for j in range(3):
        np_array[i, j] = random.gauss(0, 1)
print(np_array)
Output:
[[ 0.6490377 0.29544145 -1.109375 ]
[ 1.0881975 -0.39123887 -0.36691198]
[-1.2226632 0.8332004 0.2686829 ]
[ 1.5417658 0.4520132 -0.03081623]]
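import fastavro
# Each record is one row of the array: an Avro array of 32-bit floats.
schema_dict = {
    "doc": "test",
    "name": "test",
    "namespace": "test",
    "type": "array",
    "items": "float"
}
schema = fastavro.parse_schema(schema_dict)
filepath = "<filepath>"  # placeholder: replace with the path to your Avro file
# Iterating over a 2D array yields its rows, so each row is written as one record.
with open(filepath, "wb") as f:
    fastavro.writer(f, schema, np_array)
# Read the file back, one record (row) at a time.
with open(filepath, "rb") as f:
    reader = fastavro.reader(f)
    for record in reader:
        print(record)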
Output:
[ 0.6490377 0.29544145 -1.109375 ]
[ 1.0881975 -0.39123887 -0.36691198]
[-1.2226632 0.8332004 0.2686829 ]
[ 1.5417658 0.4520132 -0.03081623]
Answered By - gasbag_1