Issue
We've been trying to speed up our API call. We originally used pandas, then moved to numpy (which we believe is faster), and now we're applying numba to speed it up even further. I've managed to apply numba to my numeric arrays without trouble, but I'm still struggling with the string (nominal) array. I can't find the answers I need on numba's website or here on Stack Overflow.
Below is a simplified version of the function, showing the kind of operations I perform on my string array. Many posts here state that numba now works with strings, so I believe there should be a way to make my code work, since I'm only doing simple data manipulations.
# Loading packages
import numpy as np
from numba import jit
# Versions
# python 3.9.12
# numba 0.55.1
# numpy 1.21.5
# Creating a toy array
input_array = np.array([np.nan,'C','P'], dtype="<U11")
print(input_array) # ['nan' 'C' 'P']
# Starting with the python version of the code to show what the aim is:
def foo_python(input_array):
    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")
    # 1st procedure - replace missing values with "Miss"
    input_array[input_array == 'nan'] = "Miss"
    # 2nd procedure - map strings to numbers
    output_array[0] = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}.get(
        str(input_array[0]), input_array[0]
    )
    # 3rd procedure - checking if value belongs to a list of strings
    input_array[1] = np.where(input_array[1] in ['A','B','C'], input_array[1], 'Other')
    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == 'K', 1, 0)
    return output_array
foo_python(input_array) # array([-1.0000000e+00, -2.6711958e+07, 0.0000000e+00], dtype=float32)
# output_array[1] is never assigned, so the middle value is whatever np.empty left in memory
# Numba version:
@jit(nopython=True)
def foo_numba(input_array):
    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")
    # 1st procedure - replace missing values with "Miss"
    input_array[input_array == "nan"] = "Miss"
    # 2nd procedure - map strings to numbers
    output_array[0] = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}.get(
        str(input_array[0]), input_array[0]
    )
    # 3rd procedure - checking if value belongs to a list of strings
    input_array[1] = np.where(
        input_array[1] in ["A", "B", "C"], input_array[1], "Other"
    )
    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == "K", 1, 0)
    return output_array
foo_numba(input_array)
When I apply @jit(nopython=True) on top of it, these are the errors I get:
For the 1st procedure:
TypingError: No implementation of function Function() found for signature:
setitem(array([unichr x 11], 1d, C), Literalbool, Literalstr)
For the 2nd procedure:
TypingError: - Resolution failure for literal arguments: No implementation of function Function(<function impl_get at 0x000002926D59C310>) found for signature:
impl_get(DictType[unicode_type,float64]<iv=None>, unicode_type, [unichr x 11])
For the 3rd procedure:
TypingError: No implementation of function Function(<function where at 0x0000029264686C10>) found for signature:
where(bool, [unichr x 11], Literalstr)
For the 4th procedure:
It works! So I suspect the error in the 3rd procedure may come not from np.where itself, but from the type difference between input_array[1] and 'Other'?
I've tried replacing the first procedure with a for loop, but is that really the best or only solution?
Solution
The given code can be made to run with numba.
- The 1st procedure has a typing issue in the conditional indexing but it can be worked around by using a loop instead of conditional indexing.
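- The 2nd procedure fails on the call to .get(): the error signature shows the dict typed as DictType[unicode_type,float64], while the default value passed to .get() is a string element ([unichr x 11]) rather than a float. Numba requires dict keys and values to be strictly typed. The code below builds a numba typed Dict explicitly, which is more cumbersome than a Python dict, and sets a float default before the lookup.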
- The 3rd procedure has a similar typing issue in np.where() but can be worked around by modifying the code a bit.
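As a possible simplification of the 2nd procedure, the lookup can be written as a membership test followed by indexing, with a float default so that every return path has the same type. This is only a sketch: `map_code` is a hypothetical helper that does not appear in the original post, the NaN default is an assumption, and the `try`/`except` fallback is there only so the snippet also runs (uncompiled) where numba is not installed.

```python
import math

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without numba.
    def njit(func):
        return func


@njit
def map_code(key, default):
    # Hypothetical helper: map a string label to its numeric code.
    # All values are floats, and so is the default, so the return type is uniform.
    codes = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}
    if key in codes:
        return codes[key]
    return default


hit = map_code("Miss", math.nan)     # known label
miss = map_code("unseen", math.nan)  # unknown label falls back to the default
print(hit, miss)
```

When the label comes from a string array, it would be converted first, as the question already does with str(input_array[0]).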
# Loading packages
import numpy as np
from numba import jit
from numba.core import types
from numba.typed import Dict
# Versions
# python 3.9.12
# numba 0.55.1
# numpy 1.21.5
# Creating a toy array
input_array = np.array([np.nan,'C','P'], dtype='<U11')
print(input_array) # ['nan' 'C' 'P']
# Starting with the python version of the code to show what the aim is:
def foo_python(input_array):
    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")
    # 1st procedure - replace missing values with "Miss"
    input_array[input_array == 'nan'] = "Miss"
    # 2nd procedure - map strings to numbers
    output_array[0] = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}.get(
        str(input_array[0]), input_array[0]
    )
    # 3rd procedure - checking if value belongs to a list of strings
    input_array[1] = np.where(input_array[1] in ['A','B','C'], input_array[1], 'Other')
    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == 'K', 1, 0)
    return output_array
foo_python(input_array) # array([-1.0000000e+00, -2.6711958e+07, 0.0000000e+00], dtype=float32)
# output_array[1] is never assigned, so the middle value is whatever np.empty left in memory
# Numba version:
@jit(nopython=True)
def foo_numba(input_array):
    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")
    # 1st procedure - replace missing values with "Miss"
    for i,s in enumerate(input_array):
        if s == "nan":
            input_array[i] = "Miss"
    # 2nd procedure - map strings to numbers
    d = Dict.empty(
        key_type=types.unicode_type,
        value_type=types.float64,
    )
    d["False"] = -0.01960485
    d["True"] = 1.1470174
    d["Miss"] = -1.0
    output_array[0] = 0 # default when no key matches; must be a float, not a string
    for k in d.keys():
        if input_array[0] == k:
            output_array[0] = d[k]
    # 3rd procedure - checking if value belongs to a list of strings
    if input_array[1] not in ['A','B','C']:
        input_array[1] = "Other"
    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == "K", 1, 0)
    return output_array
foo_numba(input_array)
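One detail worth noting about both versions: the 3rd procedure writes to input_array, so output_array[1] is never assigned, and np.empty does not initialize memory. That is why the Python run shows an arbitrary value such as -2.6711958e+07 in the middle slot. A minimal sketch of one way to make unused slots deterministic, assuming NaN is an acceptable placeholder for the use case:

```python
import numpy as np

# np.empty only allocates memory; slots that are never written hold arbitrary values.
# np.full gives every slot a known starting value instead (NaN is an assumed choice).
output_array = np.full(3, np.nan, dtype=np.float32)

output_array[0] = -1.0  # what the 2nd procedure produces for "Miss"
output_array[2] = 0.0   # what the 4th procedure produces for 'P'

print(output_array)
```

With this change the middle element stays NaN until some procedure writes to it, which makes comparing the Python and numba outputs straightforward.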
Typed Dict reference:
https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#typed-dict
Answered By - mpw2