Issue
QUESTION: How can I use ONNX operators to do string replacements with regular expressions?
I am trying to export a scikit-learn machine learning pipeline to the Open Neural Network Exchange (ONNX) format. The pipeline takes text as input. Many of the pipeline steps are already covered by the standard, such as TfidfVectorizer and the TruncatedSVD transformer. However, the first pipeline step is a custom transformer that applies a set of regular-expression substitutions to the input text.
When adding a custom transformer, the sklearn-onnx docs suggest writing a custom shape function and a custom converter function. The converter function in particular must be built by combining predefined operators from the ONNX standard. However, as far as I can tell, even basic string manipulation is not possible with the operators that exist.
One of the regular expression powered replacements that I want to make is a unit conversion, for example:
12m -> 12 meters
With Python's re package this is trivial:
import re
my_string = "The Empire State Building is 443m tall."
meters_pattern = re.compile("(?<=[0-9])m ")
my_transformed_string = re.sub(meters_pattern, " meters ", my_string)
>>> print(my_transformed_string)
The Empire State Building is 443 meters tall.
However, I cannot conceive of a way to do this with the available ONNX operators. Here's what I've thought to try:
- Use a regular expression operator in a similar manner to the Python example above.
Problem: ONNX does not have a regex operator.
- Evaluate the input string sequentially, one character at a time. If an "m" follows a digit, change the string as described above.
Problem: This approach requires a comparison of strings: does "this character in the string" equal "m"? However, the existing OnnxEqual operator does not support string comparison.
- Translate the input string, character by character, to its ASCII decimal equivalent and then apply the character-by-character check from the previous approach.
Problem: ONNX does not have a translate-like operator (like GNU tr) for strings. ONNX also does not support casting strings that are not strictly numeric with OnnxCast.
- Use the OnnxUnique operator and its inverse_indices output to translate the input string to something approximating each character's ASCII decimal value.
Problem: This requires prepending a key string \t\n\r !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_``abcdefghijklmnopqrstuvwxyz{|}~ to the input string (so that the numerical values found by OnnxUnique's inverse_indices output have a consistent definition) and splitting the input string into a sequence of tensors of one character each. Unfortunately, OnnxSplit errors when trying to split a string tensor (see code example below), and OnnxSequenceInsert does not concatenate strings into a single-element tensor; it only combines a sequence of single-element tensors into one tensor with multiple elements.
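For reference, the logic that the last three approaches are trying to express can be sketched in plain Python. This is only an emulation of the idea, not ONNX code: unique_inverse, encode and expand_meters are hypothetical helper names, the key is assumed to be the tab, newline and carriage return characters followed by printable ASCII, and the input is assumed to contain only characters from that key.

```python
# Hypothetical key: whitespace plus printable ASCII, prepended to every input
# so that each character always receives the same index.
KEY = "\t\n\r" + "".join(chr(c) for c in range(32, 127))

def unique_inverse(chars):
    """Emulate a sorted unique: return the unique values and, for each
    input element, its index into those unique values."""
    uniques = sorted(set(chars))
    position = {c: i for i, c in enumerate(uniques)}
    return uniques, [position[c] for c in chars]

def encode(text):
    """Map each character of `text` to a stable integer code."""
    _, inverse = unique_inverse(list(KEY + text))
    return inverse[len(KEY):]  # drop the codes that belong to the key

def expand_meters(text):
    """Character-by-character version of
    re.sub("(?<=[0-9])m ", " meters ", text)."""
    uniques, _ = unique_inverse(list(KEY))
    code = {c: i for i, c in enumerate(uniques)}
    digits = {code[d] for d in "0123456789"}
    codes = encode(text)
    out = []
    i = 0
    while i < len(codes):
        is_unit = (
            i > 0
            and codes[i - 1] in digits
            and codes[i] == code["m"]
            and i + 1 < len(codes)
            and codes[i + 1] == code[" "]
        )
        if is_unit:
            out.extend(code[c] for c in " meters ")
            i += 2  # skip the "m" and the space that follows it
        else:
            out.append(codes[i])
            i += 1
    return "".join(uniques[c] for c in out)

print(expand_meters("The Empire State Building is 443m tall."))
```

Each step here (splitting a string into characters, comparing them, and joining them back together) is exactly what I could not find an ONNX operator for.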
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from skl2onnx import to_onnx, update_registered_converter
from skl2onnx.common.data_types import StringTensorType
from skl2onnx.algebra.onnx_ops import OnnxSplit, OnnxConstant
from onnxruntime import InferenceSession

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit_transform(self, X, y=None):
        return re.sub("(?<=[0-9])m ", " meters ", X)

def shape_function(operator):
    input = StringTensorType([1])
    output = StringTensorType([None, 1])
    operator.inputs[0].type = input
    operator.outputs[0].type = output

def converter_function(scope, operator, container):
    op = operator.raw_operator
    opv = container.target_opset
    out = operator.outputs
    X = operator.inputs[0]
    one_tensor = OnnxConstant(value_int=1, op_version=opv)
    string_tensor = OnnxConstant(value_strings=["ab"], op_version=opv)
    # Attempt to split a constant string tensor; this is the step that fails.
    string_split_tensor = OnnxSplit(string_tensor, one_tensor, op_version=opv, output_names=out[:1])
    string_split_tensor.add_to(scope, container)

update_registered_converter(MyTransformer, "MyTransformer", shape_function, converter_function)

my_transformer = MyTransformer()
onnx_model = to_onnx(my_transformer, initial_types=[["X", StringTensorType([None, 1])]])

test_string = "The Empire State Building is 443m tall."
sess = InferenceSession(onnx_model.SerializeToString())
output = sess.run(None, {"X": np.array([test_string])})
Yields:
2022-08-16 12:35:46.235861185 [W:onnxruntime:, graph.cc:106 MergeShapeInfo] Error merging shape info for output. 'variable' source:{1} target:{,1}. Falling back to lenient merge.
2022-08-16 12:35:46.237767860 [E:onnxruntime:, inference_session.cc:1530 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::__cxx11::basic_string<char>&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : string tensor can not use pre-allocated buffer
How is one to properly manipulate strings with the available ONNX operators?
Solution
I asked the ONNX developers this question, and as of August 2022 it is simply not possible to perform regex replacements with ONNX operators. See the full thread here: https://github.com/onnx/onnx/issues/4450
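One possible workaround, which is not taken from the linked thread, is to keep the regex step outside the ONNX graph and run it in ordinary Python before the text reaches the exported model. The sketch below assumes that the remaining pipeline steps export cleanly and that the model input is named "X" as in the question; preprocess is a hypothetical helper name.

```python
import re

# Same pattern as in the question: an "m" that follows a digit and
# precedes a space.
METERS_PATTERN = re.compile(r"(?<=[0-9])m ")

def preprocess(texts):
    """Apply the unit-conversion substitution to each input string."""
    return [METERS_PATTERN.sub(" meters ", text) for text in texts]

cleaned = preprocess(["The Empire State Building is 443m tall."])
print(cleaned[0])  # The Empire State Building is 443 meters tall.

# The cleaned strings would then be fed to the exported model, for example
# (assuming `sess` is an InferenceSession and numpy is imported as np):
# sess.run(None, {"X": np.array(cleaned).reshape(-1, 1)})
```

The trade-off is that the preprocessing has to be reimplemented in every runtime that consumes the model, since it no longer travels with the ONNX file.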
Answered By - NolantheNerd