Issue
I have tried writing a Python program that saves tf.keras.layers.TextVectorization to disk and loads it back, following the answer to "How to save TextVectorization to disk in tensorflow?".
The TextVectorization layer built from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'.
For example, if I set output_sequence_length=10 and output_mode='int', TextVectorization is expected to output a vector of length 10 for a given text; see vectorizer and new_v2 in the code below.
However, if TextVectorization's output_mode='int' is taken from the saved config, it does not output a vector of length 10 (it actually has length 9, the real length of the sentence; it seems output_sequence_length is not applied). See the object new_v1 in the code below.
The interesting thing is that I have compared from_disk['config']['output_mode'] and 'int', and they are equal to each other.
import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle
# In[]
max_len = 10 # Sequence length to pad the outputs to.
text_dataset = tf.data.Dataset.from_tensor_slices([
    "I like natural language processing",
    "You like computer vision",
    "I like computer games and computer science"])
# Fit a TextVectorization layer
VOCAB_SIZE = 10 # Maximum vocab size.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode='int',
    output_sequence_length=max_len
)
vectorizer.adapt(text_dataset.batch(64))
# In[]
#print(vectorizer.get_vocabulary())
#print(vectorizer.get_config())
#print(vectorizer.get_weights())
# In[]
# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()},
            open("./models/tv_layer.pkl", "wb"))
# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.
from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))
new_v1 = tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode=from_disk['config']['output_mode'],
    output_sequence_length=from_disk['config']['output_sequence_length'],
)
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v1.set_weights(from_disk['weights'])
new_v2 = tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode='int',
    output_sequence_length=from_disk['config']['output_sequence_length'],
)
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v2.set_weights(from_disk['weights'])
print ("*"*10)
# In[]
test_sentence="Jack likes computer scinece, computer games, and foreign language"
print(vectorizer(test_sentence))
print (new_v1(test_sentence))
print (new_v2(test_sentence))
print(from_disk['config']['output_mode']=='int')
Here are the print() outputs:
**********
tf.Tensor([ 1 1 3 1 3 11 12 1 10 0], shape=(10,), dtype=int64)
tf.Tensor([ 1 1 3 1 3 11 12 1 10], shape=(9,), dtype=int64)
tf.Tensor([ 1 1 3 1 3 11 12 1 10 0], shape=(10,), dtype=int64)
True
Does anyone know why?
Solution
The bug is fixed by the PR at https://github.com/keras-team/keras/pull/15422.
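As a related note, the layer can also be rebuilt directly from the pickled config via from_config instead of copying individual constructor arguments. Below is a minimal sketch assuming a TensorFlow/Keras version that includes the fix above and the ./models/tv_layer.pkl file produced by the code in the question; the name new_v3 is just for illustration.
import pickle
import tensorflow as tf

# Rebuild the layer from the full saved config, then restore the learned vocabulary.
from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))
new_v3 = tf.keras.layers.TextVectorization.from_config(from_disk['config'])
# Call `adapt` with some dummy data before `set_weights` (see the comment in the question).
new_v3.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v3.set_weights(from_disk['weights'])

# On a fixed version this should print a length-10 vector, matching `vectorizer` above.
print(new_v3("Jack likes computer scinece, computer games, and foreign language"))
With the fix in place, passing output_mode through the saved config should behave the same as hardcoding output_mode='int', as new_v2 does in the question.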
Answered By - lankuohsing