Issue
How to use Keras Tokenizer method fit_on_texts? How does it differ from fit_on_sequences?
Solution
fit_on_texts, used in conjunction with texts_to_matrix, produces the one-hot encoding for a text; see https://www.tensorflow.org/text/guide/word_embeddings
fit_on_texts
An example of using fit_on_texts:
from keras.preprocessing.text import Tokenizer
text='check check fail'
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
tokenizer.word_index
will produce {'check': 1, 'fail': 2}
Note that we pass [text] as the argument because the input must be a list in which each element is treated as one text (document), not as a token. A bare string would be iterated character by character. The input can also be a generator of strings or a list of lists of strings.
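To make the list requirement concrete, here is a rough pure-Python approximation of how a frequency-ranked word index can be built. It is a sketch and not the Keras source: the names FILTERS, split_words and build_word_index are hypothetical, and the filter string is only assumed to resemble the tokenizer's default.

```python
from collections import Counter

# Hypothetical stand-in for the tokenizer's punctuation filter (an assumption).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by frequency; index 0 stays unused, so ranks start at 1."""
    counts = Counter()
    for text in texts:  # each element is treated as one text
        counts.update(split_words(text))
    # most_common() sorts by count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


text = "check check fail"
as_list = build_word_index([text])  # one text in a list
as_string = build_word_index(text)  # bare string: iterated character by character

print(as_list)  # {'check': 1, 'fail': 2}
print("check" in as_string)  # False
```

With the bare string, every character becomes its own "text", so the index ends up holding single letters instead of words.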
Passing a text generator as input is memory efficient. Here is an example: (1) define a generator that yields individual texts from an iterable of collections of texts
def text_generator(texts_generator):
    for texts in texts_generator:
        for text in texts:
            yield text
(2) pass the generator it returns to fit_on_texts
# texts_generator is any iterable of collections of texts
tokenizer.fit_on_texts(text_generator(texts_generator))
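One caveat with generators is that they can be consumed only once. The sketch below uses plain Python and no Keras call; the batches list is a hypothetical stand-in for data that would normally be read lazily, and list(gen) stands in for a fitting pass.

```python
def text_generator(batches):
    """Yield texts one at a time from an iterable of batches."""
    for batch in batches:
        for text in batch:
            yield text


# Hypothetical in-memory batches; in practice these might come from files.
batches = [["i love my dog", "i love my cat"], ["you love my dog"]]

gen = text_generator(batches)
first_pass = list(gen)   # consumes the generator, as a fitting pass would
second_pass = list(gen)  # nothing is left to yield

print(len(first_pass), len(second_pass))  # 3 0
```

So if the same texts are needed again after fitting, for example to convert them to sequences, a fresh generator has to be created with another call to text_generator.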
fit_on_texts is used before calling texts_to_matrix, which produces the one-hot encoding for the original set of texts.
num_words argument
Passing the num_words argument to the tokenizer limits the representation to the most frequent words. Only words whose index is below num_words are kept, and indices start at 1, which is why the examples pass the desired count plus one. First, with num_words = 1 + 1, we encode only the single most frequent word, love:
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]
tokenizer = Tokenizer(num_words = 1+1)
tokenizer.fit_on_texts(sentences)
tokenizer.texts_to_sequences(sentences) # [[1], [1], [1]]
Second, with num_words = 100 + 1, we encode the 100 most frequent words:
tokenizer = Tokenizer(num_words = 100+1)
tokenizer.fit_on_texts(sentences)
tokenizer.texts_to_sequences(sentences) # [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
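The "+1" can be seen in a small pure-Python sketch of the filtering step. This is an approximation, not the Keras implementation: to_sequences is a hypothetical helper, the word index is written out by hand to match the output above, and the sentences are assumed to be already lowercased and stripped of punctuation.

```python
# Word index matching the example above (most frequent word gets index 1).
word_index = {"love": 1, "my": 2, "i": 3, "dog": 4, "cat": 5, "you": 6}

# The three sentences after lowercasing and punctuation removal.
tokenized = [
    ["i", "love", "my", "dog"],
    ["i", "love", "my", "cat"],
    ["you", "love", "my", "dog"],
]


def to_sequences(tokenized, word_index, num_words=None):
    """Map words to indices, dropping any index that is >= num_words."""
    sequences = []
    for words in tokenized:
        seq = []
        for word in words:
            idx = word_index.get(word)
            if idx is None:
                continue  # unknown word
            if num_words is not None and idx >= num_words:
                continue  # outside the kept vocabulary
            seq.append(idx)
        sequences.append(seq)
    return sequences


print(to_sequences(tokenized, word_index, num_words=1 + 1))
# [[1], [1], [1]]
print(to_sequences(tokenized, word_index, num_words=1))
# [[], [], []]
print(to_sequences(tokenized, word_index, num_words=100 + 1))
# [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
```

Because index 0 is never assigned to a word, num_words = 1 keeps nothing at all, and num_words = n + 1 is needed to keep the n most frequent words.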
fit_on_sequences
fit_on_sequences works on "sequences", i.e., lists of integer word indices. It is used before calling sequences_to_matrix:
from tensorflow.keras.preprocessing.text import Tokenizer
test_seq = [[1,2,3,4,5,6]]
tok = Tokenizer(num_words=10)
tok.fit_on_sequences(test_seq)
tok.sequences_to_matrix(test_seq)
Producing
array([[0., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
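The shape of that result can be reproduced with a short pure-Python sketch of a binary encoding. It is an illustration under assumptions, not the Keras code: sequences_to_binary_matrix is a hypothetical helper and it returns nested lists where Keras returns a NumPy array.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Return one row per sequence with 1.0 at each index that occurs in it."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for idx in seq:
            if idx < num_words:  # indices outside the vocabulary are ignored
                row[idx] = 1.0
        matrix.append(row)
    return matrix


test_seq = [[1, 2, 3, 4, 5, 6]]
result = sequences_to_binary_matrix(test_seq, num_words=10)
print(result)
# [[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]]
```

Column 0 stays at zero because word indices start at 1, and columns 7 to 9 stay at zero because those indices do not occur in the sequence.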
Answered By - kiriloff