Issue
I'm trying to build a tokenizer following TensorFlow's tutorial https://www.tensorflow.org/text/guide/subwords_tokenizer. I'm basically doing the same thing, only with a different dataset. The dataset in question is a txt file in which the first two columns are an English sentence (or word) and its Italian translation; here is a snippet:
Hi. Ciao! CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #607364 (Cero)
Hi. Ciao. CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4522287 (Guybrush88)
Run! Corri! CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906347 (Guybrush88)
Run! Corra! CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906348 (Guybrush88)
Run! Correte! CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906350 (Guybrush88)
Who? Chi? CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #2126402 (Guybrush88)
It can be downloaded at http://www.manythings.org/anki/.
I've preprocessed it and turned the English and Italian sentences into TensorFlow datasets to be fed to the tokenizer, as illustrated in this code:
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab
import tensorflow_text as tf_text
import os
import numpy as np
eng_dataset, ita_dataset = np.genfromtxt('ita_eng_dataset.txt',
                                         usecols=(0, 1),
                                         encoding='utf-8',
                                         unpack=True,
                                         dtype='str')
eng_dataset_tensor = tf.convert_to_tensor(eng_dataset)
ita_dataset_tensor = tf.convert_to_tensor(ita_dataset)
eng_tf_dataset = tf.data.Dataset.from_tensor_slices(eng_dataset_tensor)
ita_tf_dataset = tf.data.Dataset.from_tensor_slices(ita_dataset_tensor)
The problems arise when I try to feed it to bert_vocab_from_dataset:
bert_tokenizer_params = dict(lower_case=True)
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]
bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size=8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)
eng_vocab = bert_vocab.bert_vocab_from_dataset(eng_tf_dataset, **bert_vocab_args)
ita_vocab = bert_vocab.bert_vocab_from_dataset(ita_tf_dataset, **bert_vocab_args)
But the results are wrong:
print(eng_vocab[:20])
print(ita_vocab[1980:2000])
print(len(eng_vocab), len(ita_vocab))
which outputs
['about', 'breakfast', 'coffee', 'correct', 'finally', 'heat', 'japanese', 'large', 'lie', 'old', 'peel', 'science', 'step', 'swimming', 'work', '##ans', '##b', '##der', '##ins', '##ish']
['##omfortable', '##ong', '##ony', '##op', '##ouse', '##ply', '##rch', '##rous', '##rove', '##roved', '##sists', '##tained', '##ten', '##unted', '##val', '##ze', 'advice', 'agitated', 'amazed', 'argued']
665 2413
As you can see, the Italian vocabulary contains English words, and both vocabularies are very small (this may be due to the dataset, but it seems odd for them to be this small, with fewer than 1000 entries for English).
I also tried batching the input dataset as in the TensorFlow tutorial, but it gave the same results.
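For reference, the batching I mean is roughly what the tutorial does (a minimal sketch; the batch size of 1000 is the tutorial's value, not something tuned for this data):
# Group the sentences into batches before building the vocabulary,
# as done in the subwords tutorial
eng_batched = eng_tf_dataset.batch(1000).prefetch(2)
ita_batched = ita_tf_dataset.batch(1000).prefetch(2)
eng_vocab = bert_vocab.bert_vocab_from_dataset(eng_batched, **bert_vocab_args)
ita_vocab = bert_vocab.bert_vocab_from_dataset(ita_batched, **bert_vocab_args)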
I'm using Python 3.8 on PyCharm with Windows 11 and TensorFlow 2.10.
Solution
Solved: the problem was simply that np.genfromtxt does not use '\t' as the delimiter by default, so the tab-separated sentences were being split on spaces.
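For clarity, the corrected call just passes the tab delimiter explicitly (a minimal sketch, keeping the same filename and arguments as in the question):
# Split on tabs so full sentences aren't broken apart on spaces
eng_dataset, ita_dataset = np.genfromtxt('ita_eng_dataset.txt',
                                         delimiter='\t',
                                         usecols=(0, 1),
                                         encoding='utf-8',
                                         unpack=True,
                                         dtype='str')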
Answered By - Niccolò Tiezzi