Issue
For example, the tensor below was produced by an English tokenizer:
tensor([[ 2992, 1852, 9439, ..., 2610, 1704, 29189],
[ 1852, 9439, 7, ..., 1704, 29189, 23223],
[ 9439, 7, 2367, ..., 29189, 23223, 838],
...,
[ 12, 7469, 28844, ..., 2973, 16, 73],
[ 7469, 28844, 28469, ..., 16, 73, 735],
[28844, 28469, 191, ..., 73, 735, 4482]])
How can I transform it back into the original English text (using PyTorch)?
Solution
The method you're looking for is tokenizer.decode, which is applied to a sequence of token ids to yield the original source text. In your case you have a batch of sentences (i.e. a sequence of sequences), so you'll need to apply the function to each row of your tensor, i.e.
decoded = [tokenizer.decode(x) for x in xs]
where tokenizer is your tokenization model and xs is the tensor you want to decode.
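To make the row-by-row decoding concrete, here is a minimal sketch. ToyTokenizer and its four-word vocabulary are hypothetical stand-ins for your real tokenizer (they are not part of any library), and xs is a plain nested list standing in for the tensor. Only the decode call and the list comprehension mirror the answer above.

```python
# Hypothetical stand-in for a real tokenizer; only the decode() interface matters here.
class ToyTokenizer:
    def __init__(self, vocab):
        # vocab maps token id -> token string
        self.vocab = vocab

    def decode(self, ids):
        # Accept any iterable of integer-like ids (a list, or one row of a tensor).
        return " ".join(self.vocab[int(i)] for i in ids)


tokenizer = ToyTokenizer({0: "the", 1: "cat", 2: "sat", 3: "down"})

# A batch of overlapping windows, shaped like the tensor in the question.
# Iterating over a 2-D PyTorch tensor likewise yields one row at a time.
xs = [[0, 1, 2], [1, 2, 3]]

decoded = [tokenizer.decode(x) for x in xs]
print(decoded)  # ['the cat sat', 'cat sat down']
```

If your tokenizer comes from the Hugging Face transformers library (an assumption, since the question does not name it), it also offers tokenizer.batch_decode(xs), which decodes every row in one call.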
Possibly also useful: tokenizer also provides the methods convert_ids_to_tokens, which does what the name suggests, and convert_tokens_to_string, which merges subword tokens into words to recover the original input.
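The two-step route can be sketched in the same way. ToySubwordTokenizer is again a hypothetical stand-in, and the "##" continuation marker is an assumed WordPiece-style convention. A real tokenizer may mark subwords differently, so treat this as an illustration of the idea rather than of any library's exact behaviour.

```python
# Hypothetical stand-in that mimics the two method names mentioned above.
class ToySubwordTokenizer:
    def __init__(self, vocab):
        # vocab maps token id -> subword token
        self.vocab = vocab

    def convert_ids_to_tokens(self, ids):
        # Look up each id; subword pieces are returned unmerged.
        return [self.vocab[int(i)] for i in ids]

    def convert_tokens_to_string(self, tokens):
        # Assumed convention: a leading "##" means "glue onto the previous token".
        return " ".join(tokens).replace(" ##", "")


tokenizer = ToySubwordTokenizer({0: "token", 1: "##izer", 2: "##s", 3: "work"})
ids = [0, 1, 2, 3]

tokens = tokenizer.convert_ids_to_tokens(ids)
text = tokenizer.convert_tokens_to_string(tokens)

print(tokens)  # ['token', '##izer', '##s', 'work']
print(text)    # tokenizers work
```

Inspecting the intermediate tokens is handy when the decoded text looks wrong, because it shows exactly which subword pieces the ids map to.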
Answered By - KonstantinosKokos