Issue
Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode()
function?
For example:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
text = "This is a tokenization example"
tokenized = tokenizer.tokenize(text)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']
encoded = tokenizer.encode_plus(text)
## encoded['input_ids']=[0, 42, 16, 10, 19233, 1938, 1246, 2]
decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'
The objective is to have a function that maps each token in the decoded output back to the correct input word; for this example it would be:
desired_output = [[1], [2], [3], [4, 5], [6]]
This is because 'this' corresponds to id 42 (at index 1), while 'token' and 'ization' correspond to the ids [19233, 1938], which are at indexes 4 and 5 of the input_ids array.
Solution
If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original word. What constitutes a word vs. a subword depends on the tokenizer: a word is something generated by the pre-tokenization stage (i.e. split by whitespace), while a subword is generated by the actual model (BPE or Unigram, for example).
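As a minimal illustration (using the same roberta-large fast tokenizer as in the question; the exact token strings may vary slightly with tokenizer options, but the alignment is the point), word_ids() lines up one-to-one with the input ids, with None for special tokens and a repeated id for sub-words of the same word:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-large')
encoded = tokenizer("This is a tokenization example")
print(encoded.tokens())
## ['<s>', 'This', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample', '</s>']
print(encoded.word_ids())
## [None, 0, 1, 2, 3, 3, 4, None]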
The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom step that splits based on PascalCase - the words here are 'Pascal' and 'Case'; the accepted answer won't work in this case, since it assumes words are whitespace-delimited. (A sketch of such a splitting step is shown after the solution code below.)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"
encoded = tokenizer(example)
desired_output = []
for word_id in encoded.word_ids():
    # skip special tokens, which have no corresponding word
    if word_id is not None:
        # token span (start inclusive, end exclusive) covered by this word
        start, end = encoded.word_to_tokens(word_id)
        if start == end - 1:
            # the word was kept as a single token
            tokens = [start]
        else:
            # first and last sub-token index of a multi-token word
            tokens = [start, end - 1]
        # word_ids() repeats the id for every sub-token, so avoid duplicates
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)
desired_output
## [[1], [2], [3], [4, 5], [6]]
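For reference, one way such a PascalCase splitting step could be wired in is sketched below. This is only an illustrative sketch, not the author's actual implementation: it assumes the Split and Sequence pre-tokenizers and the Regex helper from the tokenizers library, and the regex "[A-Z][a-z]+" is my own choice.

from transformers import AutoTokenizer
from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence, Split

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
# prepend a splitter that isolates capitalised chunks, so "PascalCase"
# is pre-tokenized into the two words "Pascal" and "Case"
tokenizer.backend_tokenizer.pre_tokenizer = Sequence([
    Split(Regex("[A-Z][a-z]+"), behavior="isolated"),
    tokenizer.backend_tokenizer.pre_tokenizer,
])
encoded = tokenizer("a PascalCase example")
print(encoded.word_ids())
## "Pascal" and "Case" should now get distinct word ids

With a step like this in place, "Pascal" and "Case" are separate words at the pre-tokenization stage, so the word_ids()/word_to_tokens() loop above maps their sub-tokens back to each part individually rather than to the whole whitespace-delimited string.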
Answered By - David Waterworth