Issue
I seem to be able to add tokens without issue, but if I try to add a suffix token (i.e. one that doesn't have the init character 'Ġ'
at the front), the tokenizer doesn't put spaces in the right spots. Here's some very simplified test code.
from copy import deepcopy
from transformers import BartTokenizer
# Get the different tokenizers
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
tokenizer_ext = deepcopy(tokenizer)
# Note that putting Ġ after the token causes the token not to be used
num_added = tokenizer_ext.add_tokens(['-of', '_01', 'WXYZ'])
# Original sentence
print('Original')
serial = ':ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )'
print(serial)
print()
# Baseline tokenizer
print('Bart default tokenizer')
tokens = tokenizer.tokenize(serial)
out_str = tokenizer.convert_tokens_to_string(tokens)
print(tokens)
print(out_str)
print()
# extended tokenizer
print('Extended tokenizer')
tokens = tokenizer_ext.tokenize(serial)
out_str = tokenizer_ext.convert_tokens_to_string(tokens)
print(tokens)
print(out_str)
This gives...
Original
:ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )
Bart default tokenizer
[':', 'AR', 'G', '0', '-', 'of', 'Ġ(', 'Ġsense', '_', '01', 'Ġ:', 'AR', 'G', '1', 'Ġ(', 'Ġurgency', 'W', 'XY', 'Z', 'Ġ)']
:ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )
Extended tokenizer
[':', 'AR', 'G', '0', '-of', '(', 'Ġsense', '_01', ':', 'AR', 'G', '1', 'Ġ(', 'Ġurgency', 'WXYZ', ')']
:ARG0-of( sense_01:ARG1 ( urgencyWXYZ)
Notice that the default Bart tokenizer produces the same output as the original sentence, but the extended tokenizer doesn't put in spaces after the new suffix tokens, i.e. it selects '(' instead of 'Ġ('. Any idea why this is and what's the right way to add suffix tokens?
Solution
The short answer is that there's "behavior" (bug?) in the handling of added tokens for Bart (and RoBERTa, GPT-2, etc.) that explicitly strips spaces from the text adjacent (both left and right) to the added token's location. I don't see a simple work-around for this.
Added tokens are handled differently in the transformers tokenizer code. The text is first split using a Trie to identify any tokens in the added-tokens list (see tokenization_utils.py::tokenize()). After any added tokens have been found, the remaining text is tokenized with the existing vocab/BPE encoding scheme (see tokenization_gpt2.py::_tokenize()).
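To make the two-stage process concrete, here's an illustrative peek at the first stage, continuing from the question's tokenizer_ext. Note that tokens_trie and its split() method are internal details of transformers v4.12.x and may differ in other versions.
# Illustration only: the internal Trie splits the raw text around added tokens
print(tokenizer_ext.tokens_trie.split(':ARG0-of ( sense_01'))
# expected to be something like [':ARG0', '-of', ' ( sense', '_01']
# The non-added chunks (':ARG0', ' ( sense') are then passed to the normal
# BPE _tokenize(), while the added tokens ('-of', '_01') are kept whole.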
The added tokens are also placed in the self.unique_no_split_tokens list, which prevents them from being broken down further into smaller chunks. The code that handles this (see tokenization_utils.py::tokenize()) explicitly strips the spaces from the text to the left and right of the added token.
You could manually remove them from the "no split" list, but then they may be broken down into smaller sub-components.
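A minimal sketch of that manual removal, assuming transformers v4.12.x where unique_no_split_tokens is a plain list and the internal Trie can be rebuilt with the private _create_trie() helper (both are implementation details, so this may break in other versions):
# Continuing from the question's tokenizer_ext: drop the added tokens from
# the "no split" list and rebuild the internal Trie to match.
tokenizer_ext.unique_no_split_tokens = [
    t for t in tokenizer_ext.unique_no_split_tokens
    if t not in ('-of', '_01', 'WXYZ')
]
tokenizer_ext._create_trie(tokenizer_ext.unique_no_split_tokens)
# The spacing problem goes away, but '-of', '_01' and 'WXYZ' are now fed
# through BPE and split into existing sub-word pieces, so the ids of the
# newly added tokens are never produced.
print(tokenizer_ext.tokenize(':ARG0-of ( sense_01'))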
Note that for "special tokens", if you add the token via the AddedToken class you can set its lstrip and rstrip behaviors, but this isn't available for non-special tokens.
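For example, here is a minimal sketch of that special-token route, assuming you can accept the side effects of making the markers special tokens (e.g. decode(skip_special_tokens=True) will drop them):
from transformers import AddedToken, BartTokenizer

tokenizer_sp = BartTokenizer.from_pretrained('facebook/bart-base')
# lstrip=False / rstrip=False keeps the whitespace around the added tokens
tokenizer_sp.add_special_tokens({
    'additional_special_tokens': [
        AddedToken('-of', lstrip=False, rstrip=False),
        AddedToken('_01', lstrip=False, rstrip=False),
        AddedToken('WXYZ', lstrip=False, rstrip=False),
    ]
})
print(tokenizer_sp.tokenize(':ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )'))
# With the spaces preserved, 'Ġ(' and 'Ġsense' are selected again, so
# convert_tokens_to_string reproduces the original spacing.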
See https://github.com/huggingface/transformers/blob/v4.12.5-release/src/transformers/tokenization_utils.py#L517 for the else statement where the spaces are stripped.
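Paraphrased and simplified from that linked v4.12.5 source, the relevant branch looks roughly like this (left and right are the text chunks adjacent to the added token):
# Simplified paraphrase of tokenization_utils.py::tokenize() (v4.12.5)
if isinstance(tok_extended, AddedToken):
    # special tokens: stripping is controlled by the AddedToken flags
    if tok_extended.rstrip and right:
        tokens[i + 1] = right.lstrip()
    if tok_extended.lstrip and left:
        tokens[i - 1] = left.rstrip()
else:
    # plain added tokens always land here: spaces on both sides are stripped
    if right:
        tokens[i + 1] = right.lstrip()
    if left:
        tokens[i - 1] = left.rstrip()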
Answered By - bivouac0