Issue
I seem to be able to add tokens without issue, but if I try to add a suffix token (i.e. one that doesn't have the init character 'Ġ'
at the front), the tokenizer doesn't put spaces in the right spots. Here's some very simplified test code.
from copy import deepcopy
from transformers import BartTokenizer
# Get the different tokenizers
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
tokenizer_ext = deepcopy(tokenizer)
# Note that putting Ġ after the token causes the token not to be used
num_added = tokenizer_ext.add_tokens(['-of', '_01', 'WXYZ'])
# Original sentence
print('Original')
serial = ':ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )'
print(serial)
print()
# Baseline tokenizer
print('Bart default tokenizer')
tokens = tokenizer.tokenize(serial)
out_str = tokenizer.convert_tokens_to_string(tokens)
print(tokens)
print(out_str)
print()
# extended tokenizer
print('Extended tokenizer')
tokens = tokenizer_ext.tokenize(serial)
out_str = tokenizer_ext.convert_tokens_to_string(tokens)
print(tokens)
print(out_str)
This gives...
Original
:ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )
Bart default tokenizer
[':', 'AR', 'G', '0', '-', 'of', 'Ġ(', 'Ġsense', '_', '01', 'Ġ:', 'AR', 'G', '1', 'Ġ(', 'Ġurgency', 'W', 'XY', 'Z', 'Ġ)']
:ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )
Extended tokenizer
[':', 'AR', 'G', '0', '-of', '(', 'Ġsense', '_01', ':', 'AR', 'G', '1', 'Ġ(', 'Ġurgency', 'WXYZ', ')']
:ARG0-of( sense_01:ARG1 ( urgencyWXYZ)
Notice that the default Bart tokenizer produces the same output as the original sentence, but the extended tokenizer doesn't put in spaces after the new suffix tokens, i.e. it selects '(' instead of 'Ġ('. Any idea why this is and what's the right way to add suffix tokens?
Solution
The short answer is that there's "behavior" (bug?) in the handling of added tokens for Bart (and RoBERTa, GPT-2, etc.) that explicitly strips spaces from the text adjacent (both left and right) to the added token's location. I don't see a simple work-around for this.
Added tokens are handled differently in the transformers tokenizer code. The text is first split using a Trie to identify any tokens in the added-tokens list (see tokenization_utils.py::tokenize()). After any added tokens have been found, the remaining text is tokenized with the existing vocab/BPE encoding scheme (see tokenization_gpt2.py::_tokenize()).
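To make the two-stage process concrete, here's an illustrative peek at the first stage, continuing from the question's tokenizer_ext. Note that tokens_trie and its split() method are internal details of transformers v4.12.x and may differ in other versions.
# Illustration only: the internal Trie splits the raw text around added tokens
print(tokenizer_ext.tokens_trie.split(':ARG0-of ( sense_01'))
# expected to be something like [':ARG0', '-of', ' ( sense', '_01']
# The non-added chunks (':ARG0', ' ( sense') are then passed to the normal
# BPE _tokenize(), while the added tokens ('-of', '_01') are kept whole.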
The added tokens are also placed in the self.unique_no_split_tokens list, which prevents them from being broken down further into smaller chunks. The code that handles this (see tokenization_utils.py::tokenize()) explicitly strips the spaces from the text to the left and right of the added token.
You could manually remove them from the "no split" list, but then they may be broken down into smaller sub-components.
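A minimal sketch of that manual removal, assuming transformers v4.12.x where unique_no_split_tokens is a plain list and the internal Trie can be rebuilt with the private _create_trie() helper (both are implementation details, so this may break in other versions):
# Continuing from the question's tokenizer_ext: drop the added tokens from
# the "no split" list and rebuild the internal Trie to match.
tokenizer_ext.unique_no_split_tokens = [
    t for t in tokenizer_ext.unique_no_split_tokens
    if t not in ('-of', '_01', 'WXYZ')
]
tokenizer_ext._create_trie(tokenizer_ext.unique_no_split_tokens)
# The spacing problem goes away, but '-of', '_01' and 'WXYZ' are now fed
# through BPE and split into existing sub-word pieces, so the ids of the
# newly added tokens are never produced.
print(tokenizer_ext.tokenize(':ARG0-of ( sense_01'))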
Note that for "special tokens", if you add the token via the AddedToken class you can set its lstrip and rstrip behaviors, but this isn't available for non-special tokens.
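For example, here is a minimal sketch of that special-token route, assuming you can accept the side effects of making the markers special tokens (e.g. decode(skip_special_tokens=True) will drop them):
from transformers import AddedToken, BartTokenizer

tokenizer_sp = BartTokenizer.from_pretrained('facebook/bart-base')
# lstrip=False / rstrip=False keeps the whitespace around the added tokens
tokenizer_sp.add_special_tokens({
    'additional_special_tokens': [
        AddedToken('-of', lstrip=False, rstrip=False),
        AddedToken('_01', lstrip=False, rstrip=False),
        AddedToken('WXYZ', lstrip=False, rstrip=False),
    ]
})
print(tokenizer_sp.tokenize(':ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )'))
# With the spaces preserved, 'Ġ(' and 'Ġsense' are selected again, so
# convert_tokens_to_string reproduces the original spacing.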
See https://github.com/huggingface/transformers/blob/v4.12.5-release/src/transformers/tokenization_utils.py#L517 for the else statement where the spaces are stripped.
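Paraphrased and simplified from that linked v4.12.5 source, the relevant branch looks roughly like this (left and right are the text chunks adjacent to the added token):
# Simplified paraphrase of tokenization_utils.py::tokenize() (v4.12.5)
if isinstance(tok_extended, AddedToken):
    # special tokens: stripping is controlled by the AddedToken flags
    if tok_extended.rstrip and right:
        tokens[i + 1] = right.lstrip()
    if tok_extended.lstrip and left:
        tokens[i - 1] = left.rstrip()
else:
    # plain added tokens always land here: spaces on both sides are stripped
    if right:
        tokens[i + 1] = right.lstrip()
    if left:
        tokens[i - 1] = left.rstrip()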
Answered By - bivouac0