EOS token is also padding token

#3
by stephantulkens - opened

Hello!

There's some weird behavior with the tokenizer. When encoding text with the tokenizer from HF, no eos token is added, e.g.:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.encode("dogs")
# output: [128000, 18964]

The tokenizer seems to use <|end_of_text|> as the padding token, however:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.batch_encode_plus(["dogs", "many cats"], padding=True)
# output: [[128000, 81134, 128001], [128000, 35676, 19987]]

When we look at the special tokens, the <|end_of_text|> token is indeed stored as both the padding token and the eos token. Because it is also stored as the pad token, any eos tokens are effectively treated as padding and stripped after encoding.
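For reference, the mapping below is just the tokenizer's special_tokens_map (a standard attribute on Hugging Face tokenizers), so it can be reproduced with something like:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
print(tok.special_tokens_map)
print(tok.pad_token_id == tok.eos_token_id)  # True: pad and eos share the same id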

{'bos_token': '<|begin_of_text|>',
 'eos_token': '<|end_of_text|>',
 'pad_token': '<|end_of_text|>',
 'mask_token': '<|mask|>'}

A cursory look at the other tokens showed that there doesn't seem to be a dedicated padding token in the vocabulary.

When using the bare tokenizer (the backend model), every instance is padded to a length of 512 with <|end_of_text|> tokens. This comes from the config (it has a fixed padding strategy of 512 tokens), but it looked a little odd to me.
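For what it's worth, the backend padding settings can be inspected roughly like this (via the padding property of the underlying tokenizers.Tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
# The fast tokenizer wraps a tokenizers.Tokenizer; its padding property
# returns the configured padding parameters, or None if padding is disabled.
print(tok.backend_tokenizer.padding)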

So I'm just here to confirm whether this is intended, or whether the tokenizer should have a dedicated padding token which went missing. Thanks!

EuroBERT org

Hello!

Thank you for your interest!

Indeed, you can use the eos token as the padding token; this is what we have been doing during fine-tuning.
Regarding the default padding length of 512, that was a misconfiguration in the config file. I have updated it to remove that setting.

It should be fixed if you redownload the tokenizer:

tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", force_download=True)
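As a quick sanity check after redownloading (continuing from the line above, and assuming the padding property of the backend tokenizer, as in the snippet earlier in this thread):

# No fixed-length padding should be configured on the backend tokenizer anymore.
print(tokenizer.backend_tokenizer.padding)  # expected: None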

Let me know if this fixes the issue for you!
