Why do I need a different mT5 tokenizer?

If you, like me, have tried several times to use the mT5 model for the span-corruption task, you might have run into some weird issues when using it for text2text generation. Around two years ago I reported this in a forum question: the model was not properly tokenizing the sentinel tokens. The first issue was fixed, but the SentencePiece model itself remained broken, so in an effort to help everyone else who might run into this, I'm publishing the patched SentencePiece model here.
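
To see the symptom for yourself, you can check the raw SentencePiece model directly, without the transformers wrapper. This is only a minimal sketch; the file name `spiece.model` (the usual name in the mT5 repo) and the exact behaviour on your copy are assumptions:

```python
# Minimal check of the raw SentencePiece model, independent of transformers.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spiece.model")

# If the sentinel piece was stored with a leading space, the literal string
# "<extra_id_0>" is not in the vocabulary, so encoding splits it into sub-pieces.
print(sp.piece_to_id("<extra_id_0>"))           # falls back to the unk id
print(sp.encode("<extra_id_0>", out_type=str))  # several pieces instead of one
```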

The issue lies in the tokenizer not handling the sentinel tokens correctly: they get broken into smaller sub-tokens when they should be kept intact. Long story short, the sentinel tokens were saved in the SentencePiece model with a leading space (e.g., " <extra_id_N>" instead of "<extra_id_N>"), which causes all this mess. So I had to manually rewrite the sentinel pieces in the SentencePiece model into the proper format. I'm not sure whether this breaks anything else, but I've run some experiments and everything seems fine with the patched model.
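
If you would rather patch your own copy instead of downloading the one here, a rough sketch of how the sentinel pieces can be rewritten looks like this. It assumes the protobuf bindings bundled with the sentencepiece package, the file names are placeholders, and it is not necessarily byte-for-byte what I did:

```python
# Rough sketch: strip the spurious leading space from the sentinel pieces
# and write out a patched copy of the model.
from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open("spiece.model", "rb") as f:
    proto.ParseFromString(f.read())

for piece in proto.pieces:
    stripped = piece.piece.lstrip(" \u2581")  # plain space or SentencePiece's "▁" marker
    if stripped.startswith("<extra_id_") and stripped.endswith(">"):
        piece.piece = stripped  # keep the sentinel itself, drop only the prefix

with open("spiece_patched.model", "wb") as f:
    f.write(proto.SerializeToString())
```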

I've tried several times, to no avail, to get in touch with the MT5 repo maintainers so that they could publish the fixed model. In the meantime, I'm putting it here to help others who run into the same problem, and so that they know it is a misconfiguration issue, not necessarily a model performance issue. If you are a member of the team and see this, hey! :wave: Please look into this issue and see whether this fixed model works for you too!
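
If you just want to use the patched tokenizer from this repo, loading it should be the usual one-liner (assuming the repo exposes the standard tokenizer files; the example sentence is just for illustration):

```python
from transformers import AutoTokenizer

# Load the patched tokenizer from this repository.
tok = AutoTokenizer.from_pretrained("Dhurmir/patched-mt5-tokenizer")
print(tok.tokenize("The quick brown <extra_id_0> jumps over the <extra_id_1> dog."))
```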

Best regards!


license: apache-2.0
base_model: google/mt5-base
