Why do I need a different mT5 tokenizer?

If you, like me, have tried several times to use the mT5 model for the span-corruption task, you might have run into some weird issues when using it for text2text generation. Around two years ago I reported this in a forum question: the model was not properly tokenizing the sentinel tokens. The first issue was fixed, but the SentencePiece model itself remained broken, so in an effort to help everyone else who might run into this, I'm publishing the patched SentencePiece model here.
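
To see the symptom for yourself, you can check the raw SentencePiece model directly, without the transformers wrapper. This is only a minimal sketch; the file name `spiece.model` (the usual name in the mT5 repo) and the exact behaviour on your copy are assumptions:

```python
# Minimal check of the raw SentencePiece model, independent of transformers.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spiece.model")

# If the sentinel piece was stored with a leading space, the literal string
# "<extra_id_0>" is not in the vocabulary, so encoding splits it into sub-pieces.
print(sp.piece_to_id("<extra_id_0>"))           # falls back to the unk id
print(sp.encode("<extra_id_0>", out_type=str))  # several pieces instead of one
```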

The issue lies in the tokenizer not handling the sentinel tokens correctly: they get broken into smaller sub-tokens when they should be kept intact. Long story short, the sentinel tokens were saved in the SentencePiece model with a leading space (e.g., " <extra_id_N>" instead of "<extra_id_N>"), which causes all this mess. So I had to manually rewrite the sentinel pieces in the SentencePiece model into the proper format. I'm not sure whether this breaks anything else, but I've run some experiments and everything seems fine with the patched model.
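
If you would rather patch your own copy instead of downloading the one here, a rough sketch of how the sentinel pieces can be rewritten looks like this. It assumes the protobuf bindings bundled with the sentencepiece package, the file names are placeholders, and it is not necessarily byte-for-byte what I did:

```python
# Rough sketch: strip the spurious leading space from the sentinel pieces
# and write out a patched copy of the model.
from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open("spiece.model", "rb") as f:
    proto.ParseFromString(f.read())

for piece in proto.pieces:
    stripped = piece.piece.lstrip(" \u2581")  # plain space or SentencePiece's "▁" marker
    if stripped.startswith("<extra_id_") and stripped.endswith(">"):
        piece.piece = stripped  # keep the sentinel itself, drop only the prefix

with open("spiece_patched.model", "wb") as f:
    f.write(proto.SerializeToString())
```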

I've tried several times, to no avail, to get in touch with the MT5 repo maintainers so that they could publish the fixed model. In the meantime, I'm putting it here to help others who run into the same problem, and so that they know it is a misconfiguration issue, not necessarily a model performance issue. If you are a member of the team and see this, hey! :wave: Please look into this issue and see whether this fixed model works for you too!
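
If you just want to use the patched tokenizer from this repo, loading it should be the usual one-liner (assuming the repo exposes the standard tokenizer files; the example sentence is just for illustration):

```python
from transformers import AutoTokenizer

# Load the patched tokenizer from this repository.
tok = AutoTokenizer.from_pretrained("Dhurmir/patched-mt5-tokenizer")
print(tok.tokenize("The quick brown <extra_id_0> jumps over the <extra_id_1> dog."))
```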

Best regards!


license: apache-2.0
base_model: google/mt5-base
