Tokenizer doesn't distinguish dash and hyphen

#10
by nshmyrevgmail - opened

- он шутит - сказал человек - амфибия.

['[CLS]', '-', 'он', 'шутит', '-', 'сказал', 'человек', '-', 'амфи', '##бия', '.', '[SEP]']

- он шутит - сказал человек-амфибия.

['[CLS]', '-', 'он', 'шутит', '-', 'сказал', 'человек', '-', 'амфи', '##бия', '.', '[SEP]']

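While this is a common issue, it is a bigger problem for Russian, where the hyphen is used much more actively than in English. Both inputs above produce exactly the same token sequence, so the model cannot tell a dash that separates clauses from a hyphen that joins a compound word.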
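To make the collapse concrete, here is a minimal sketch in plain Python. It is not the model's actual tokenizer: `basic_split` is a hypothetical regex stand-in for BERT-style splitting on whitespace and punctuation, and `hyphen_aware_split` is one possible alternative (an assumption, not a proposed fix from the maintainers) that keeps a hyphen attached when it has word characters on both sides and no surrounding spaces.

```python
import re

# Two inputs that differ only in the spacing around the last "-":
# a dash between clauses vs. a hyphen inside the compound "человек-амфибия".
DASH = "- он шутит - сказал человек - амфибия."
HYPHEN = "- он шутит - сказал человек-амфибия."


def basic_split(text: str) -> list[str]:
    """Simplified stand-in for BERT-style pre-tokenization:
    every punctuation character becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def hyphen_aware_split(text: str) -> list[str]:
    """Hypothetical variant: a "-" with word characters directly on
    both sides stays inside the word; a spaced "-" is still split off."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


if __name__ == "__main__":
    # Whitespace is discarded, so both inputs look identical.
    print(basic_split(DASH) == basic_split(HYPHEN))
    # The hyphenated compound survives as a single pre-token.
    print(hyphen_aware_split(HYPHEN))
```

With the second splitter the compound stays a single pre-token, so a downstream subword model would at least receive different input for the two sentences. Whether that helps in practice depends on the vocabulary the model was trained with.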

https://github.com/huggingface/transformers/issues/21439
