Clarification on the tokenizer: Concatenated tokenizer or Aggregate tokenizer?
#9 · opened by leestevennz
Hi there,
Quick question about a little technical detail.
I noticed the previous monolingual English versions used a concatenated tokenizer with 1024 tokens per language, where you could add new languages.
However, I note the multilingual v2 states: "It uses a unified SentencePiece Tokenizer [5] with a vocabulary of 16,384 tokens, optimized across all 25 supported languages".
To me that sounds like the tokenizer was created by aggregating data from all 25 supported languages. Is that correct, or am I way off base with that assumption?
Cheers,
Lee
It's a single unified tokenizer trained across all languages, not concatenated per-language tokenizers.
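For anyone curious what the difference looks like in practice, here is a minimal sketch using the sentencepiece Python library. The corpus filenames and the three-language subset are hypothetical illustrations, not the actual training setup; only the vocabulary sizes (16,384 unified vs. 1024 per language) come from the discussion above.

```python
import sentencepiece as spm

# Unified approach (multilingual v2, per the answer above):
# train ONE model on a corpus containing text from all supported languages,
# so the 16,384-token vocabulary is shared and optimized jointly.
# "multilingual_corpus.txt" is a hypothetical mixed-language text file.
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",
    model_prefix="unified",
    vocab_size=16384,
    character_coverage=1.0,
)

# Concatenated approach (the older monolingual setup from the question):
# train a separate 1024-token model per language, then stack the vocabularies.
languages = ["en", "de", "fr"]  # illustrative subset only
for i, lang in enumerate(languages):
    spm.SentencePieceTrainer.train(
        input=f"{lang}_corpus.txt",   # hypothetical per-language corpus
        model_prefix=f"{lang}_tok",
        vocab_size=1024,
        character_coverage=1.0,
    )
    # At lookup time, this language's token IDs would be offset by i * 1024
    # so the concatenated vocabulary has no ID collisions.
```

The practical consequence is that the unified model shares subword units between related languages, while the concatenated setup keeps each language's 1024-token block independent, which is what makes adding a new language easier in that scheme.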