Clarification on the tokenizer: Concatenated tokenizer or Aggregate tokenizer?

#9
by leestevennz - opened

Hi there,

Quick question about a little technical detail.

I noticed the previous monolingual English versions used a concatenated tokenizer of size 1024 per language, to which you could add new languages.
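
For reference, my mental model of that concatenated approach is roughly the following (a rough sketch only; the class and names are made up for illustration, this is not NeMo's actual implementation):

```python
import sentencepiece as spm

class ConcatenatedTokenizer:
    """Sketch of a concatenated (per-language) tokenizer: each language
    keeps its own 1024-token vocab, and token IDs are offset so the
    combined ID space is the concatenation of the per-language vocabs."""

    def __init__(self, models, vocab_size_per_lang=1024):
        # models: dict mapping language code -> path to a SentencePiece model
        self.tokenizers = {
            lang: spm.SentencePieceProcessor(model_file=path)
            for lang, path in models.items()
        }
        self.offsets = {
            lang: i * vocab_size_per_lang
            for i, lang in enumerate(models)
        }

    def encode(self, text, lang):
        # Shift per-language IDs into that language's slice of the
        # concatenated vocabulary; adding a language just appends a slice.
        return [self.offsets[lang] + t for t in self.tokenizers[lang].encode(text)]
```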

However, I note that the multilingual v2 states: "It uses a unified SentencePiece Tokenizer [5] with a vocabulary of 16,384 tokens, optimized across all 25 supported languages"

To me that sounds like the tokenizer was created by training a single model on text aggregated from all 25 supported languages. Is that correct, or am I way off base with that assumption?
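
If so, I'm imagining something along these lines (a rough sketch with the sentencepiece library; the corpus file names and training options are my guesses, not the actual recipe):

```python
import sentencepiece as spm

# Train one shared tokenizer on text pooled from all 25 languages.
# Input files and options here are illustrative assumptions only.
spm.SentencePieceTrainer.train(
    input="en.txt,de.txt,fr.txt,es.txt",  # one corpus file per language
    model_prefix="unified",
    vocab_size=16384,           # single shared vocabulary
    character_coverage=0.9995,  # common setting for multilingual text
    model_type="unigram",
)
```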

Cheers,

Lee

NVIDIA org

Single unified tokenizer for all languages, not concatenated.
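
For concreteness, that means a single SentencePieceProcessor handles input in any of the supported languages (a minimal illustration assuming the hypothetical unified.model sketched above):

```python
import sentencepiece as spm

# One processor, one 16,384-token vocab, any supported language;
# no language ID or per-language offset is needed at encode time.
sp = spm.SentencePieceProcessor(model_file="unified.model")
print(sp.encode("Hello world", out_type=str))
print(sp.encode("Hallo Welt", out_type=str))
```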
