Clarification on the tokenizer: Concatenated tokenizer or Aggregate tokenizer?

#9
by leestevennz - opened

Hi there,

Quick question about a little technical detail.

I noticed the previous monolingual English versions used a concatenated tokenizer of size 1024 per language, to which you could add new languages.
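
For reference, my mental model of that concatenated approach is roughly the following (a rough sketch only; the class and names are made up for illustration, this is not NeMo's actual implementation):

```python
import sentencepiece as spm

class ConcatenatedTokenizer:
    """Sketch of a concatenated (per-language) tokenizer: each language
    keeps its own 1024-token vocab, and token IDs are offset so the
    combined ID space is the concatenation of the per-language vocabs."""

    def __init__(self, models, vocab_size_per_lang=1024):
        # models: dict mapping language code -> path to a SentencePiece model
        self.tokenizers = {
            lang: spm.SentencePieceProcessor(model_file=path)
            for lang, path in models.items()
        }
        self.offsets = {
            lang: i * vocab_size_per_lang
            for i, lang in enumerate(models)
        }

    def encode(self, text, lang):
        # Shift per-language IDs into that language's slice of the
        # concatenated vocabulary; adding a language just appends a slice.
        return [self.offsets[lang] + t for t in self.tokenizers[lang].encode(text)]
```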

However, I note that the multilingual v2 states: "It uses a unified SentencePiece Tokenizer [5] with a vocabulary of 16,384 tokens, optimized across all 25 supported languages"

To me that sounds like the tokenizer was created by training a single model on text aggregated from all 25 supported languages. Is that correct, or am I way off base with that assumption?
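
If so, I'm imagining something along these lines (a rough sketch with the sentencepiece library; the corpus file names and training options are my guesses, not the actual recipe):

```python
import sentencepiece as spm

# Train one shared tokenizer on text pooled from all 25 languages.
# Input files and options here are illustrative assumptions only.
spm.SentencePieceTrainer.train(
    input="en.txt,de.txt,fr.txt,es.txt",  # one corpus file per language
    model_prefix="unified",
    vocab_size=16384,           # single shared vocabulary
    character_coverage=0.9995,  # common setting for multilingual text
    model_type="unigram",
)
```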

Cheers,

Lee

NVIDIA org

Single unified tokenizer for all languages, not concatenated.
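
For concreteness, that means a single SentencePieceProcessor handles input in any of the supported languages (a minimal illustration assuming the hypothetical unified.model sketched above):

```python
import sentencepiece as spm

# One processor, one 16,384-token vocab, any supported language;
# no language ID or per-language offset is needed at encode time.
sp = spm.SentencePieceProcessor(model_file="unified.model")
print(sp.encode("Hello world", out_type=str))
print(sp.encode("Hallo Welt", out_type=str))
```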
