Clarification on the tokenizer: Concatenated tokenizer or Aggregate tokenizer?

#9
by leestevennz

Hi there,

Quick question about a little technical detail.

I noticed the previous monolingual English versions used a concatenated tokenizer of size 1024 per language, to which you could add new languages.

However, I note that the multilingual v2 model card states: "It uses a unified SentencePiece Tokenizer [5] with a vocabulary of 16,384 tokens, optimized across all 25 supported languages"

To me that sounds like the tokenizer was created by aggregating data from all 25 supported languages. Is that correct, or am I way off base with that assumption?
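
For concreteness, here's a toy sketch of my mental model of the two schemes. The names and the offset arithmetic are purely my own illustration (not from the NeMo code); the 1024 figure is from the earlier monolingual model cards:

```python
# Toy illustration of the two tokenizer schemes as I understand them.
# Names and offset arithmetic are my own, not from the NeMo code.

PER_LANG_VOCAB = 1024  # per-language vocab size in the monolingual cards

def concatenated_id(lang_index: int, local_id: int) -> int:
    """Concatenated tokenizer: each language keeps its own small vocab,
    and global IDs live in per-language blocks, so adding a language
    just appends another 1024-ID block on the end."""
    assert 0 <= local_id < PER_LANG_VOCAB
    return lang_index * PER_LANG_VOCAB + local_id

# Unified tokenizer (what the v2 card seems to describe): one
# SentencePiece model trained on text pooled from all 25 languages,
# sharing a single 16,384-token vocabulary with no per-language offsets.
```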

Cheers,

Lee

NVIDIA org

Single unified tokenizer for all languages, not concatenated.

Hi @nithinraok,

I wonder where I can find the unified tokenizer model used in canary-1b-v2 so I can continue fine-tuning?

NVIDIA org

It's inside the .nemo file:

```
tar -xvf canary-1b-v2.nemo
```

Look for "cc5d48e83aad4be48aa9fa264b727c4b_tokenizer.model" among the extracted files.

@nithinraok

Thanks so much for confirming!

I have to say, I found this a bit confusing at first. In the original concatenated tokenizer paper (https://aclanthology.org/2023.calcs-1.7/), the authors compare a "concatenated tokenizer" against an "agg" (aggregate) tokenizer, and their agg tokenizer is defined the same way as what's called a "unified tokenizer" here.

But then in the NeMo codebase, the concatenated tokenizer is actually referred to as the "aggregated tokenizer" with type 'agg'!
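
To spell out the mismatch, here's roughly how I understand the two config shapes in NeMo, written as Python dicts. The key names are from my reading of the NeMo docs and may not be exact across versions:

```python
# Rough sketch of the two NeMo tokenizer config shapes; key names are
# from my reading of the NeMo docs and may differ between versions.

# NeMo's type="agg" ("aggregate") == the paper's *concatenated*
# tokenizer: one sub-tokenizer per language, IDs offset into blocks.
agg_cfg = {
    "type": "agg",
    "langs": {
        "en": {"dir": "tokenizers/en", "type": "bpe"},
        "de": {"dir": "tokenizers/de", "type": "bpe"},
    },
}

# What canary-1b-v2 ships: a single SentencePiece model (type="bpe"),
# i.e. the paper's *aggregate/unified* tokenizer.
unified_cfg = {
    "type": "bpe",
    "dir": "path/containing/tokenizer.model",
}
```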

It's a bit of a naming puzzle, but I appreciate you helping clear it up! 😊
