Clarification on the tokenizer: Concatenated tokenizer or Aggregate tokenizer?
Hi there,
Quick question about a little technical detail.
I noticed the previous monolingual English versions used a concatenated tokenizer of size 1024 per language, to which you could add new languages.
However, I note the multilingual v2 states: "It uses a unified SentencePiece Tokenizer [5] with a vocabulary of 16,384 tokens, optimized across all 25 supported languages"
That sounds to me like the tokenizer was created by aggregating all 25 supported languages. Is that correct, or am I way off base with that assumption?
Cheers,
Lee
Single unified tokenizer for all languages, not concatenated.
Hi @nithinraok,
I wonder where I can find the unified tokenizer model used in canary-1b-v2, so I can continue fine-tuning?
It's inside the .nemo archive. Extract it with tar -xvf canary-1b-v2.nemo and look for:
"cc5d48e83aad4be48aa9fa264b727c4b_tokenizer.model"
Thanks so much for confirming!
I have to say, I found this a bit confusing at first. In the original concatenated tokenizer paper (https://aclanthology.org/2023.calcs-1.7/), the authors compare a "concatenated tokenizer" with an "agg" (aggregate) tokenizer, and their aggregate tokenizer is defined the same way as what's called a "unified tokenizer" here.
But then in the NeMo codebase, the concatenated tokenizer is actually referred to as the "aggregated tokenizer" with type 'agg'!
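For anyone else who trips over this, here's a minimal sketch of the concatenated idea that NeMo's type 'agg' refers to (this is just an illustration of the ID-offset scheme, not NeMo's actual aggregated tokenizer class; the class name and file paths below are hypothetical): each language keeps its own small tokenizer, and its token IDs are shifted by a per-language offset into one shared ID space.

```python
import sentencepiece as spm

class ConcatTokenizerSketch:
    """Illustrative only: per-language tokenizers concatenated into one ID space."""

    def __init__(self, models: dict[str, str]):
        # models maps a language code to a per-language SentencePiece .model path
        self.sps: dict[str, spm.SentencePieceProcessor] = {}
        self.offsets: dict[str, int] = {}
        offset = 0
        for lang, path in models.items():
            sp = spm.SentencePieceProcessor(model_file=path)
            self.sps[lang] = sp
            self.offsets[lang] = offset
            offset += sp.vocab_size()  # e.g. 1024 per language in the monolingual setup

    def encode(self, text: str, lang: str) -> list[int]:
        # Tokenize with the language's own model, then shift into the shared ID space
        off = self.offsets[lang]
        return [tok_id + off for tok_id in self.sps[lang].encode(text)]

# Hypothetical usage with two per-language models:
# tok = ConcatTokenizerSketch({"en": "en_1024.model", "de": "de_1024.model"})
# tok.encode("hallo", "de")  # German IDs start at offset 1024
```

The unified tokenizer in v2 is the opposite design: one SentencePiece model trained jointly across all 25 languages, so there are no per-language offsets at all.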
It's a bit of a naming puzzle, but I appreciate you helping clear it up!