Clarification on the tokenizer: Concatenated tokenizer or Aggregate tokenizer?

#9
by leestevennz

Hi there,

Quick question about a little technical detail.

I noticed the previous monolingual English versions used a concatenated tokenizer of size 1024 per language, to which you could add new languages.

However, I note that the multilingual v2 model card states: "It uses a unified SentencePiece Tokenizer [5] with a vocabulary of 16,384 tokens, optimized across all 25 supported languages"

To me that sounds like the tokenizer was created by aggregating data from all 25 supported languages. Is that correct, or am I way off base with that assumption?
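
For concreteness, here's a toy sketch of my mental model of the two schemes. The names and the offset arithmetic are purely my own illustration (not from the NeMo code); the 1024 figure is from the earlier monolingual model cards:

```python
# Toy illustration of the two tokenizer schemes as I understand them.
# Names and offset arithmetic are my own, not from the NeMo code.

PER_LANG_VOCAB = 1024  # per-language vocab size in the monolingual cards

def concatenated_id(lang_index: int, local_id: int) -> int:
    """Concatenated tokenizer: each language keeps its own small vocab,
    and global IDs live in per-language blocks, so adding a language
    just appends another 1024-ID block on the end."""
    assert 0 <= local_id < PER_LANG_VOCAB
    return lang_index * PER_LANG_VOCAB + local_id

# Unified tokenizer (what the v2 card seems to describe): one
# SentencePiece model trained on text pooled from all 25 languages,
# sharing a single 16,384-token vocabulary with no per-language offsets.
```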

Cheers,

Lee

NVIDIA org

Single unified tokenizer for all languages, not concatenated.

Hi @nithinraok,

I wonder where I can find the unified tokenizer model used in canary-1b-v2 so I can continue fine-tuning?

NVIDIA org

It's inside the .nemo file:

```
tar -xvf canary-1b-v2.nemo
```

Look for "cc5d48e83aad4be48aa9fa264b727c4b_tokenizer.model" among the extracted files.

@nithinraok

Thanks so much for confirming!

I have to say, I found this a bit confusing at first. In the original concatenated tokenizer paper (https://aclanthology.org/2023.calcs-1.7/), the authors compare a "concatenated tokenizer" against an "agg" (aggregate) tokenizer, and their agg tokenizer is defined the same way as what's called a "unified tokenizer" here.

But then in the NeMo codebase, the concatenated tokenizer is actually referred to as the "aggregated tokenizer" with type 'agg'!
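
To spell out the mismatch, here's roughly how I understand the two config shapes in NeMo, written as Python dicts. The key names are from my reading of the NeMo docs and may not be exact across versions:

```python
# Rough sketch of the two NeMo tokenizer config shapes; key names are
# from my reading of the NeMo docs and may differ between versions.

# NeMo's type="agg" ("aggregate") == the paper's *concatenated*
# tokenizer: one sub-tokenizer per language, IDs offset into blocks.
agg_cfg = {
    "type": "agg",
    "langs": {
        "en": {"dir": "tokenizers/en", "type": "bpe"},
        "de": {"dir": "tokenizers/de", "type": "bpe"},
    },
}

# What canary-1b-v2 ships: a single SentencePiece model (type="bpe"),
# i.e. the paper's *aggregate/unified* tokenizer.
unified_cfg = {
    "type": "bpe",
    "dir": "path/containing/tokenizer.model",
}
```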

It's a bit of a naming puzzle, but I appreciate you helping clear it up! 😊
