Post
322
Tokenization is one of the most important processes in AI - yet many would like to kill it 💀
What's tokenization? The neural networks inside LLMs actually only process numbers, not text: tokenization is the process that makes text readable for them, by converting sentences into lists of numbers.
➡️ For instance, "This is tokenization" would be split into "This | is | token | ization", then each of the parts (tokens) are converted to IDs according to a predefined mapping: for instance "ization" could map to id 2438.
Thus "This is tokenization" can become 1335 | 135 | 2980 | 2438 => now the model can process the sentence!
Most tokenizers today use pre-specified mappings called "vocabularies", generally built about the compression algorithme Byte-Pair Encoding (BPE) that learns from a big corpuses of texts an optimized split to efficiently encode any text from the same distribution into a list token IDs.
🤨 Now, these current tokenizers have flaws.
For instance, the rigidity of their mapping creates losses ; the prime example being that a tokenizer designed for English (thus optimized for tokens like "has", "been", "clock", etc) will not have the right tokens to approach Burmese, thus being terribly inefficient at it.
Many alternative approaches have emerged as a result: for instance "tokenizer-free tokenizers". One that I really liked was "entropy-based": it monitors the stream of text, and trigger a split whenever the entropy increases too much, i.e. when something "surprising" happens.
But this great article argues that tokenizers are a lesser evil. Read and decide for yourself!
https://huggingface.co/blog/catherinearnett/in-defense-of-tokenizers
What's tokenization? The neural networks inside LLMs actually only process numbers, not text: tokenization is the process that makes text readable for them, by converting sentences into lists of numbers.
➡️ For instance, "This is tokenization" would be split into "This | is | token | ization", then each of the parts (tokens) are converted to IDs according to a predefined mapping: for instance "ization" could map to id 2438.
Thus "This is tokenization" can become 1335 | 135 | 2980 | 2438 => now the model can process the sentence!
Most tokenizers today use pre-specified mappings called "vocabularies", generally built about the compression algorithme Byte-Pair Encoding (BPE) that learns from a big corpuses of texts an optimized split to efficiently encode any text from the same distribution into a list token IDs.
🤨 Now, these current tokenizers have flaws.
For instance, the rigidity of their mapping creates losses ; the prime example being that a tokenizer designed for English (thus optimized for tokens like "has", "been", "clock", etc) will not have the right tokens to approach Burmese, thus being terribly inefficient at it.
Many alternative approaches have emerged as a result: for instance "tokenizer-free tokenizers". One that I really liked was "entropy-based": it monitors the stream of text, and trigger a split whenever the entropy increases too much, i.e. when something "surprising" happens.
But this great article argues that tokenizers are a lesser evil. Read and decide for yourself!
https://huggingface.co/blog/catherinearnett/in-defense-of-tokenizers