KL3M Whitespace Tokenizer Experiment - 16K
This is the 16,384-token (16K) vocabulary variant of the KL3M (Kelvin Legal Large Language Model) whitespace tokenizer experiment, trained on legal-domain text with separate space tokens for cleaner word embeddings.
Overview
The KL3M whitespace tokenizers v5 are a family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the KL3M dataset (copyright-clean legal corpus from the ALEA Institute). These tokenizers:
- Treat spaces as separate tokens (no Ġ prefix merging)
- Provide cleaner embedding→semantic mappings where each word has a consistent representation
- Use hierarchical vocabulary nesting where smaller vocabularies are proper subsets of larger ones
- Enable vocabulary expansion experiments and transfer learning across vocabulary sizes
Whitespace Tokenization Design
Key difference from GPT-2 style tokenizers: These tokenizers do NOT merge spaces with following words. Instead, spaces get their own token (Ġ), and words are tokenized independently without the space prefix.
Why This Design?
Standard BPE tokenizers (GPT-2, LLaMA, etc.) prepend spaces to tokens, so "the" and " the" become different tokens with different embeddings. This is efficient for compression but creates semantic inconsistency:
| Standard BPE | KL3M Whitespace |
|---|---|
| [The] [ the] [ THE] = 3 different embeddings | [The][Ġ][the][Ġ][THE] = same word, same embedding |
| "contract" and " contract" are different tokens | "contract" always has ONE embedding |
Trade-off
- Lower compression ratio (~3.1 chars/token vs ~4.0 for GPT-2 style)
- Cleaner semantics - each word maps to a consistent embedding regardless of position
Example: "The United States"
GPT-2 style: [The][ United][ States] (3 tokens; the space is baked into each word after the first)
KL3M Whitespace: [The][Ġ][United][Ġ][States] (5 tokens, spaces explicit)
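To see the explicit space tokens directly, tokenize a phrase and print the token strings. A minimal sketch using this repository's id; the exact token strings depend on the learned merges, so the expected output shown in the comment is illustrative:

```python
from transformers import PreTrainedTokenizerFast

# Load the 16K whitespace tokenizer from this repository
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Spaces appear as standalone tokens (the "Ġ" symbol) instead of
# being merged onto the following word.
print(tokenizer.tokenize("The United States"))
# illustrative output: ['The', 'Ġ', 'United', 'Ġ', 'States']
```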
Performance
On legal documents, these tokenizers trade compression for semantic clarity (a measurement sketch follows the table):
| Tokenizer | Vocab Size | Compression |
|---|---|---|
| KL3M v5-64K | 65,536 | ~3.1 chars/token |
| KL3M v5-32K | 32,768 | ~3.0 chars/token |
| KL3M v5-16K | 16,384 | ~2.9 chars/token |
| KL3M v5-8K | 8,192 | ~2.7 chars/token |
| KL3M v5-4K | 4,096 | ~2.5 chars/token |
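The figures above can be reproduced approximately by dividing character count by token count over a sample of legal text. A minimal sketch; the one-sentence sample is illustrative, and a real corpus will give somewhat different ratios:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Any representative legal text works; this sample is illustrative.
sample = (
    "The Licensor hereby grants to Licensee a non-exclusive, "
    "non-transferable license to use the Software."
)

ids = tokenizer.encode(sample, add_special_tokens=False)
print(f"characters:  {len(sample)}")
print(f"tokens:      {len(ids)}")
print(f"chars/token: {len(sample) / len(ids):.2f}")
```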
Tokenizer Family
This tokenizer is part of a hierarchically nested family: every token in a smaller vocabulary keeps the same ID in every larger vocabulary, enabling seamless vocabulary expansion (a quick check of this property follows the table):
| Vocabulary Size | HuggingFace Repository |
|---|---|
| 4,096 (4K) | alea-institute/kl3m-tokenizer-003-4k |
| 8,192 (8K) | alea-institute/kl3m-tokenizer-003-8k |
| 16,384 (16K) | alea-institute/kl3m-tokenizer-003-16k |
| 32,768 (32K) | alea-institute/kl3m-tokenizer-003-32k |
| 65,536 (64K) | alea-institute/kl3m-tokenizer-003-64k |
→ You are viewing: 16,384 (16K)
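One way to verify the nesting claim is to load a smaller variant alongside this one and confirm that every token in the smaller vocabulary maps to the same ID in the larger one. A minimal sketch using the 4K and 16K repositories listed above:

```python
from transformers import PreTrainedTokenizerFast

small = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-4k")
large = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-16k")

small_vocab = small.get_vocab()  # token string -> id
large_vocab = large.get_vocab()

# Nesting claim: every token in the 4K vocabulary should appear in the
# 16K vocabulary with the same id.
mismatches = [t for t, i in small_vocab.items() if large_vocab.get(t) != i]
print(f"4K tokens: {len(small_vocab)}, id mismatches in 16K: {len(mismatches)}")
```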
Key Features
1. Clean Embedding Semantics
Because spaces are kept as separate tokens, each word has a consistent embedding (see the sketch after this list):
- "contract" is always the same token, whether at start of sentence or mid-sentence
- No duplicate vocabulary entries for " word" vs "word"
- Embedding space maps more directly to semantic meaning
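As a quick check of this property, the same word should map to the same ID whether it starts the text or follows another word. A minimal sketch; it assumes "contract" is a single token in this vocabulary, which is likely for a legal-domain tokenizer but not guaranteed:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

for text in ["contract terms apply", "the contract terms apply"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    print(list(zip(tokens, ids)))

# "contract" maps to the same id in both cases; a space-prefixing
# tokenizer would produce "contract" vs " contract" with different ids.
```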
2. Hierarchical Vocabulary Nesting
Token IDs 0-4,095 are identical across all tokenizer sizes. This enables:
- Vocabulary expansion during training: Start with 4K vocab, expand to 32K mid-training
- Transfer learning: Initialize larger-vocab models from smaller-vocab checkpoints (sketched after this list)
- Controlled ablations: Compare vocab sizes while maintaining token alignment
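As referenced in the transfer-learning item above, here is a sketch of how the nesting property supports vocabulary expansion: copy the trained embedding rows for the shared ID range and freshly initialize the rest. The shapes and initialization are illustrative, not a prescribed recipe:

```python
import torch

# Illustrative shapes: expanding a 4K-vocab checkpoint to a 16K vocabulary.
old_vocab, new_vocab, hidden = 4096, 16384, 768

old_embeddings = torch.randn(old_vocab, hidden)   # stands in for trained weights
new_embeddings = torch.empty(new_vocab, hidden)
torch.nn.init.normal_(new_embeddings, std=0.02)   # fresh init for the new token rows

# Because ids 0..4095 mean the same thing in both vocabularies,
# the trained rows can be copied over unchanged.
new_embeddings[:old_vocab] = old_embeddings
```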
3. Legal Domain Optimization
Trained on the KL3M corpus (44GB of legal text):
- Court opinions and case law
- Contracts and agreements
- Patents and IP documents
- Legal briefs and filings
- Statutory and regulatory text
4. Special Tokens
Seven essential special tokens for language model training (their IDs can be confirmed with the snippet after the table):
| Token | ID | Purpose |
|---|---|---|
| <\|start\|> | 0 | Start of sequence |
| <\|end\|> | 1 | End of sequence |
| <\|pad\|> | 2 | Padding token |
| <\|unk\|> | 3 | Unknown token |
| <\|cls\|> | 4 | Classification (BERT) |
| <\|sep\|> | 5 | Separator (BERT) |
| <\|mask\|> | 6 | Mask token (MLM) |
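The IDs in the table can be confirmed at runtime; a minimal sketch:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Look up each special token string and print its id.
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))
```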
Usage
With Transformers
```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
With tokenizers Library
```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
```
Training a Model
```python
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Create model config
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=tokenizer.vocab_size,
)

# Initialize model
model = AutoModelForMaskedLM.from_config(config)

# Train with HuggingFace Trainer...
```
Technical Details
Training Corpus
- Source: KL3M (Kelvin Legal Large Language Model) dataset
- Size: ~44.2 GB
- Domain: Legal documents (copyright-clean)
Vocabulary Structure
- Base vocabulary: 256 bytes + special tokens
- Learned merges: BPE merges up to vocab_size
- Nesting property: every token in a smaller vocabulary exists with the same ID in every larger vocabulary
- Space handling: the space token (Ġ) is a separate entry, never merged with words (the snippet below inspects this)
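The structure above can be inspected by loading the vocabulary directly. A minimal sketch; it assumes the reported 16,384 entries include the special and byte-level tokens:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

vocab = tokenizer.get_vocab()  # token string -> id
print(f"vocabulary entries: {len(vocab)}")

# The space token should be its own vocabulary entry, not a word prefix.
print("standalone space token present:", "Ġ" in vocab)

# Lowest ids: special tokens first, then the byte-level base vocabulary.
for token, idx in sorted(vocab.items(), key=lambda kv: kv[1])[:10]:
    print(idx, repr(token))
```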
When to Use These Tokenizers
Use KL3M Whitespace tokenizers when:
- You want cleaner word→embedding mappings
- Semantic consistency matters more than compression
- You're doing embedding analysis or interpretability research
- You want words to have position-independent representations
Use standard BPE tokenizers (GPT-2 style) when:
- Maximum compression is the priority
- You need compatibility with pretrained models
- Sequence length is a critical constraint
Limitations
- Lower compression: ~25% more tokens than GPT-2-style tokenizers for the same text
- Training domain: Optimized for legal English text; may underperform on other domains
- Multilingual: Trained primarily on English; limited non-English support
License
MIT License
About ALEA Institute
The ALEA Institute develops open-source tools and datasets for legal AI, including the KL3M corpus and tokenizers.
Related Resources
- KL3M Dataset: aleainstitute.ai/work/kl3m
- Multi-word Tokenizers: alea-institute/kl3m-multi-word-002-*
Contact
For questions or issues, please visit: github.com/alea-institute