KL3M Whitespace Tokenizer Experiment - 16K
This is the 16,384-token (16K) vocabulary variant of the KL3M (Kelvin Legal Large Language Model) whitespace tokenizer experiment, trained on legal-domain text with separate space tokens for cleaner word embeddings.
Overview
The KL3M whitespace tokenizers v5 are a family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the KL3M dataset (copyright-clean legal corpus from the ALEA Institute). These tokenizers:
- Treat spaces as separate tokens (no Ġ prefix merging)
- Provide cleaner embedding→semantic mappings where each word has a consistent representation
- Use hierarchical vocabulary nesting where smaller vocabularies are proper subsets of larger ones
- Enable vocabulary expansion experiments and transfer learning across vocabulary sizes
Whitespace Tokenization Design
Key difference from GPT-2 style tokenizers: These tokenizers do NOT merge spaces with following words. Instead, spaces get their own token (Ġ), and words are tokenized independently without the space prefix.
Why This Design?
Standard BPE tokenizers (GPT-2, LLaMA, etc.) prepend spaces to tokens, so "the" and " the" become different tokens with different embeddings. This is efficient for compression but creates semantic inconsistency:
| Standard BPE | KL3M Whitespace |
|---|---|
| [The] [ the] [ THE] = 3 different embeddings | [The][Ġ][the][Ġ][THE] = same word, same embedding |
| "contract" and " contract" are different tokens | "contract" always has ONE embedding |
Trade-off
- Lower compression ratio (~3.1 chars/token vs ~4.0 for GPT-2 style)
- Cleaner semantics - each word maps to a consistent embedding regardless of position
Example: "The United States"
GPT-2 style: [The][ United][ States] (3 tokens; the space is baked into each word after the first)
KL3M Whitespace: [The][Ġ][United][Ġ][States] (5 tokens, spaces explicit)
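To see the explicit space tokens directly, tokenize a phrase and print the token strings. A minimal sketch using this repository's id; the exact token strings depend on the learned merges, so the expected output shown in the comment is illustrative:

```python
from transformers import PreTrainedTokenizerFast

# Load the 16K whitespace tokenizer from this repository
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Spaces appear as standalone tokens (the "Ġ" symbol) instead of
# being merged onto the following word.
print(tokenizer.tokenize("The United States"))
# illustrative output: ['The', 'Ġ', 'United', 'Ġ', 'States']
```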
Performance
On legal documents, these tokenizers trade compression for semantic clarity (a measurement sketch follows the table):
| Tokenizer | Vocab Size | Compression |
|---|---|---|
| KL3M v5-64K | 65,536 | ~3.1 chars/token |
| KL3M v5-32K | 32,768 | ~3.0 chars/token |
| KL3M v5-16K | 16,384 | ~2.9 chars/token |
| KL3M v5-8K | 8,192 | ~2.7 chars/token |
| KL3M v5-4K | 4,096 | ~2.5 chars/token |
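The figures above can be reproduced approximately by dividing character count by token count over a sample of legal text. A minimal sketch; the one-sentence sample is illustrative, and a real corpus will give somewhat different ratios:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Any representative legal text works; this sample is illustrative.
sample = (
    "The Licensor hereby grants to Licensee a non-exclusive, "
    "non-transferable license to use the Software."
)

ids = tokenizer.encode(sample, add_special_tokens=False)
print(f"characters:  {len(sample)}")
print(f"tokens:      {len(ids)}")
print(f"chars/token: {len(sample) / len(ids):.2f}")
```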
Tokenizer Family
This tokenizer is part of a hierarchically nested family: every token in a smaller vocabulary keeps the same ID in every larger vocabulary, enabling seamless vocabulary expansion (a quick check of this property follows the table):
| Vocabulary Size | HuggingFace Repository |
|---|---|
| 4,096 (4K) | alea-institute/kl3m-tokenizer-003-4k |
| 8,192 (8K) | alea-institute/kl3m-tokenizer-003-8k |
| 16,384 (16K) | alea-institute/kl3m-tokenizer-003-16k |
| 32,768 (32K) | alea-institute/kl3m-tokenizer-003-32k |
| 65,536 (64K) | alea-institute/kl3m-tokenizer-003-64k |
→ You are viewing: 16,384 (16K)
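One way to verify the nesting claim is to load a smaller variant alongside this one and confirm that every token in the smaller vocabulary maps to the same ID in the larger one. A minimal sketch using the 4K and 16K repositories listed above:

```python
from transformers import PreTrainedTokenizerFast

small = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-4k")
large = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-16k")

small_vocab = small.get_vocab()  # token string -> id
large_vocab = large.get_vocab()

# Nesting claim: every token in the 4K vocabulary should appear in the
# 16K vocabulary with the same id.
mismatches = [t for t, i in small_vocab.items() if large_vocab.get(t) != i]
print(f"4K tokens: {len(small_vocab)}, id mismatches in 16K: {len(mismatches)}")
```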
Key Features
1. Clean Embedding Semantics
Because spaces are kept as separate tokens, each word has a consistent embedding (see the sketch after this list):
- "contract" is always the same token, whether at start of sentence or mid-sentence
- No duplicate vocabulary entries for " word" vs "word"
- Embedding space maps more directly to semantic meaning
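As a quick check of this property, the same word should map to the same ID whether it starts the text or follows another word. A minimal sketch; it assumes "contract" is a single token in this vocabulary, which is likely for a legal-domain tokenizer but not guaranteed:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

for text in ["contract terms apply", "the contract terms apply"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    print(list(zip(tokens, ids)))

# "contract" maps to the same id in both cases; a space-prefixing
# tokenizer would produce "contract" vs " contract" with different ids.
```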
2. Hierarchical Vocabulary Nesting
Token IDs 0-4,095 are identical across all tokenizer sizes. This enables:
- Vocabulary expansion during training: Start with 4K vocab, expand to 32K mid-training
- Transfer learning: Initialize larger-vocab models from smaller-vocab checkpoints (sketched after this list)
- Controlled ablations: Compare vocab sizes while maintaining token alignment
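As referenced in the transfer-learning item above, here is a sketch of how the nesting property supports vocabulary expansion: copy the trained embedding rows for the shared ID range and freshly initialize the rest. The shapes and initialization are illustrative, not a prescribed recipe:

```python
import torch

# Illustrative shapes: expanding a 4K-vocab checkpoint to a 16K vocabulary.
old_vocab, new_vocab, hidden = 4096, 16384, 768

old_embeddings = torch.randn(old_vocab, hidden)   # stands in for trained weights
new_embeddings = torch.empty(new_vocab, hidden)
torch.nn.init.normal_(new_embeddings, std=0.02)   # fresh init for the new token rows

# Because ids 0..4095 mean the same thing in both vocabularies,
# the trained rows can be copied over unchanged.
new_embeddings[:old_vocab] = old_embeddings
```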
3. Legal Domain Optimization
Trained on the KL3M corpus (44GB of legal text):
- Court opinions and case law
- Contracts and agreements
- Patents and IP documents
- Legal briefs and filings
- Statutory and regulatory text
4. Special Tokens
Seven essential special tokens for language model training (their IDs can be confirmed with the snippet after the table):
| Token | ID | Purpose |
|---|---|---|
| <\|start\|> | 0 | Start of sequence |
| <\|end\|> | 1 | End of sequence |
| <\|pad\|> | 2 | Padding token |
| <\|unk\|> | 3 | Unknown token |
| <\|cls\|> | 4 | Classification (BERT) |
| <\|sep\|> | 5 | Separator (BERT) |
| <\|mask\|> | 6 | Mask token (MLM) |
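The IDs in the table can be confirmed at runtime; a minimal sketch:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Look up each special token string and print its id.
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))
```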
Usage
With Transformers
```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
With tokenizers Library
```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
```
Training a Model
```python
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Create model config
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=tokenizer.vocab_size,
)

# Initialize model
model = AutoModelForMaskedLM.from_config(config)

# Train with HuggingFace Trainer...
```
Technical Details
Training Corpus
- Source: KL3M (Kelvin Legal Large Language Model) dataset
- Size: ~44.2 GB
- Domain: Legal documents (copyright-clean)
Vocabulary Structure
- Base vocabulary: 256 bytes + special tokens
- Learned merges: BPE merges up to vocab_size
- Nesting property: every token in a smaller vocabulary exists with the same ID in every larger vocabulary
- Space handling: the space token (Ġ) is a separate entry, never merged with words (the snippet below inspects this)
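The structure above can be inspected by loading the vocabulary directly. A minimal sketch; it assumes the reported 16,384 entries include the special and byte-level tokens:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

vocab = tokenizer.get_vocab()  # token string -> id
print(f"vocabulary entries: {len(vocab)}")

# The space token should be its own vocabulary entry, not a word prefix.
print("standalone space token present:", "Ġ" in vocab)

# Lowest ids: special tokens first, then the byte-level base vocabulary.
for token, idx in sorted(vocab.items(), key=lambda kv: kv[1])[:10]:
    print(idx, repr(token))
```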
When to Use These Tokenizers
Use KL3M Whitespace tokenizers when:
- You want cleaner word→embedding mappings
- Semantic consistency matters more than compression
- You're doing embedding analysis or interpretability research
- You want words to have position-independent representations
Use standard BPE tokenizers (GPT-2 style) when:
- Maximum compression is the priority
- You need compatibility with pretrained models
- Sequence length is a critical constraint
Limitations
- Lower compression: ~25% more tokens than GPT-2-style tokenizers for the same text
- Training domain: Optimized for legal English text; may underperform on other domains
- Multilingual: Trained primarily on English; limited non-English support
License
MIT License
About ALEA Institute
The ALEA Institute develops open-source tools and datasets for legal AI, including the KL3M corpus and tokenizers.
Related Resources
- KL3M Dataset: aleainstitute.ai/work/kl3m
- Multi-word Tokenizers: alea-institute/kl3m-multi-word-002-*
Contact
For questions or issues, please visit: github.com/alea-institute