KL3M Whitespace Tokenizer Experiment - 16K

This is the 16,384 token variant of the KL3M (Kelvin Legal Large Language Model) whitespace tokenizer experiment, trained on legal domain text with separate space tokens for cleaner word embeddings.

Overview

The KL3M whitespace tokenizers v5 are a family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the KL3M dataset (copyright-clean legal corpus from the ALEA Institute). These tokenizers:

  • Treat spaces as separate tokens (no Ġ prefix merging)
  • Provide cleaner embedding→semantic mappings where each word has a consistent representation
  • Use hierarchical vocabulary nesting where smaller vocabularies are proper subsets of larger ones
  • Enable vocabulary expansion experiments and transfer learning across vocabulary sizes

Whitespace Tokenization Design

Key difference from GPT-2 style tokenizers: These tokenizers do NOT merge spaces with following words. Instead, spaces get their own token (Ġ), and words are tokenized independently without the space prefix.

Why This Design?

Standard BPE tokenizers (GPT-2, LLaMA, etc.) prepend spaces to tokens, so "the" and " the" become different tokens with different embeddings. This is efficient for compression but creates semantic inconsistency:

Standard BPE                                   KL3M Whitespace
[The][ the][ THE] = 3 different embeddings     [The][Ġ][the][Ġ][THE] = same word, same embedding
"contract" and " contract" are different       "contract" always has ONE embedding

Trade-off

  • Lower compression ratio (~3.1 chars/token vs ~4.0 for GPT-2 style)
  • Cleaner semantics - each word maps to a consistent embedding regardless of position

Example: "The United States"

GPT-2 style: [The][ United][ States] (3 tokens; the space is baked into each non-initial word)

KL3M Whitespace: [The][Ġ][United][Ġ][States] (5 tokens, spaces explicit)
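
To see this behavior directly, here is a minimal sketch using the 16K checkpoint referenced in the Usage section below; the exact token strings depend on the learned merges.

from transformers import PreTrainedTokenizerFast

# Load the 16K tokenizer (repo ID as used in the Usage section)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Expect explicit space tokens between words, e.g. ['The', 'Ġ', 'United', 'Ġ', 'States']
print(tokenizer.tokenize("The United States"))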

Performance

On legal documents, these tokenizers trade compression for semantic clarity:

Tokenizer     Vocab Size   Compression
KL3M v5-64K   65,536       ~3.1 chars/token
KL3M v5-32K   32,768       ~3.0 chars/token
KL3M v5-16K   16,384       ~2.9 chars/token
KL3M v5-8K    8,192        ~2.7 chars/token
KL3M v5-4K    4,096        ~2.5 chars/token
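
The figures above are reported for legal documents; a quick sketch for measuring chars/token on your own text (the sample string is illustrative):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# chars/token for a single document; ratios vary by text
sample = "The Licensor hereby grants to Licensee a non-exclusive license."
ids = tokenizer.encode(sample, add_special_tokens=False)
print(f"chars/token: {len(sample) / len(ids):.2f}")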

Tokenizer Family

This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are identical across all larger vocabularies, enabling seamless vocabulary expansion:

  • 4,096 (4K)
  • 8,192 (8K)
  • 16,384 (16K) ← you are viewing this variant
  • 32,768 (32K)
  • 65,536 (64K)

Key Features

1. Clean Embedding Semantics

By keeping spaces as separate tokens, words have consistent embeddings:

  • "contract" is always the same token, whether at start of sentence or mid-sentence
  • No duplicate vocabulary entries for " word" vs "word"
  • Embedding space maps more directly to semantic meaning
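
A small sanity check of this property, using the 16K checkpoint from the Usage section (a sketch; the exact tokens depend on the vocabulary):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# "contract" should map to the same token at the start of a string and after
# a space, because the space is tokenized separately rather than merged in.
print(tokenizer.tokenize("contract terms"))
print(tokenizer.tokenize("the contract terms"))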

2. Hierarchical Vocabulary Nesting

Token IDs 0-4,095 are identical across all tokenizer sizes. This enables:

  • Vocabulary expansion during training: Start with 4K vocab, expand to 32K mid-training
  • Transfer learning: Initialize larger vocab models from smaller vocab checkpoints
  • Controlled ablations: Compare vocab sizes while maintaining token alignment
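
A sketch of how the nesting property could be checked. The 4K repo ID below is a hypothetical sibling, assumed to follow the same naming pattern as the 16K checkpoint used in this card:

from transformers import PreTrainedTokenizerFast

# Hypothetical sibling repo ID (assumed naming pattern)
tok_4k = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-4k")
tok_16k = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-16k")

# Every (token, id) pair in the 4K vocabulary should also appear, with the
# same ID, in the 16K vocabulary.
vocab_16k = tok_16k.get_vocab()
assert all(vocab_16k.get(token) == idx for token, idx in tok_4k.get_vocab().items())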

3. Legal Domain Optimization

Trained on the KL3M corpus (44GB of legal text):

  • Court opinions and case law
  • Contracts and agreements
  • Patents and IP documents
  • Legal briefs and filings
  • Statutory and regulatory text

4. Special Tokens

Seven essential special tokens for language model training:

Token       ID   Purpose
<|start|>   0    Start of sequence
<|end|>     1    End of sequence
<|pad|>     2    Padding token
<|unk|>     3    Unknown token
<|cls|>     4    Classification (BERT)
<|sep|>     5    Separator (BERT)
<|mask|>    6    Mask token (MLM)
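
The IDs can be checked directly against the table above (a quick sketch with the 16K checkpoint):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Print each special token's ID; values should match the table above
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))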

Usage

With Transformers

from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

With tokenizers Library

from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")

Training a Model

from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

# Create model config
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=tokenizer.vocab_size,
)

# Initialize model
model = AutoModelForMaskedLM.from_config(config)

# Train with HuggingFace Trainer...
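
Depending on how the tokenizer files are packaged, the special-token roles may not already be wired up on the loaded fast tokenizer. If they are not, mapping them explicitly (token strings per the Special Tokens table above) lets padding and MLM-masking utilities find them; since these tokens already exist in the vocabulary, this should not change vocab_size.

# Register the special-token roles if they are not already configured
tokenizer.add_special_tokens({
    "bos_token": "<|start|>",
    "eos_token": "<|end|>",
    "pad_token": "<|pad|>",
    "unk_token": "<|unk|>",
    "cls_token": "<|cls|>",
    "sep_token": "<|sep|>",
    "mask_token": "<|mask|>",
})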

Technical Details

Training Corpus

  • Source: KL3M (Kelvin Legal Large Language Model) dataset
  • Size: ~44.2 GB
  • Domain: Legal documents (copyright-clean)

Vocabulary Structure

  • Base vocabulary: 256 bytes + special tokens
  • Learned merges: BPE merges up to vocab_size
  • Nesting property: All tokens in size N exist in size 2N
  • Space handling: Space (Ġ) is a separate token, never merged with words
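
A sketch for inspecting this structure from the loaded tokenizer (byte-level entries render as single characters, so this is only a rough view):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-tokenizer-003-16k"
)

vocab = tokenizer.get_vocab()
print(len(vocab))        # total vocabulary size
print(vocab.get("Ġ"))    # ID of the standalone space token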

When to Use These Tokenizers

Use KL3M Whitespace tokenizers when:

  • You want cleaner word→embedding mappings
  • Semantic consistency matters more than compression
  • You're doing embedding analysis or interpretability research
  • You want words to have position-independent representations

Use standard BPE tokenizers (GPT-2 style) when:

  • Maximum compression is the priority
  • You need compatibility with pretrained models
  • Sequence length is a critical constraint

Limitations

  • Lower compression: ~25% more tokens than GPT-2 style for the same text
  • Training domain: Optimized for legal English text; may underperform on other domains
  • Multilingual: Trained primarily on English; limited non-English support
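
The token overhead can be estimated on your own text by comparing against a GPT-2 style tokenizer (a sketch; the exact overhead varies by document):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

kl3m = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-tokenizer-003-16k")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

text = "The Licensor hereby grants to Licensee a non-exclusive license."
print("KL3M tokens: ", len(kl3m.encode(text, add_special_tokens=False)))
print("GPT-2 tokens:", len(gpt2.encode(text)))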

License

MIT License

About ALEA Institute

The ALEA Institute develops open-source tools and datasets for legal AI, including the KL3M corpus and tokenizers.

Contact

For questions or issues, please visit: github.com/alea-institute
