Hindi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text, using grapheme-level tokenization to properly handle Devanagari script.

Model Description

This tokenizer implements the BPE algorithm at the grapheme level rather than the byte level, which is crucial for Hindi because:

  • Hindi uses combining characters (matras) that attach to consonants
  • Conjuncts (consonant clusters) must be kept intact as single units
  • Grapheme-level merges preserve linguistic units better than byte-level tokenization (see the sketch below)
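As a concrete illustration (not the tokenizer's internal code), the regex package's \X pattern splits text into extended grapheme clusters, so a consonant plus its matra stays together as one unit; BPE merges then operate over those units. The sample string and the single pair-counting step below are illustrative assumptions only:

import regex
from collections import Counter

text = "किताब और किताबें"  # "book and books" (illustrative sample)

# \X matches one extended grapheme cluster, so a consonant and its
# matra stay together: कि is one grapheme but two code points
graphemes = regex.findall(r"\X", text)
print(len(text), "code points ->", len(graphemes), "graphemes")
print(graphemes)

# One illustrative BPE step over graphemes: count adjacent pairs and
# pick the most frequent one to merge; training repeats this step
# (4,000 merges for this tokenizer, and without merging across spaces)
pairs = Counter(zip(graphemes, graphemes[1:]))
best_pair, count = pairs.most_common(1)[0]
print("most frequent pair:", best_pair, "x", count)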

Training Details

  • Training corpus size: 60,565 UTF-8 bytes
  • Number of merges: 4,000
  • Vocabulary size: 4,629
  • Grapheme compression ratio: 5.28x
  • Byte compression ratio: 18.62x

Performance

Compression Ratios

The tokenizer achieves substantial compression while remaining losslessly reversible (decoding recovers the input exactly):

  • Raw graphemes: 17,179
  • Compressed tokens: 3,253
  • Grapheme compression: 5.28x
  • Byte compression: 18.62x
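For a quick sanity check, both ratios follow directly from the counts above (assuming the byte ratio is computed over the same 60,565-byte training corpus):

# Reported counts from this model card
raw_graphemes = 17_179
compressed_tokens = 3_253
corpus_bytes = 60_565  # UTF-8 bytes of the training corpus

print(f"{raw_graphemes / compressed_tokens:.2f}x")  # 5.28x grapheme compression
print(f"{corpus_bytes / compressed_tokens:.2f}x")   # 18.62x byte compression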

Usage

from hindi_tokenizer import HindiTokenizer  # class must be importable for pickle to resolve it
import pickle

# Load the trained tokenizer
with open('hindi_tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

# Encode text ("understanding market trends")
text = "मार्केट ट्रेंड्स को समझना"
encoded = tokenizer.encode(text)
print(f"Token IDs: {encoded}")

# Decode back to the original string
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")

# Get compression stats
stats = tokenizer.get_compression_stats(text)
print(f"Compression: {stats['byte_compression_ratio']:.2f}x")
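Given the reversibility claim above, a round-trip check is an easy way to verify behavior on your own text; this sketch uses only the encode/decode API shown in the snippet:

# Round trip: decoding the encoded IDs should recover the input exactly
sample = "मार्केट ट्रेंड्स को समझना"
assert tokenizer.decode(tokenizer.encode(sample)) == sample
print("round trip OK")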

Installation

The only listed dependency is the regex package, which provides the Unicode grapheme-cluster matching used to segment Devanagari text:

pip install regex

Limitations and Bias

This tokenizer is trained on Hindi text and optimized for Devanagari script. Performance may degrade on:

  • Mixed language text (Hindi-English code-switching)
  • Other Indic languages using Devanagari script
  • Informal or social media text with non-standard spellings

Citation

If you use this tokenizer in your research, please cite:

@software{hindi_bpe_tokenizer,
  title = {Hindi BPE Tokenizer},
  author = {agileabhi},
  year = {2025},
  url = {https://github.com/agileabhi/hindi-bpe-tokenizer}
}

Repository

GitHub Repository: https://github.com/agileabhi/hindi-bpe-tokenizer
