---
license: llama2
library_name: transformers
---

# CJK Tokenizer

A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. It is designed for ablation experiments that evaluate the impact of **character-level** tokenization, rather than subword-level tokenization, on the semantic understanding of CJK text.

## Features

- **Extended vocabulary**
  - Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)
  - Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
- **Character-level**
  - Each CJK character is treated as its own token (see the verification sketch in the appendix below)
  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")

text = "天地玄黃, 宇宙洪荒。 日月盈昃, 辰宿列張。"
tokens = tokenizer(text)
print(tokens.tokens())  # tokens() is a method on the fast tokenizer's output
```

## Tokenizer Details

| Property        | Value                                                |
|-----------------|------------------------------------------------------|
| Base tokenizer  | LLaMA tokenizer                                      |
| Vocabulary size | LLaMA base vocabulary plus 16,689 added CJK tokens   |
| Added tokens    | All CJK Unified Ideographs (Unicode)                 |
| Special tokens  | Inherited from LLaMA (`<s>`, `</s>`, `<unk>`, etc.)  |

## Experimental Purpose

This tokenizer was created for ablation studies in CJK language modeling:

1. **Hypothesis:** Character-level tokenization of CJK scripts may improve or otherwise alter a model's semantic understanding compared to subword tokenization.
2. **Method:**
   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer and (b) this CJKCharTokenizer (a setup sketch appears in the appendix below).
   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
3. **Analysis:** Evaluate metrics such as perplexity and downstream task accuracy, along with qualitative behavior on CJK-specific phenomena.

## Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.

## License

The code in this project is released under the MIT License; the base tokenizer remains subject to Meta's LLaMA 2 license terms (reflected in the `license: llama2` metadata above).

## Citation

If you use this tokenizer in your research or projects, please cite:

```bibtex
@misc{CJKCharTokenizer2025,
  author       = {Your Name},
  title        = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
}
```
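## Appendix: Verifying Character-Level Tokenization

To sanity-check the character-level claim in the Features section, here is a minimal sketch (not part of the original model card) that compares token counts for CJK and Latin input. Only the repository id `AgentBull/CJK-Tokenizer` is taken from this card; the sample strings are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")

cjk_text = "天地玄黃"        # 4 CJK characters
latin_text = "tokenization"  # Latin text should keep LLaMA's subword scheme

cjk_ids = tokenizer(cjk_text, add_special_tokens=False)["input_ids"]
latin_ids = tokenizer(latin_text, add_special_tokens=False)["input_ids"]

# If the character-level claim holds, the CJK string yields one token per
# character (SentencePiece may prepend a "▁" marker, shifting the count by one).
print(len(cjk_text), len(cjk_ids))
print(tokenizer.convert_ids_to_tokens(cjk_ids))
print(tokenizer.convert_ids_to_tokens(latin_ids))  # expect multi-character subwords
```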
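## Appendix: Ablation Fine-Tuning Sketch

The Method step above pairs this tokenizer with a LLaMA-architecture model, which requires one model-side change: the embedding matrices must grow to cover the extended vocabulary. The following is a minimal sketch under the assumption that you have access to a LLaMA checkpoint such as `meta-llama/Llama-2-7b-hf` (not named in this card).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")
# Assumed checkpoint for illustration; substitute any LLaMA-family model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Grow the input/output embeddings to match the extended vocabulary. Rows for
# the added CJK tokens are freshly initialized by the library and carry no
# pretrained signal, so they must be learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

From here, fine-tune both arms of the ablation with an identical recipe (data, steps, hyperparameters), changing only the tokenizer, so that any metric difference is attributable to the tokenization scheme.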