Zlatorog tokenizer (fixed fast decode)

Fast tokenizer for Zlatorog CPT, derived from Qwen3-30B-A3B-Base with an extended Slovenian/Croatian added vocabulary.

What changed

The Rust fast decoder corrupts some added tokens that contain a code point whose low byte is ≤ 32 (e.g. č, U+010D, whose low byte 0x0D is \r). See tokenizers#1996 and the upstream fix in tokenizers#1995.
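To make the condition concrete, here is a minimal sketch that flags the characters in a string matching a literal reading of that rule. The helper name `low_byte_suspects` is hypothetical, and restricting the check to code points above U+00FF is an assumption made for illustration; it is not the logic used by the Rust decoder or by the audit.

```python
def low_byte_suspects(text: str, threshold: int = 32) -> list[str]:
    """Return characters above U+00FF whose code point low byte is <= threshold.

    Hypothetical helper: a literal reading of the condition described above,
    not the actual decoder logic.
    """
    return [
        ch
        for ch in text
        if ord(ch) > 0xFF and (ord(ch) & 0xFF) <= threshold
    ]


# č is U+010D: low byte 0x0D (13), so it is flagged.
print(low_byte_suspects("Začnimo"))  # ['č']
# š is U+0161: low byte 0x61 (97), so nothing is flagged.
print(low_byte_suspects("šola"))     # []
```

Under this reading, č, ć and đ are flagged while š and ž are not, because only the former have a low byte of 32 or less.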

This repo ships ZlatorogTokenizerFast (tokenization_zlatorog.py), which decodes added tokens the same way as the Transformers 4.x slow Zlatorog tokenizer. Token ids and vocabulary strings are unchanged.

Requirements

  • transformers>=4.45 (4.x) or >=5.0 (5.x)
  • tokenizers>=0.22
  • trust_remote_code=True (loads ZlatorogTokenizerFast)

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "zidsi/Zlatorog-30B-MoE-tokenizer",
    trust_remote_code=True,
)

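# "Začnimo" contains č, the character cited in the bug description.
word = "Začnimo"
ids = tok.encode(word, add_special_tokens=False)
# Decoding must reproduce the input exactly.
assert tok.decode(ids) == word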

Use this tokenizer with zidsi/Zlatorog-30B-MoE-CPT_Long (or any checkpoint trained with the same vocabulary).
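The single-word round trip above can be extended to a word list. The following is a minimal sketch, assuming only that the tokenizer object exposes `encode(text, add_special_tokens=False)` and `decode(ids)` as in the usage snippet; the function `roundtrip_failures` and the stand-in class `CodePointTokenizer` are hypothetical and exist only so the example runs without downloading anything.

```python
from typing import Iterable


def roundtrip_failures(tok, words: Iterable[str]) -> list[tuple[str, str]]:
    """Return (original, decoded) pairs for words that do not survive encode/decode."""
    failures = []
    for word in words:
        ids = tok.encode(word, add_special_tokens=False)
        decoded = tok.decode(ids)
        if decoded != word:
            failures.append((word, decoded))
    return failures


class CodePointTokenizer:
    """Hypothetical stand-in: one id per code point, always round-trips."""

    def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)


# With the stand-in, every word round-trips, so the list is empty.
assert roundtrip_failures(CodePointTokenizer(), ["Začnimo", "ćevapčići", "đak"]) == []
```

To check the real tokenizer, pass the `tok` object loaded with `AutoTokenizer.from_pretrained` in place of the stand-in; an empty result means every word decoded back to its input.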

Audit

321 of 25 893 added tokens (about 1.2%) were affected on the Hub revision audited for the parent model (a2759ee7565dc7c55c9c93c3f9e72190dcf5def4). See the companion repo’s artifacts/affected_added_tokens.json for the full checklist.
