[Fine-tuning for other languages] How to obtain or train "qwen-tts-tokenizer" used in Qwen2.5-Omni?
I'm currently exploring the Qwen2.5-Omni multimodal model (as described in the Qwen2.5-Omni Technical Report) and am particularly interested in adapting it to generate speech in languages beyond English and Chinese, specifically Korean.
The Qwen2.5-Omni paper states that the Talker module generates speech using discrete codec tokens produced by an audio tokenizer named "qwen-tts-tokenizer":
"We designed an efficient speech codec named qwen-tts-tokenizer. qwen-tts-tokenizer efficiently represents key information of speech and can be decoded to speech streamingly through a causal audio decoder."
However, the publicly available implementation (e.g., on Hugging Face) appears to include only the decoding side of the pipeline:
- Talker: text → speech codec tokens
- Token2Wav: speech codec tokens → mel-spectrogram → waveform
I couldn't find any publicly accessible method or code snippet for the reverse step (waveform → codec tokens) using the mentioned "qwen-tts-tokenizer".
This leads to a few questions:
- Is the "qwen-tts-tokenizer" publicly available or open-sourced? If yes, could you point me to the repository or implementation?
- If it is not publicly available, are there plans to release the tokenizer or its pretrained checkpoints?
- Alternatively, could you share what kind of model architecture or codec method (e.g., EnCodec, SoundStream, VQ-VAE) the tokenizer uses? This information would be extremely helpful for recreating or fine-tuning a similar tokenizer.
Having access to this tokenizer is critical for training the Talker module on new languages (like Korean); otherwise, creating the necessary training data (text → codec token pairs) is not possible.
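(For reference, this is the kind of step I mean. Below is a rough sketch that uses the open EnCodec codec from transformers purely as a stand-in, since the qwen-tts-tokenizer encoder isn't released. The file name is just a placeholder, and the resulting tokens obviously wouldn't match the Talker's codec vocabulary; it only illustrates the waveform → codec tokens direction and the shape of the (text, codec tokens) pairs that would be needed.)
# Stand-in sketch: waveform -> discrete codec tokens with EnCodec.
# NOT compatible with Qwen2.5-Omni's Talker vocabulary; illustration only.
import torch
import torchaudio
from transformers import AutoProcessor, EncodecModel
codec = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
wav, sr = torchaudio.load("korean_sample.wav")  # hypothetical Korean utterance
wav = wav.mean(dim=0)  # downmix to mono
wav = torchaudio.functional.resample(wav, sr, processor.sampling_rate)
inputs = processor(raw_audio=wav.numpy(),
                   sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    encoded = codec.encode(inputs["input_values"], inputs["padding_mask"])
codec_tokens = encoded.audio_codes  # (n_chunks, batch, n_codebooks, n_frames); exact layout depends on the codec
print(codec_tokens.shape)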
I appreciate any guidance, details, or alternative recommendations you can provide.
Thank you!
cc. @littlebird13 , @xiongwang , @Jin-xu
Could you solve this issue?
Hey, in case you're interested, this is how far I've gotten: I looked at how both the talker and the token2wav modules fit into the model and how to access them for training.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5OmniForConditionalGeneration

# Load the full Omni model (thinker + talker + token2wav)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B", torch_dtype="auto", device_map="auto"
)
print("="*60)
print("Freezing visual and audio encoders model parameters...")
for param in model.thinker.audio_tower.parameters():
param.requires_grad = False
for param in model.thinker.visual.parameters():
param.requires_grad = False
for param in model.token2wav.code2wav_bigvgan_model.parameters():
param.requires_grad = False
thinker_target_modules = [
    # Attention projections
    "q_proj",     # Query projection
    "k_proj",     # Key projection
    "v_proj",     # Value projection
    "o_proj",     # Output projection
    # MLP projections
    "gate_proj",  # Gate projection (SwiGLU)
    "up_proj",    # Up projection
    "down_proj",  # Down projection
]
talker_target_modules = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]
lora_config_thinker = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=thinker_target_modules,
    lora_dropout=0.05,
    bias="none",
    modules_to_save=None,
)
print("="*60)
print("Thinker LoRA Config Applied:")
thinker_peft = get_peft_model(model.thinker.model, lora_config_thinker)
thinker_peft.print_trainable_parameters()
lora_config_talker = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=talker_target_modules,
    lora_dropout=0.05,
    bias="none",
    modules_to_save=None,
)
print("="*60)
print("Talker LoRA Config Applied:")
talker_peft = get_peft_model(model.talker.model, lora_config_talker)
talker_peft.print_trainable_parameters()
token2wav_lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "to_q",  # in each of the 22 DiT layers
        "to_k",
        "to_v",
    ],
    lora_dropout=0.05,
    bias="none",
)
print("="*60)
print("Token2Wav LoRA Config Applied:")
token2wav_peft = get_peft_model(
    model.token2wav.code2wav_dit_model,
    token2wav_lora_config,
)
token2wav_peft.print_trainable_parameters()
try:
    # Unfreeze the speaker encoder inside the DiT input embedding
    for param in model.token2wav.code2wav_dit_model.input_embed.spk_encoder.parameters():
        param.requires_grad = True
    print("="*60)
    print("Speaker Encoder training activated.")
except Exception as e:
    print("Failed to activate speaker encoder training:", e)