[Fine-tuning for other languages] How to obtain or train "qwen-tts-tokenizer" used in Qwen2.5-Omni?
I'm currently exploring the Qwen2.5-Omni multimodal model (as described in the Qwen2.5-Omni Technical Report) and am particularly interested in adapting it to generate speech in languages beyond English and Chinese, specifically Korean.
The Qwen2.5-Omni paper states that the Talker module generates speech using discrete codec tokens produced by an audio tokenizer named "qwen-tts-tokenizer":
"We designed an efficient speech codec named qwen-tts-tokenizer. qwen-tts-tokenizer efficiently represents key information of speech and can be decoded to speech streamingly through a causal audio decoder."
However, the publicly available implementation (e.g., on Hugging Face) appears to include only the decoding side of the pipeline:
- Talker: text → speech codec tokens
- Token2Wav: speech codec tokens → mel-spectrogram → waveform
I couldn't find any publicly accessible method or code snippet for the reverse step (waveform → codec tokens) using the mentioned "qwen-tts-tokenizer".
This leads to a few questions:
- Is the "qwen-tts-tokenizer" publicly available or open-sourced? If yes, could you point me to the repository or implementation?
- If it is not publicly available, are there plans to release the tokenizer or its pretrained checkpoints?
- Alternatively, could you share what kind of model architecture or codec method (e.g., EnCodec, SoundStream, VQ-VAE) the tokenizer uses? This information would be extremely helpful for recreating or fine-tuning a similar tokenizer.
Having access to this tokenizer is critical for training the Talker module on new languages (like Korean); otherwise, creating the necessary training data (text → codec token pairs) is not possible.
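(For reference, this is the kind of step I mean. Below is a rough sketch that uses the open EnCodec codec from transformers purely as a stand-in, since the qwen-tts-tokenizer encoder isn't released. The file name is just a placeholder, and the resulting tokens obviously wouldn't match the Talker's codec vocabulary; it only illustrates the waveform → codec tokens direction and the shape of the (text, codec tokens) pairs that would be needed.)
# Stand-in sketch: waveform -> discrete codec tokens with EnCodec.
# NOT compatible with Qwen2.5-Omni's Talker vocabulary; illustration only.
import torch
import torchaudio
from transformers import AutoProcessor, EncodecModel
codec = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
wav, sr = torchaudio.load("korean_sample.wav")  # hypothetical Korean utterance
wav = wav.mean(dim=0)  # downmix to mono
wav = torchaudio.functional.resample(wav, sr, processor.sampling_rate)
inputs = processor(raw_audio=wav.numpy(),
                   sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    encoded = codec.encode(inputs["input_values"], inputs["padding_mask"])
codec_tokens = encoded.audio_codes  # (n_chunks, batch, n_codebooks, n_frames); exact layout depends on the codec
print(codec_tokens.shape)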
I appreciate any guidance, details, or alternative recommendations you can provide.
Thank you!
cc. @littlebird13 , @xiongwang , @Jin-xu
Could you solve this issue?
Hey, in case you're interested, this is how far I've gotten: I looked at how both the talker and the token2wav modules fit into the model and how to access them for training.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5OmniForConditionalGeneration

# Load the full Omni model (thinker + talker + token2wav)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B", torch_dtype="auto", device_map="auto"
)
print("="*60)
print("Freezing visual and audio encoders model parameters...")
for param in model.thinker.audio_tower.parameters():
param.requires_grad = False
for param in model.thinker.visual.parameters():
param.requires_grad = False
for param in model.token2wav.code2wav_bigvgan_model.parameters():
param.requires_grad = False
thinker_target_modules = [
    # Attention projections
    "q_proj",     # Query projection
    "k_proj",     # Key projection
    "v_proj",     # Value projection
    "o_proj",     # Output projection
    # MLP projections
    "gate_proj",  # Gate projection (SwiGLU)
    "up_proj",    # Up projection
    "down_proj",  # Down projection
]
talker_target_modules = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]
lora_config_thinker = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=thinker_target_modules,
    lora_dropout=0.05,
    bias="none",
    modules_to_save=None,
)
print("="*60)
print("Thinker LoRA Config Applied:")
thinker_peft = get_peft_model(model.thinker.model, lora_config_thinker)
thinker_peft.print_trainable_parameters()
lora_config_talker = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=talker_target_modules,
    lora_dropout=0.05,
    bias="none",
    modules_to_save=None,
)
print("="*60)
print("Talker LoRA Config Applied:")
talker_peft = get_peft_model(model.talker.model, lora_config_talker)
talker_peft.print_trainable_parameters()
token2wav_lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "to_q",  # in each of the 22 DiT layers
        "to_k",
        "to_v",
    ],
    lora_dropout=0.05,
    bias="none",
)
print("="*60)
print("Token2Wav LoRA Config Applied:")
token2wav_peft = get_peft_model(
    model.token2wav.code2wav_dit_model,
    token2wav_lora_config,
)
token2wav_peft.print_trainable_parameters()
try:
    # Unfreeze the speaker encoder inside the DiT input embedding
    for param in model.token2wav.code2wav_dit_model.input_embed.spk_encoder.parameters():
        param.requires_grad = True
    print("="*60)
    print("Speaker Encoder training activated.")
except Exception as e:
    print("Failed to activate speaker encoder training:", e)