πŸ—£οΈ ssml-break2ssml-fr-lora

This is the second-stage LoRA adapter for French SSML generation, converting pause-annotated text into full SSML markup with <break> tags.

This model is part of the cascade described in the paper:

"Improving French Synthetic Speech Quality via SSML Prosody Control" Nassima Ould-Ouali, Γ‰ric Moulines – ICNLSP 2025 (Springer LNCS) [accepted].


🧠 Model Details

  • Base model: Qwen/Qwen2.5-7B
  • Adapter method: LoRA (Low-Rank Adaptation via peft)
  • LoRA rank: 8, alpha: 16
  • Training: 5 epochs, batch size 1 (gradient accumulation)
  • Languages: French
  • Model size: 7B base (this repository contains the adapter weights only)
  • License: Apache 2.0

🧩 Pipeline Overview

This model is part of a two-stage SSML cascade for improving French TTS prosody:

| Step | Model | Description |
|------|-------|-------------|
| 1️⃣ | `nassimaODL/ssml-text2breaks-fr-lora` | Inserts symbolic pauses like `#250`, `#500` |
| 2️⃣ | `nassimaODL/ssml-break2ssml-fr-lora` | Converts symbols to `<break time="..."/>` SSML |

✨ Example

```
Input:  Bonjour#250 comment vas-tu ?
Output: Bonjour<break time="250ms"/> comment vas-tu ?
```

πŸš€ How to use


```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the Qwen2.5-7B base model and attach the stage-2 LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")
model = PeftModel.from_pretrained(base_model, "nassimaODL/ssml-break2ssml-fr-lora")

# The input already carries symbolic pause markers produced by the stage-1 model
input_text = "Bonjour#250 comment vas-tu ?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
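
To run the full cascade from plain text, the stage-1 adapter can be chained in front of this model. A minimal sketch, assuming each adapter maps its input text directly to annotated output with no extra prompt template (check each model card for the exact input format):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

def run(model, text):
    # Greedy decoding: markup insertion should be deterministic
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Causal LMs echo the prompt; strip it if your post-processing
    # expects only the annotated continuation.
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Stage 1: plain text -> symbolic pause markers (#250, #500, ...)
stage1 = PeftModel.from_pretrained(base_model, "nassimaODL/ssml-text2breaks-fr-lora")
with_breaks = run(stage1, "Bonjour comment vas-tu ?")

# Detach the stage-1 LoRA layers before injecting stage 2
base_model = stage1.unload()

# Stage 2: symbolic markers -> <break time="..."/> SSML
stage2 = PeftModel.from_pretrained(base_model, "nassimaODL/ssml-break2ssml-fr-lora")
print(run(stage2, with_breaks))
```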

πŸ§ͺ Evaluation Summary

| Metric | Value |
|--------|-------|
| Pause insertion accuracy | 87.3% |
| RMSE (pause duration) | 98.5 ms |
| MOS gain (vs. baseline) | +0.42 |

Evaluation was performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) improvements were assessed using TTS outputs rendered with the Azure fr-FR-HenriNeural voice and rated by 30 native French speakers.
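
For reference, the two automatic metrics can be computed along the following lines. A rough sketch, not the paper's exact protocol: it assumes a break counts as correctly inserted when predicted and reference tags occur at the same character position in the tag-stripped text, and RMSE is taken over the durations of matched breaks:

```python
import math
import re

BREAK_RE = re.compile(r'<break time="(\d+)ms"\s*/>')

def parse_breaks(ssml):
    """Map each break's position in the tag-stripped text to its duration (ms)."""
    positions, removed = {}, 0
    for m in BREAK_RE.finditer(ssml):
        positions[m.start() - removed] = int(m.group(1))
        removed += m.end() - m.start()
    return positions

def score(pred_ssml, ref_ssml):
    pred, ref = parse_breaks(pred_ssml), parse_breaks(ref_ssml)
    matched = set(pred) & set(ref)
    accuracy = len(matched) / max(len(ref), 1)
    rmse = math.sqrt(
        sum((pred[p] - ref[p]) ** 2 for p in matched) / max(len(matched), 1)
    )
    return accuracy, rmse

acc, rmse = score('Bonjour<break time="300ms"/> Γ§a va ?',
                  'Bonjour<break time="250ms"/> Γ§a va ?')
# acc == 1.0, rmse == 50.0
```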


πŸ“š Training Data

This LoRA adapter was trained on a corpus of ~4,500 French utterances. Input texts were annotated with symbolic pause indicators (e.g., #250 for 250ms), automatically aligned using a combination of Whisper-Kyutai timestamping and F0/syntactic heuristics.

Annotations were refined via a hybrid heuristic rule set combining the following cues (a simplified sketch follows the list):

  • Voice activity boundaries (via Auditok)
  • F0 contour analysis (pitch dips before breaks)
  • Syntactic cues (punctuation, conjunctions)
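
Once a candidate boundary is confirmed, its measured silence is snapped to a symbolic marker. A simplified illustration with assumed marker values and thresholds, not the exact heuristics from the pipeline:

```python
# Hypothetical duration grid and threshold; the real pipeline derives
# markers from Whisper-Kyutai timestamps refined by the cues listed above.
MARKERS_MS = [250, 500, 750, 1000]
MIN_PAUSE_MS = 150  # assumed: shorter silences are not annotated

def pause_to_marker(silence_ms):
    """Snap a measured inter-word silence to the nearest symbolic marker."""
    if silence_ms < MIN_PAUSE_MS:
        return None
    nearest = min(MARKERS_MS, key=lambda m: abs(m - silence_ms))
    return f"#{nearest}"

assert pause_to_marker(280) == "#250"   # "Bonjour" + 280 ms silence -> "Bonjour#250"
assert pause_to_marker(90) is None
```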

For full details, see our data preparation pipeline on GitHub:
πŸ”— https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS


βš™οΈ Training Setup

  • Compute: Jean-Zay (GENCI/IDRIS), 1Γ— A100 80 GB
  • Framework: HuggingFace transformers + peft
  • LoRA method: rank = 8, alpha = 16, dropout = 0.05
  • Precision: bf16
  • Max sequence length: 768 tokens (256 input + 512 output)
  • Epochs: 5
  • Optimizer: AdamW (lr = 2e-4, no warmup)
  • LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Training was performed using the Unsloth SFTTrainer with PEFT adapter injection on the Qwen2.5-7B base model.
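
The hyperparameters above translate roughly into the following peft/transformers configuration. A sketch for reproducibility only; the actual run used the Unsloth SFTTrainer, and the gradient-accumulation value is an assumption (the card only states that accumulation was used):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                       # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
)

training_args = TrainingArguments(
    output_dir="breaks2ssml-lora",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # assumed; card only says "gradient accumulation"
    learning_rate=2e-4,
    warmup_steps=0,                  # no warmup
    bf16=True,
    optim="adamw_torch",
)
```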


⚠️ Limitations

  • Only <break> tags are supported; no pitch, rate, or emphasis control yet.
  • Pause accuracy is sensitive to punctuation and malformed inputs.
  • SSML output has been optimized primarily for Azure voices (e.g., fr-FR-HenriNeural). Other engines may interpret <break> tags differently; see the wrapping sketch after this list.
  • The model assumes the presence of symbolic pause markers in the input (e.g., #250). For automatic prediction of such symbols, refer to our stage-1 model:
    πŸ”— nassimaODL/ssml-text2breaks-fr-lora
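
The model emits a bare sentence fragment with <break/> tags, so it must be wrapped in an SSML envelope before being sent to a TTS engine. A minimal wrapper for the Azure voice used in evaluation, following Azure's standard SSML document format:

```python
def wrap_ssml(fragment, voice="fr-FR-HenriNeural"):
    """Wrap stage-2 output in a minimal SSML document for Azure TTS."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="fr-FR">'
        f'<voice name="{voice}">{fragment}</voice>'
        "</speak>"
    )

print(wrap_ssml('Bonjour<break time="250ms"/> comment vas-tu ?'))
```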

πŸ“– Citation

```bibtex
@inproceedings{ould-ouali2025improving,
  author    = {Nassima Ould-Ouali and Awais Sani and Tim Luka Horstmann and Jonah Dauvet and Ruben Bueno and Γ‰ric Moulines},
  title     = {Improving French Synthetic Speech Quality via SSML Prosody Control},
  booktitle = {Proceedings of the 9th International Conference on Natural Language and Speech Processing (ICNLSP)},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  year      = {2025},
  note      = {To appear}
}
```
