French Breaks-to-SSML LoRA Model
hi-paris/ssml-breaks2ssml-fr-lora is a LoRA adapter fine-tuned on Qwen2.5-7B to convert text with symbolic <break/>
markers into rich SSML markup with prosody control (pitch, rate, volume) and precise break timing.
This is the second stage of a two-step SSML cascade pipeline for improving French text-to-speech prosody control.
Paper: "Improving Synthetic Speech Quality via SSML Prosody Control"
Authors: Nassima Ould-Ouali, Awais Sani, Ruben Bueno, Jonah Dauvet, Tim Luka Horstmann, Eric Moulines
Conference: ICNLSP 2025
Demo & Audio Samples: https://horstmann.tech/ssml-prosody-control/
Pipeline Overview
| Stage | Model | Purpose |
|---|---|---|
| 1 | hi-paris/ssml-text2breaks-fr-lora | Predicts natural pause locations |
| 2 | hi-paris/ssml-breaks2ssml-fr-lora | Converts breaks to full SSML with prosody |
Example
Input:
Bonjour comment allez-vous ?<break/>
Output:
<prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous ?</prosody><break time="300ms"/>
Quick Start
Installation
pip install torch transformers peft accelerate
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "hi-paris/ssml-breaks2ssml-fr-lora")
# Prepare input (text with <break/> markers)
text_with_breaks = "Bonjour comment allez-vous ?<break/>"
formatted_input = f"### Task:\nConvert text to SSML with pauses:\n\n### Text:\n{text_with_breaks}\n\n### SSML:\n"
# Generate with greedy decoding (temperature is omitted: it has no effect when do_sample=False)
inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode and keep only the text that follows the "### SSML:" marker
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
result = response.split("### SSML:\n")[-1].strip()
print(result)
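The generated string is an SSML fragment rather than a complete document, so a quick sanity check before handing it to a TTS engine can catch truncated or altered output. The following is a minimal sketch using only the standard library; the helper names `is_well_formed_fragment`, `plain_text`, and `text_preserved` are hypothetical and not part of the released code.

```python
import re
import xml.etree.ElementTree as ET


def is_well_formed_fragment(ssml_fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a dummy root."""
    try:
        ET.fromstring(f"<root>{ssml_fragment}</root>")
        return True
    except ET.ParseError:
        return False


def plain_text(markup: str) -> str:
    """Drop every tag and collapse whitespace, leaving only the spoken text."""
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    return " ".join(without_tags.split())


def text_preserved(model_input: str, model_output: str) -> bool:
    """Check that the model changed only the markup, not the words."""
    return plain_text(model_input) == plain_text(model_output)
```

A fragment that fails either check (for example, one cut off by `max_new_tokens`) is probably better regenerated than sent to synthesis.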
Production Usage (Recommended)
For production use with memory optimization, see our inference repository:
from breaks2ssml_inference import Breaks2SSMLInference
# Memory-efficient shared model approach
model = Breaks2SSMLInference()
result = model.predict("Bonjour comment allez-vous ?<break/>")
Full Cascade Example
from breaks2ssml_inference import CascadedInference
# Initialize full pipeline (memory efficient - single base model)
cascade = CascadedInference()
# Convert plain text directly to full SSML
text = "Bonjour comment allez-vous aujourd'hui ?"
ssml_output = cascade.predict(text)
print(ssml_output)
# Output: '<prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous aujourd'hui ?</prosody><break time="300ms"/>'
Model Details
- Base Model: Qwen/Qwen2.5-7B
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Rank: 8, Alpha: 16
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Training: 5 epochs, batch size 1 with gradient accumulation
- Language: French
- Model Size: 7B parameters (LoRA adapter: ~81MB)
- License: Apache 2.0
Performance
| Metric | Score |
|---|---|
| Pause Insertion Accuracy | 87.3% |
| RMSE (pause duration) | 98.5 ms |
| MOS gain (vs. baseline) | +0.42 |
Evaluation was performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) improvements were assessed on TTS outputs synthesized with the Azure Henri voice and rated by 30 native French speakers.
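For readers unfamiliar with the pause-duration metric, RMSE is the square root of the mean squared difference between predicted and reference durations. The sketch below assumes the pauses have already been aligned one-to-one; the card does not say how unmatched pauses are handled, and the function name `pause_rmse` is hypothetical.

```python
import math
from typing import Sequence


def pause_rmse(predicted_ms: Sequence[float], reference_ms: Sequence[float]) -> float:
    """Root-mean-square error between aligned pause durations, in milliseconds."""
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("predicted and reference pauses must be aligned one-to-one")
    if not predicted_ms:
        raise ValueError("at least one pause is required")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted_ms, reference_ms)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

As an illustration with made-up values (not data from the paper), predictions of 300 ms and 500 ms against references of 400 ms and 400 ms give an RMSE of 100 ms.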
SSML Features Generated
- Prosody Control: Dynamic pitch, rate, and volume adjustments
- Break Timing: Precise pause durations (e.g., <break time="300ms"/>)
- Contextual Adaptation: Prosody values adapted to semantic content
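To inspect which prosody values and pause durations the model produced, the output can be parsed with a standard XML parser. This is a minimal sketch, assuming the output contains only `prosody` and `break` elements as in the example above; `summarize_ssml` is a hypothetical helper, not part of the released code.

```python
import xml.etree.ElementTree as ET


def summarize_ssml(ssml_fragment: str) -> list[dict]:
    """List prosody spans and breaks in document order."""
    # Wrap in a dummy root because the model output is a fragment, not a document.
    root = ET.fromstring(f"<root>{ssml_fragment}</root>")
    events = []
    for element in root.iter():
        if element.tag == "prosody":
            events.append({
                "type": "prosody",
                "text": "".join(element.itertext()).strip(),
                "pitch": element.get("pitch"),
                "rate": element.get("rate"),
                "volume": element.get("volume"),
            })
        elif element.tag == "break":
            events.append({"type": "break", "time": element.get("time")})
    return events
```

Attributes the model did not emit come back as `None`, which makes missing prosody settings easy to spot.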
Limitations
- Optimized primarily for Azure TTS voices (e.g., fr-FR-HenriNeural)
- Requires input text with <break/> markers (use Stage 1 model for automatic prediction)
- Currently supports break tags only (pitch/rate/volume via prosody wrapper)
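Because the model emits a fragment, Azure TTS needs it wrapped in a `speak` root and a `voice` element before synthesis. The sketch below builds that wrapper as a string and checks that it parses; `wrap_for_azure` is a hypothetical helper, and the actual synthesis call (for example `speak_ssml_async` in the Azure Speech SDK) is deliberately left out.

```python
import xml.etree.ElementTree as ET

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def wrap_for_azure(
    ssml_fragment: str,
    voice: str = "fr-FR-HenriNeural",
    lang: str = "fr-FR",
) -> str:
    """Wrap a model-generated fragment in the speak/voice envelope Azure expects."""
    document = (
        f'<speak version="1.0" xmlns="{SSML_NAMESPACE}" xml:lang="{lang}">'
        f'<voice name="{voice}">{ssml_fragment}</voice>'
        "</speak>"
    )
    # Parsing raises xml.etree.ElementTree.ParseError if the fragment is malformed.
    ET.fromstring(document)
    return document
```

Other voices can be tried by passing a different `voice` name, though the limitation above suggests prosody values may transfer less well.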
Resources
- Full Pipeline Code: https://github.com/TimLukaHorstmann/cascading_model
- Interactive Demo: Colab Notebook
- Stage 1 Model: hi-paris/ssml-text2breaks-fr-lora
Citation
@inproceedings{ould-ouali2025_improving,
title = {Improving Synthetic Speech Quality via SSML Prosody Control},
author = {Ould-Ouali, Nassima and Sani, Awais and Bueno, Ruben and Dauvet, Jonah and Horstmann, Tim Luka and Moulines, Eric},
booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP)},
year = {2025},
url = {https://huggingface.co/hi-paris}
}
License
Apache 2.0 License (same as the base Qwen2.5-7B model)