Uploaded model
- Developed by: jsbeaudry
- License: apache-2.0
- Finetuned from model: unsloth/csm-1b
sesame-creole-tts
This model is a fine-tuned version of unsloth/csm-1b on a mix of the jsbeaudry/creole-text-voice and jsbeaudry/cmu_haitian_creole_speech datasets.
Demo
Model Description
sesame-creole-tts is a text-to-speech (TTS) model for Haitian Creole (Kreyòl Ayisyen). It was fine-tuned on more than 5,000 curated audio-text pairs to synthesize intelligible Creole speech for use cases including education, accessibility, and conversational AI.
- Trained for: Haitian Creole Text-to-Speech
- Dataset: Over 5,000 Haitian Creole sentence-to-audio pairs
- Voice Type: Male and female synthetic and natural voices with clear articulation and a native accent
- Sampling Rate: 16 kHz
- Phonetics: Uses standardized Creole orthography with support for diacritics
- Objective: Generate natural and expressive Haitian Creole speech for daily communication, education tools, and virtual assistants
Training and evaluation data
The model was trained on the creole-text-voice dataset, which includes the following (a minimal loading sketch is shown after the list):
- 8 hours of synthetic Haitian Creole speech
- Annotated, time-aligned text transcripts following Creole orthography
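As a quick sanity check, both corpora can be loaded directly from the Hugging Face Hub. This is a minimal sketch; the split name and column names (for example a text field and an audio field) are assumptions and may differ from the actual dataset schemas.

from datasets import load_dataset

# Load the fine-tuning corpora from the Hub (the "train" split name is an assumption).
creole_tts = load_dataset("jsbeaudry/creole-text-voice", split="train")
cmu_creole = load_dataset("jsbeaudry/cmu_haitian_creole_speech", split="train")

# Inspect the available columns before building training or evaluation examples.
print(creole_tts)
print(cmu_creole)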
Model usage script
Inference console
Install packages:
pip install transformers soundfile gradio
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from IPython.display import Audio, display
import soundfile as sf # Import soundfile
model_id = "jsbeaudry/sesame-creole-tts"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
text = "[0]Bonjou tout moun koman nou ye?" # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)
audio = model.generate(**inputs, output_audio=True)
# Move the audio tensor to the CPU and convert to numpy array before saving with soundfile
audio_numpy = audio[0].to(torch.float32).cpu().numpy()
sf.write("example_without_context.wav", audio_numpy, 24000)
display(Audio(audio_numpy, rate=24000))
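The `[0]` prefix selects the speaker id; this fine-tune exposes five voices (ids 0 through 4, named in the Gradio demo below). A minimal sketch for a different speaker, reusing the objects created above:

# Speaker id 1 (listed as "Mariz" in the Gradio demo below); same pipeline as above.
text = "[1]Bonjou tout moun koman nou ye?"
inputs = processor(text, add_special_tokens=True).to(device)
audio = model.generate(**inputs, output_audio=True)
audio_numpy = audio[0].to(torch.float32).cpu().numpy()
sf.write("example_speaker_1.wav", audio_numpy, 24000)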
Inference with Gradio
Install packages:
pip install transformers soundfile gradio
import gradio as gr
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from IPython.display import Audio, display
import soundfile as sf # Import soundfile
model_id = "jsbeaudry/sesame-creole-tts"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
def text_to_speech(text, speaker_name):
    speaker_map = {
        "Aleya": 0,
        "Mariz": 1,
        "Anita": 2,
        "Sanit": 3,
        "Jak": 4,
    }
    speaker_id = speaker_map[speaker_name]

    # prepare the inputs
    inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True).to(device)

    # infer the model
    audio = model.generate(**inputs, output_audio=True)

    # Move the audio tensor to the CPU and convert to a numpy array
    audio_numpy = audio[0].to(torch.float32).cpu().numpy()
    return (24000, audio_numpy)

iface = gr.Interface(
    fn=text_to_speech,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter Haitian Creole text here..."),
        gr.Dropdown(["Aleya", "Mariz", "Anita", "Sanit", "Jak"], label="Select Speaker"),
    ],
    outputs=gr.Audio(label="Generated Audio"),
    title="Haitian Creole Text-to-Speech",
    description="Enter Haitian Creole text to generate speech using the jsbeaudry/sesame-creole-tts model. Select a speaker from the dropdown.",
)

iface.launch(debug=True)
Intended uses & limitations
- Mixed texts (Creole with French or English) may produce mispronunciations.
- Long sentences may produce unstable pronunciation; a sentence-splitting workaround is sketched after this list.
- Voice selection is not fully stable: a given speaker identifier may not produce the same tone consistently.
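One common workaround for the long-sentence instability is to split the input into sentences, synthesize each one separately, and concatenate the audio. The sketch below assumes the model, processor, and device objects from the inference examples above; the regex-based sentence split and the 0.25 s pause are simplifications, not part of the original pipeline.

import re
import numpy as np

def synthesize_long_text(text, speaker_id=0, pause_s=0.25, sample_rate=24000):
    # Split on sentence-final punctuation (a simplification for Creole text).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    silence = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    pieces = []
    for sentence in sentences:
        inputs = processor(f"[{speaker_id}]{sentence}", add_special_tokens=True).to(device)
        audio = model.generate(**inputs, output_audio=True)
        pieces.append(audio[0].to(torch.float32).cpu().numpy())
        pieces.append(silence)  # short pause between sentences
    return np.concatenate(pieces) if pieces else silence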
Training hyperparameters
The following hyperparameters were used during training (a hedged sketch of an equivalent trainer configuration follows the list):
- learning_rate: 2e-4
- train_batch_size: 2
- seed: 3407
- gradient_accumulation_steps: 4
- optim: adamw_8bit
- lr_scheduler_type: linear
- num_epochs: 3
- training_time: 4:24:03
- num_step: 4080
- Trainable parameters = 29,032,448/1,661,132,609 (1.75% trained)
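For orientation, the listed values map roughly onto a Hugging Face TrainingArguments object as sketched below. This is an illustrative reconstruction, not the actual training script (the model was fine-tuned with Unsloth); output_dir is a placeholder, and older transformers versions may expect "adamw_bnb_8bit" instead of "adamw_8bit".

from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the original script.
training_args = TrainingArguments(
    output_dir="sesame-creole-tts",  # placeholder
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    optim="adamw_8bit",  # "adamw_bnb_8bit" on older transformers versions
    seed=3407,
)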
Citation
If you use this model, please cite:
@misc{sesamecreoletts2025,
  title={sesame creole tts 11k},
  author={Jean Sauvenel Beaudry},
  year={2025},
  howpublished={\url{https://huggingface.co/jsbeaudry}}
}