Uploaded model
- Developed by: jsbeaudry
- License: apache-2.0
- Finetuned from model: unsloth/csm-1b
sesame-creole-tts
This model is a fine-tuned version of unsloth/csm-1b on a mix of the jsbeaudry/creole-text-voice and jsbeaudry/cmu_haitian_creole_speech datasets.
Demo
Model Description
sesame-creole-tts is a text-to-speech (TTS) model for Haitian Creole (Kreyòl Ayisyen). It was fine-tuned on more than 5,000 curated audio-text pairs to synthesize intelligible Creole speech for use cases including education, accessibility, and conversational AI.
- Trained for: Haitian Creole Text-to-Speech
- Dataset: Over 5,000 Haitian Creole sentence-to-audio pairs
- Voice Type: Male and female synthetic and natural voices with clear articulation and a native accent
- Sampling Rate: 16 kHz
- Phonetics: Uses standardized Creole orthography with support for diacritics
- Objective: Generate natural and expressive Haitian Creole speech for daily communication, education tools, and virtual assistants
Training and evaluation data
The model was trained on the creole-text-voice dataset, which includes the following (a minimal loading sketch is shown after the list):
- 8 hours of synthetic Haitian Creole speech
- Annotated, time-aligned text transcripts following Creole orthography
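As a quick sanity check, both corpora can be loaded directly from the Hugging Face Hub. This is a minimal sketch; the split name and column names (for example a text field and an audio field) are assumptions and may differ from the actual dataset schemas.

from datasets import load_dataset

# Load the fine-tuning corpora from the Hub (the "train" split name is an assumption).
creole_tts = load_dataset("jsbeaudry/creole-text-voice", split="train")
cmu_creole = load_dataset("jsbeaudry/cmu_haitian_creole_speech", split="train")

# Inspect the available columns before building training or evaluation examples.
print(creole_tts)
print(cmu_creole)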
Model usage script
Inference console
Install packages:
pip install transformers soundfile gradio
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from IPython.display import Audio, display
import soundfile as sf # Import soundfile
model_id = "jsbeaudry/sesame-creole-tts"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
text = "[0]Bonjou tout moun koman nou ye?" # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)
audio = model.generate(**inputs, output_audio=True)
# Move the audio tensor to the CPU and convert to numpy array before saving with soundfile
audio_numpy = audio[0].to(torch.float32).cpu().numpy()
sf.write("example_without_context.wav", audio_numpy, 24000)
display(Audio(audio_numpy, rate=24000))
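The `[0]` prefix selects the speaker id; this fine-tune exposes five voices (ids 0 through 4, named in the Gradio demo below). A minimal sketch for a different speaker, reusing the objects created above:

# Speaker id 1 (listed as "Mariz" in the Gradio demo below); same pipeline as above.
text = "[1]Bonjou tout moun koman nou ye?"
inputs = processor(text, add_special_tokens=True).to(device)
audio = model.generate(**inputs, output_audio=True)
audio_numpy = audio[0].to(torch.float32).cpu().numpy()
sf.write("example_speaker_1.wav", audio_numpy, 24000)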
Inference with Gradio
Install packages:
pip install transformers soundfile gradio
import gradio as gr
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from IPython.display import Audio, display
import soundfile as sf # Import soundfile
model_id = "jsbeaudry/sesame-creole-tts"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
def text_to_speech(text, speaker_name):
    speaker_map = {
        "Aleya": 0,
        "Mariz": 1,
        "Anita": 2,
        "Sanit": 3,
        "Jak": 4,
    }
    speaker_id = speaker_map[speaker_name]

    # prepare the inputs
    inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True).to(device)

    # infer the model
    audio = model.generate(**inputs, output_audio=True)

    # Move the audio tensor to the CPU and convert to a numpy array
    audio_numpy = audio[0].to(torch.float32).cpu().numpy()
    return (24000, audio_numpy)

iface = gr.Interface(
    fn=text_to_speech,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter Haitian Creole text here..."),
        gr.Dropdown(["Aleya", "Mariz", "Anita", "Sanit", "Jak"], label="Select Speaker"),
    ],
    outputs=gr.Audio(label="Generated Audio"),
    title="Haitian Creole Text-to-Speech",
    description="Enter Haitian Creole text to generate speech using the jsbeaudry/sesame-creole-tts model. Select a speaker from the dropdown.",
)

iface.launch(debug=True)
Intended uses & limitations
- Mixed texts (Creole with French or English) may produce mispronunciations.
- Long sentences may produce unstable pronunciation; a sentence-splitting workaround is sketched after this list.
- Voice selection is not fully stable: a given speaker identifier may not produce the same tone consistently.
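One common workaround for the long-sentence instability is to split the input into sentences, synthesize each one separately, and concatenate the audio. The sketch below assumes the model, processor, and device objects from the inference examples above; the regex-based sentence split and the 0.25 s pause are simplifications, not part of the original pipeline.

import re
import numpy as np

def synthesize_long_text(text, speaker_id=0, pause_s=0.25, sample_rate=24000):
    # Split on sentence-final punctuation (a simplification for Creole text).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    silence = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    pieces = []
    for sentence in sentences:
        inputs = processor(f"[{speaker_id}]{sentence}", add_special_tokens=True).to(device)
        audio = model.generate(**inputs, output_audio=True)
        pieces.append(audio[0].to(torch.float32).cpu().numpy())
        pieces.append(silence)  # short pause between sentences
    return np.concatenate(pieces) if pieces else silence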
Training hyperparameters
The following hyperparameters were used during training (a hedged sketch of an equivalent trainer configuration follows the list):
- learning_rate: 2e-4
- train_batch_size: 2
- seed: 3407
- gradient_accumulation_steps: 4
- optim: adamw_8bit
- lr_scheduler_type: linear
- num_epochs: 3
- training_time: 4:24:03
- num_step: 4080
- Trainable parameters = 29,032,448/1,661,132,609 (1.75% trained)
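For orientation, the listed values map roughly onto a Hugging Face TrainingArguments object as sketched below. This is an illustrative reconstruction, not the actual training script (the model was fine-tuned with Unsloth); output_dir is a placeholder, and older transformers versions may expect "adamw_bnb_8bit" instead of "adamw_8bit".

from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the original script.
training_args = TrainingArguments(
    output_dir="sesame-creole-tts",  # placeholder
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    optim="adamw_8bit",  # "adamw_bnb_8bit" on older transformers versions
    seed=3407,
)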
Citation
If you use this model, please cite:
@misc{sesamecreoletts2025,
  title={sesame creole tts 11k},
  author={Jean Sauvenel Beaudry},
  year={2025},
  howpublished={\url{https://huggingface.co/jsbeaudry}}
}