Speechless

Speechless is a compact, open-source text-to-semantics model (1B parameters) that generates semantic representations of audio directly as discrete tokens, bypassing the need for a text-to-speech (TTS) model. Traditional pipelines first synthesize audio and then process it (TTS → ASR). Speechless removes that step by converting text directly into semantic speech tokens, which simplifies training, saves resources, and makes it easier to scale, especially for low-resource languages.

Trained on about 400 hours of English and about 1,000 hours of Vietnamese data, Speechless is a core component of the Ichigo v0.5 family.

For more details, check out our official blog post.

Model Summary

Developed by: Homebrew Research.

Model Architecture: Llama

Model type: Text to Semantics

Language(s): English and Vietnamese

License: Apache 2.0

Resources

Blog: Blog post

Intended Use

Intended Use Cases: This model is primarily designed for research purposes. This version focuses on generating semantic representations of audio directly as discrete tokens, eliminating the need for a text-to-speech (TTS) model.

Out-of-scope: The use of Speechless in any manner that violates applicable laws or regulations is strictly prohibited.

How to Get Started

Use the example code below to load the model and generate semantic tokens from a text prompt.

import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"

pipe = pipeline(
    "text-generation", 
    model=model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)

pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")

>>> [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
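The generated text mixes the prompt with special tokens, so downstream code usually needs to pull the tokens back out. The following is a minimal sketch of one way to do that, assuming only the two token shapes visible in the sample output above (`<|sound_NNNN|>` and `<|duration_NN|>`). The helper name `parse_semantic_tokens` is hypothetical and not part of the model's tooling, and the sketch does not attempt to interpret what a duration token applies to.

```python
import re

# Matches the two token families seen in the sample output:
# <|sound_1968|> and <|duration_02|>. This pattern is an assumption
# based on that sample, not a documented token specification.
TOKEN_RE = re.compile(r"<\|(sound|duration)_(\d+)\|>")


def parse_semantic_tokens(generated_text: str) -> list[tuple[str, int]]:
    """Return (kind, value) pairs in order of appearance.

    Hypothetical helper: prompt text and any other special tokens
    in the string are simply ignored.
    """
    return [(kind, int(value)) for kind, value in TOKEN_RE.findall(generated_text)]


# A short excerpt of the sample output shown above.
sample = "<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|>"
print(parse_semantic_tokens(sample))
```

With the pipeline call above, the input would be `pipe(...)[0]["generated_text"]`.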

Training Specs

| Parameter | Value |
|---|---|
| Epochs | 2 |
| Global Batch Size | 144 |
| Learning Rate | 3e-4 |
| Learning Rate Scheduler | Cosine |
| Optimizer | AdamW |
| Warmup Ratio | 0.05 |
| Weight Decay | 0.01 |
| Max Sequence Length | 512 |
| Clip Grad Norm | 1.0 |
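To make the learning rate settings concrete, here is a rough sketch of a cosine schedule with warmup using the peak learning rate (3e-4) and warmup ratio (0.05) from the table. The linear warmup shape, the decay down to zero, and the `total_steps` argument are assumptions made for illustration. The actual training code and step count are not given here.

```python
import math

PEAK_LR = 3e-4       # "Learning Rate" from the table
WARMUP_RATIO = 0.05  # "Warmup Ratio" from the table


def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (illustrative only).

    Assumes linear warmup over the first 5% of steps, then cosine
    decay from PEAK_LR to 0 over the remaining steps.
    """
    warmup_steps = max(1, int(total_steps * WARMUP_RATIO))
    if step < warmup_steps:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# total_steps=1000 is a hypothetical value chosen for the demo.
for s in (0, 49, 50, 525, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```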

Evaluation

  1. Vietnamese

| Model Name | Test Dataset | Test Samples | WER |
|---|---|---|---|
| Speechless v0.1 | viet_bud500 | 7500 | 3.99 |

  2. English

| Model Name | Test Dataset | Test Samples | WER |
|---|---|---|---|
| Speechless v0.1 | librispeech_asr | 2620 | 3.27 |
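For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that standard definition. It is not the evaluation script behind the numbers above, and it does not cover how the generated semantic tokens were turned back into text for scoring, which this card does not describe. The function name `word_error_rate` is an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Words are split on whitespace with no normalization (casing and
    punctuation are kept as-is), which is a simplifying assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))  # one substitution in four words
```

The function returns a fraction, so multiply by 100 if a percentage is needed.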

Citation Information

BibTeX:

@article{Speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}

Acknowledgement
