
SPIRIT-LM Expressive Interleaved (Corrected Teacher, Libri-Light)

SPIRIT-LM Expressive Interleaved (Corrected) is a fine-tuned version of the 7B SPIRIT-LM Expressive model, adapted to the Libri-Light domain. It accepts interleaved speech and text inputs and was used as the teacher for distilling TinyWave.

This checkpoint was fine-tuned for 10k steps with LoRA adapters on synthetic interleaved data created from Libri-Light and Whisper transcriptions. The resulting model improves alignment with the target distribution and provides stronger supervision for expressive speech–text generation.
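For readers unfamiliar with LoRA fine-tuning, the sketch below shows the general shape of such a setup with the PEFT library. The rank, alpha, target modules, and base-model path are illustrative assumptions, not the hyperparameters actually used for this checkpoint.

# Illustrative LoRA setup only: rank, alpha, target modules, and the base-model
# path are assumptions, not the configuration used for this checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

base = LlamaForCausalLM.from_pretrained("path/to/spirit-lm-expressive-7b")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()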

📖 This checkpoint is part of the TinyWave distillation framework. See arXiv:2506.23670 for details.


🧠 Model Purpose

  • Role: Distillation teacher
  • Base model: spirit-lm-expressive-7b (SPIRIT-LM)
  • Fine-tuned on: Libri-Light (10k steps with LoRA)
  • Input modalities: Interleaved speech + text
  • Output: Speech tokens
  • Used for: Training tinywave/interleaved-expressive-2b
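For orientation, an interleaved prompt mixes plain text with the expressive tokenizer's speech tokens (HuBERT, pitch, and style units). The token values below are invented for illustration; real sequences come from spiritlm_expressive(), as shown in the Usage section.

# Illustration only: the [Hu*]/[Pi*]/[St*] values are made up; real tokens
# come from the expressive speech tokenizer loaded in the Usage section.
interleaved_prompt = (
    "[Text]The astronaut stepped outside the capsule"
    "[Speech][St3][Pi12][Hu99][Hu49][Pi7][Hu38]"
)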

🔧 Usage

1. Install SPIRIT-LM and Load Expressive Tokenizer

git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'

# Load the expressive speech tokenizer in Python
from spiritlm.speech_tokenizer import spiritlm_expressive
speech_tokenizer = spiritlm_expressive()
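As a quick sanity check (the audio path is a placeholder), the expressive tokenizer can be used on its own to turn a waveform into the token string the model consumes, following the examples in the SPIRIT-LM repository:

# "sample.wav" is a placeholder path; encode_string follows the SPIRIT-LM repo
# examples and returns a string of [St*]/[Pi*]/[Hu*] tokens.
token_string = speech_tokenizer.encode_string("sample.wav")
print(token_string[:200])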

2. Inference (Speech or Interleaved)

from transformers import LlamaForCausalLM, AutoTokenizer
import torchaudio
import torch

MODEL_PATH = "tinywave/expressive-spirit-lm-interleaved-librilight"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Expressive speech tokenizer used to encode prompt audio into SPIRIT-LM tokens
from spiritlm.speech_tokenizer import spiritlm_expressive
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    # Load the prompt audio and move it onto the tokenizer's device
    audio, _ = torchaudio.load(audio_path)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    # Encode the waveform into a SPIRIT-LM token string and tokenize it for the LM
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    # Sample a speech continuation
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])

def get_inference_text(prompt):
    # The trailing [Speech] marker asks the model to continue in the speech modality
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])
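A minimal driver for the two helpers above; the audio path and prompt are placeholders.

# Placeholders: replace with your own prompt audio / text.
speech_continuation = get_inference("prompt_speech.wav")
mixed_continuation = get_inference_text("The astronaut stepped outside the capsule")
print(mixed_continuation)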

🎧 Inference Modes

💬 Text + Speech Interleaving

Input:

"The astronaut stepped outside the capsule... [Speech]"

Output: Expressive speech continuation (speech tokens; decode with the expressive tokenizer to obtain a WAV).


🔄 Speech Continuation

Input: speech.wav
Output: Semantically and stylistically aligned spoken continuation.
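Both modes return SPIRIT-LM token strings rather than audio; producing a WAV requires the expressive tokenizer's decoder. The sketch below follows the decode call shown in the SPIRIT-LM repository examples; the exact arguments (for example speaker_id) and output sample rate may differ in your installed version.

# Sketch only: speech_tokenizer.decode follows the SPIRIT-LM repo examples;
# check your installed version for the exact signature and sample rate.
generated = get_inference_text("The astronaut stepped outside the capsule")
speech_part = generated.split("[Speech]", 1)[-1]          # keep only the tokens after the [Speech] marker
wav = speech_tokenizer.decode(speech_part)                # assumed to return a 1-D waveform array
torchaudio.save("continuation.wav", torch.tensor(wav).unsqueeze(0), 16000)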


📂 Files

  • pytorch_model.bin: LoRA-adapted SPIRIT-LM 7B weights
  • config.json, tokenizer.json: Compatible with Hugging Face Transformers
  • Compatible with the spiritlm_expressive tokenizer only

📎 Citation

@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}

🔗 Related

  • tinywave/interleaved-expressive-2b: student model distilled from this teacher
  • TinyWave paper: arXiv:2506.23670
