# SPIRIT-LM Expressive Interleaved (Corrected Teacher, Libri-Light)
SPIRIT-LM Expressive Interleaved (Corrected) is a fine-tuned version of the 7B SPIRIT-LM teacher model adapted to the Libri-Light domain. It supports interleaved speech and text inputs, and was used as the teacher model for distilling TinyWave.
This checkpoint was fine-tuned for 10k steps with LoRA adapters on synthetic interleaved data created from Libri-Light and Whisper transcriptions. The resulting model improves alignment with the target distribution and provides stronger supervision for expressive speech-text generation.

This checkpoint is part of the TinyWave distillation framework. See arXiv:2506.23670 for details.
## Model Purpose
| Field | Value |
|---|---|
| Role | Distillation teacher |
| Base model | `spirit-lm-expressive-7b` (SPIRIT-LM) |
| Fine-tuned on | Libri-Light (10k steps with LoRA) |
| Input modalities | Interleaved speech + text |
| Output | Speech tokens |
| Used for | Training `tinywave/interleaved-expressive-2b` |
## Usage
### 1. Install SPIRIT-LM and Load the Expressive Tokenizer

```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```

```python
from spiritlm.speech_tokenizer import spiritlm_expressive

speech_tokenizer = spiritlm_expressive()
```
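
A quick sanity check is to encode a short audio clip into SPIRIT-LM expressive token strings. This is a minimal sketch assuming `encode_string` accepts a path to a mono 16 kHz recording, as in the spiritlm examples; the file path is a placeholder.

```python
# Encode a local audio file into expressive unit strings (path is a placeholder).
# In the spiritlm examples the output is a string of interleaved [Hu*]/[Pi*]/[St*]
# tokens; inspecting it confirms the tokenizer loaded correctly.
token_string = speech_tokenizer.encode_string("examples/audio/sample.wav")
print(token_string[:200])
```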
### 2. Inference (Speech or Interleaved)

```python
import torch
import torchaudio
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive

MODEL_PATH = "tinywave/expressive-spirit-lm-interleaved-librilight"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Expressive speech tokenizer for interleaved speech input
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    # Encode the prompt audio into SPIRIT-LM expressive speech tokens
    audio, _ = torchaudio.load(audio_path)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])

def get_inference_text(prompt):
    # Append the [Speech] marker so the model continues in the speech modality
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])
```
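
A minimal usage sketch for the two helpers above; the audio path is a placeholder and both calls return SPIRIT-LM token strings rather than audio:

```python
# Speech continuation from a prompt recording (placeholder path)
continuation_tokens = get_inference("prompt_audio.wav")

# Text prompt steered into the speech modality
spoken_story_tokens = get_inference_text("The astronaut stepped outside the capsule")

print(continuation_tokens[:200])
print(spoken_story_tokens[:200])
```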
## Inference Modes

### Text + Speech Interleaving

Input: `"The astronaut stepped outside the capsule... [Speech]"`

Output: Expressive speech continuation as SPIRIT-LM speech tokens, decodable to WAV with the expressive tokenizer.

### Speech Continuation

Input: `speech.wav`

Output: Semantically and stylistically aligned spoken continuation.
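
Both modes return a token string rather than a waveform. Below is a minimal resynthesis sketch; it assumes the spiritlm expressive tokenizer's `decode` method accepts a unit string and returns a 16 kHz waveform, as in the spiritlm examples, and that the prompt portion has been stripped from the generated string. `prompt_audio.wav` and `generated.wav` are placeholder paths; check the spiritlm repository for the exact `decode` signature.

```python
import soundfile as sf

# Resynthesize audio from generated SPIRIT-LM expressive units.
# ASSUMPTION: speech_tokenizer.decode(...) accepts a unit string and returns a
# waveform array at 16 kHz; verify against the spiritlm repository.
generated_units = get_inference("prompt_audio.wav")  # placeholder prompt path
waveform = speech_tokenizer.decode(generated_units)
sf.write("generated.wav", waveform, 16000)
```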
## Files

- `pytorch_model.bin`: LoRA-adapted SPIRIT-LM 7B weights
- `config.json`, `tokenizer.json`: compatible with Hugging Face Transformers
- Compatible with the `spiritlm_expressive` tokenizer only
## Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```
## Related

- Paper: arXiv:2506.23670
- Student model: `tinywave/interleaved-expressive-2b`
- Project Website