# SPIRIT-LM Expressive Interleaved (Corrected Teacher, Libri-Light)
SPIRIT-LM Expressive Interleaved (Corrected) is a fine-tuned version of the 7B SPIRIT-LM teacher model adapted to the Libri-Light domain. It supports interleaved speech and text inputs, and was used as the teacher model for distilling TinyWave.
This checkpoint was fine-tuned for 10k steps with LoRA adapters on synthetic interleaved data created from Libri-Light and Whisper transcriptions. The resulting model improves alignment with the target distribution and provides stronger supervision for expressive speech-text generation.

This checkpoint is part of the TinyWave distillation framework. See arXiv:2506.23670 for details.
## Model Purpose
| Field | Value |
|---|---|
| Role | Distillation teacher |
| Base model | `spirit-lm-expressive-7b` (SPIRIT-LM) |
| Fine-tuned on | Libri-Light (10k steps with LoRA) |
| Input modalities | Interleaved speech + text |
| Output | Speech tokens |
| Used for | Training `tinywave/interleaved-expressive-2b` |
## Usage
### 1. Install SPIRIT-LM and Load the Expressive Tokenizer

```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```

```python
from spiritlm.speech_tokenizer import spiritlm_expressive

speech_tokenizer = spiritlm_expressive()
```
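
A quick sanity check is to encode a short audio clip into SPIRIT-LM expressive token strings. This is a minimal sketch assuming `encode_string` accepts a path to a mono 16 kHz recording, as in the spiritlm examples; the file path is a placeholder.

```python
# Encode a local audio file into expressive unit strings (path is a placeholder).
# In the spiritlm examples the output is a string of interleaved [Hu*]/[Pi*]/[St*]
# tokens; inspecting it confirms the tokenizer loaded correctly.
token_string = speech_tokenizer.encode_string("examples/audio/sample.wav")
print(token_string[:200])
```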
### 2. Inference (Speech or Interleaved)

```python
import torch
import torchaudio
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive

MODEL_PATH = "tinywave/expressive-spirit-lm-interleaved-librilight"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Expressive speech tokenizer for interleaved speech input
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    # Encode the prompt audio into SPIRIT-LM expressive speech tokens
    audio, _ = torchaudio.load(audio_path)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])

def get_inference_text(prompt):
    # Append the [Speech] marker so the model continues in the speech modality
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])
```
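
A minimal usage sketch for the two helpers above; the audio path is a placeholder and both calls return SPIRIT-LM token strings rather than audio:

```python
# Speech continuation from a prompt recording (placeholder path)
continuation_tokens = get_inference("prompt_audio.wav")

# Text prompt steered into the speech modality
spoken_story_tokens = get_inference_text("The astronaut stepped outside the capsule")

print(continuation_tokens[:200])
print(spoken_story_tokens[:200])
```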
## Inference Modes

### Text + Speech Interleaving

Input: `"The astronaut stepped outside the capsule... [Speech]"`

Output: Expressive speech continuation as SPIRIT-LM speech tokens, decodable to WAV with the expressive tokenizer.

### Speech Continuation

Input: `speech.wav`

Output: Semantically and stylistically aligned spoken continuation.
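
Both modes return a token string rather than a waveform. Below is a minimal resynthesis sketch; it assumes the spiritlm expressive tokenizer's `decode` method accepts a unit string and returns a 16 kHz waveform, as in the spiritlm examples, and that the prompt portion has been stripped from the generated string. `prompt_audio.wav` and `generated.wav` are placeholder paths; check the spiritlm repository for the exact `decode` signature.

```python
import soundfile as sf

# Resynthesize audio from generated SPIRIT-LM expressive units.
# ASSUMPTION: speech_tokenizer.decode(...) accepts a unit string and returns a
# waveform array at 16 kHz; verify against the spiritlm repository.
generated_units = get_inference("prompt_audio.wav")  # placeholder prompt path
waveform = speech_tokenizer.decode(generated_units)
sf.write("generated.wav", waveform, 16000)
```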
## Files

- `pytorch_model.bin`: LoRA-adapted SPIRIT-LM 7B weights
- `config.json`, `tokenizer.json`: compatible with Hugging Face Transformers
- Compatible with the `spiritlm_expressive` tokenizer only
## Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```
## Related

- Paper: arXiv:2506.23670
- Student model: `tinywave/interleaved-expressive-2b`
- Project Website