MuASR-3B
An ASR model for music. This is the public checkpoint.
Features:
- Captions the music with tags (Suno-style)
- Transcribes lyrics into verses and sections, with annotations (e.g. [Intro], [Verse 1], [Chorus], [Outro])
Limitations:
- Hallucinations: the output may contain lyrics or tags that are not present in the audio
Usage
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

# Pick the best available device: CUDA GPU, then Apple MPS, then CPU.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
repo_id = "mrfakename/MuASR-3B"

# Load the processor and the model weights in bfloat16.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# Build a transcription request for a local audio file.
inputs = processor.apply_transcription_request(language="en", audio="assets/song_full.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
The model has a serious hallucination problem at the moment, so treat it as a proof of concept for now.
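Since the transcription is annotated with section markers such as [Intro] or [Verse 1], it can be handy to group the lyric lines by section before reviewing them for hallucinations. The following is a minimal sketch, not part of the model or its API. It assumes each annotation sits on its own line in square brackets, which may not match the model's real output exactly. The helper `split_sections` and the sample string are hypothetical.

```python
import re

# Assumption: a section annotation is a full line of the form "[Name]".
SECTION_RE = re.compile(r"^\[(?P<name>[^\[\]]+)\]\s*$")


def split_sections(transcript: str) -> list[tuple[str, list[str]]]:
    """Group lyric lines under the most recent [Section] annotation."""
    sections = []
    current_name = None
    current_lines = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between sections
        match = SECTION_RE.match(line)
        if match:
            # Close the previous section (or any unlabeled leading lines).
            if current_name is not None or current_lines:
                sections.append((current_name or "Unlabeled", current_lines))
            current_name = match.group("name").strip()
            current_lines = []
        else:
            current_lines.append(line)
    # Flush the final section.
    if current_name is not None or current_lines:
        sections.append((current_name or "Unlabeled", current_lines))
    return sections


# Hypothetical sample text, not real model output.
example = "[Intro]\nla la la\n\n[Verse 1]\nline one\nline two\n[Chorus]\n"
sections = split_sections(example)
for name, lines in sections:
    print(name, lines)
```

With the usage code above, the same helper could be applied to each entry of `decoded_outputs`. Lines that appear before the first annotation are kept under an "Unlabeled" section rather than dropped.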
Please reach out to me (realmrfakename on Discord) if you are planning to use this model.