MuASR-3B

An ASR model for music. This is the public checkpoint.

Features:

Captions music with Suno-style tags
Transcribes lyrics into verses and sections, with annotations (e.g. [Intro], [Verse 1], [Chorus], [Outro])

Limitations:

Hallucinations

Usage

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

# Prefer a CUDA GPU, then Apple Silicon (MPS), and fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
repo_id = "mrfakename/MuASR-3B"

# Load the processor and the model weights in bfloat16 on the chosen device.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# Build a transcription request for an audio file (replace the path with your own).
inputs = processor.apply_transcription_request(language="en", audio="assets/song_full.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

# Sampling is enabled, so the output can differ between runs.
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True
)
# Drop the prompt tokens and decode only the newly generated ones.
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
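Since the transcript is annotated with section labels such as [Intro] and [Verse 1], it can be useful to split the decoded text into sections before further processing. The following is a minimal sketch, not part of the model's API: it assumes each label sits on its own line, the `split_sections` helper and the "preamble" label are hypothetical names, and the sample text is made up rather than real model output.

```python
import re

# Matches a line consisting only of a bracketed section label, e.g. "[Verse 1]".
SECTION_HEADER = re.compile(r"^\s*\[([^\[\]]+)\]\s*$")


def split_sections(transcript: str) -> list[tuple[str, list[str]]]:
    """Group lyric lines under the most recent [Section] header.

    Lines that appear before any header (for example a tag line) are
    collected under the hypothetical label "preamble".
    """
    sections = []
    label, lines = "preamble", []
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SECTION_HEADER.match(line)
        if match:
            # Close the previous section; an empty preamble is not emitted.
            if lines or label != "preamble":
                sections.append((label, lines))
            label, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if lines or label != "preamble":
        sections.append((label, lines))
    return sections


# Made-up sample text, NOT real model output.
sample = """pop, upbeat, female vocals
[Intro]
[Verse 1]
Walking down the road
Sun is in my eyes
[Chorus]
Here we go again
"""

for name, body in split_sections(sample):
    print(f"{name}: {len(body)} line(s)")
```

In practice you would pass each `decoded_output` string to `split_sections`. Because the model can hallucinate, the labels and lyrics should be checked against the audio and not trusted as-is.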

The model has a serious hallucination problem at the moment; it is just a proof of concept (PoC) for now.

Please reach out to me (realmrfakename on Discord) if you are planning to use this model 🙂

Safetensors

Model size: 684k params
Tensor type: BF16