You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

MuASR-3B

An automatic speech recognition (ASR) model for music. This is the public checkpoint.

Features:

  • Captions the music with tags (Suno-style)
  • Transcribes lyrics into verses and sections, with annotations (e.g. [Intro], [Verse 1], [Chorus], [Outro])
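
The section annotations make the transcript easy to post-process. The snippet below is a minimal sketch of grouping lyric lines under their section tags. It assumes each tag such as [Verse 1] sits on its own line, which this card does not guarantee, and `split_sections` is a hypothetical helper name, not part of the model or of transformers.

```python
import re

# Matches a line that holds only a bracketed section tag, e.g. "[Verse 1]".
# Assumption: the model emits each tag on its own line.
SECTION_TAG = re.compile(r"^\[([^\[\]]+)\]$")


def split_sections(transcript):
    """Group lyric lines under the most recent section tag.

    Returns a list of (section_name, lines) tuples in order of appearance.
    Text that appears before the first tag is stored under the name "".
    """
    sections = []
    current = None
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate verses
        match = SECTION_TAG.match(line)
        if match:
            # Start a new section named after the tag contents.
            current = (match.group(1), [])
            sections.append(current)
        elif current is None:
            # Lyrics before any tag go into an unnamed section.
            current = ("", [line])
            sections.append(current)
        else:
            current[1].append(line)
    return sections


# Hand-written sample text, not real model output.
sample = "[Intro]\n\n[Verse 1]\nline one\nline two\n\n[Chorus]\nsing it\n"
for name, lines in split_sections(sample):
    print(name, lines)
```

Because the model can hallucinate, the parsed sections should still be checked against the audio before they are relied on.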

Limitations:

  • Hallucinations

Usage

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

# Pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
repo_id = "mrfakename/MuASR-3B"

# Load the processor and the model weights in bfloat16.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# Build a transcription request from a local audio file (replace the path with your own).
inputs = processor.apply_transcription_request(language="en", audio="assets/song_full.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

# Sampling is enabled, so the output can differ between runs.
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True
)
# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)

The model has a serious hallucination problem at the moment, so treat it as a proof of concept for now.

Please reach out to me (realmrfakename on Discord) if you are planning to use this model 🙂

Format: Safetensors
Model size: 684k params
Tensor type: BF16