SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision


Chunbo Hao1*, Ruibin Yuan2,5*, Jixun Yao1, Qixin Deng3,5,
Xinyi Bai4,5, Wei Xue2, Lei Xie1†

*Equal contribution    †Corresponding author

1Audio, Speech and Language Processing Group (ASLP@NPU),
Northwestern Polytechnical University
2Hong Kong University of Science and Technology
3Northwestern University
4Cornell University
5Multimodal Art Projection (M-A-P)


SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench, which together foster fair and reproducible research.

For a more detailed deployment guide, please refer to the GitHub repository.

πŸš€ QuickStart

Prerequisites

Before running the model, follow the instructions in the GitHub repository to set up the required Python environment.


Input: Audio File Path

You can perform inference by providing the path to an audio file:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to path and set environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```
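
The returned `result` can be serialized directly, for example to a JSON file. A minimal sketch (a dummy segment list stands in for real model output here; the segment fields are described under Output Format below):

```python
import json

# `result` is a list of dicts with "start", "end", and "label" keys;
# a dummy example stands in for real model output in this sketch.
result = [{"start": 0.0, "end": 15.2, "label": "verse"}]

# Write the predicted structure to disk for later use
with open("structure.json", "w") as f:
    json.dump(result, f, indent=2)
```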

Input: Tensor or NumPy Array

Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download model
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Setup environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure device
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate dummy audio input (sampling rate: 24,000 Hz; here, 60 seconds of audio)
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```

⚠️ Note: The expected sampling rate for input audio is 24,000 Hz.
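
If your source audio is at a different rate, resample it to 24 kHz before inference. A minimal NumPy-only sketch using linear interpolation (`resample_linear` is a hypothetical helper, not part of SongFormer; for production use, prefer a dedicated resampler such as `librosa.resample` or `torchaudio.transforms.Resample`):

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=24000):
    """Resample a 1-D waveform via linear interpolation (illustrative only)."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Map both grids onto [0, 1) and interpolate the new sample positions
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(np.float32)

# One second at 44.1 kHz becomes 24,000 samples at 24 kHz
audio_44k = np.random.randn(44100).astype(np.float32)
audio_24k = resample_linear(audio_44k, orig_sr=44100)
```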


Output Format

The model returns a structured list of segment predictions, with each entry containing timing and label information:

```json
[
  {
    "start": 0.0,          // Start time of segment (in seconds)
    "end": 15.2,           // End time of segment (in seconds)
    "label": "verse"       // Predicted segment label
  },
  ...
]
```
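
For downstream tools that expect plain-text annotations, the segment list can be flattened into tab-separated lines. A small sketch (`format_segments` is a hypothetical helper, not part of the SongFormer API):

```python
def format_segments(segments):
    """Render segment dicts as "start<TAB>end<TAB>label" lines (illustrative only)."""
    return "\n".join(
        f"{seg['start']:.2f}\t{seg['end']:.2f}\t{seg['label']}" for seg in segments
    )

print(format_segments([{"start": 0.0, "end": 15.2, "label": "verse"}]))
```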

πŸ”§ Notes

  • The initialization logic of MusicFM has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.

πŸ“š Citation

If you use SongFormer in your research or application, please cite our work:

@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}