SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
Chunbo Hao1*, Ruibin Yuan2,5*, Jixun Yao1, Qixin Deng3,5,
Xinyi Bai4,5, Wei Xue2, Lei Xie1†
*Equal contribution  †Corresponding author
1Audio, Speech and Language Processing Group (ASLP@NPU),
Northwestern Polytechnical University
2Hong Kong University of Science and Technology
3Northwestern University
4Cornell University
5Multimodal Art Projection (M-A-P)
SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench to foster fair and reproducible research.
For a more detailed deployment guide, please refer to the GitHub repository.
🚀 QuickStart
Prerequisites
Before running the model, follow the instructions in the GitHub repository to set up the required Python environment.
Input: Audio File Path
You can perform inference by providing the path to an audio file:
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from the Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to the import path and set the environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set the device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```
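The returned `result` is the list of labeled segments described under Output Format below. As a quick sanity check you can print a simple timeline; this is a minimal sketch that assumes each segment is dict-like with `start`, `end`, and `label` keys, as the Output Format section suggests:

```python
# Print one line per predicted segment: start, end, and label
for seg in result:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['label']}")
```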
Input: Tensor or NumPy Array
Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download the model
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Set up the environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure the device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate a dummy audio input: 60 seconds at a 24,000 Hz sampling rate
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```
⚠️ Note: The expected sampling rate for input audio is 24,000 Hz.
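If your source audio is stored at a different rate, you can resample it while loading. Below is a minimal sketch using librosa, which is not used by the examples above and would need to be installed separately:

```python
import librosa

# Load mono audio resampled to the expected 24,000 Hz
audio, sr = librosa.load("path/to/audio/file.wav", sr=24000, mono=True)

result = songformer(audio)
```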
Output Format
The model returns a structured list of segment predictions, with each entry containing timing and label information:
```json
[
  {
    "start": 0.0,     // Start time of the segment (in seconds)
    "end": 15.2,      // End time of the segment (in seconds)
    "label": "verse"  // Predicted segment label
  },
  ...
]
```
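Segments in this format are straightforward to persist for downstream tools. As an illustrative sketch (not part of the official API), the snippet below writes one tab-separated row per segment, assuming `result` holds the list shown above:

```python
import csv

# Write segments as start<TAB>end<TAB>label, one row per segment
with open("segments.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for seg in result:
        writer.writerow([f"{seg['start']:.2f}", f"{seg['end']:.2f}", seg["label"]])
```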
🔧 Notes
- The initialization logic of MusicFM has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.
📖 Citation
If you use SongFormer in your research or application, please cite our work:
```bibtex
@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}
```