SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
Chunbo Hao1*, Ruibin Yuan2,5*, Jixun Yao1, Qixin Deng3,5,
Xinyi Bai4,5, Wei Xue2, Lei Xie1†
*Equal contribution  †Corresponding author
1Audio, Speech and Language Processing Group (ASLP@NPU),
Northwestern Polytechnical University
2Hong Kong University of Science and Technology
3Northwestern University
4Cornell University
5Multimodal Art Projection (M-A-P)
SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench to foster fair and reproducible research.
For a more detailed deployment guide, please refer to the GitHub repository.
🚀 QuickStart
Prerequisites
Before running the model, follow the instructions in the GitHub repository to set up the required Python environment.
Input: Audio File Path
You can perform inference by providing the path to an audio file:
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from the Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to the import path and set the environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set the device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```
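The returned `result` is the list of labeled segments described under Output Format below. As a quick sanity check you can print a simple timeline; this is a minimal sketch that assumes each segment is dict-like with `start`, `end`, and `label` keys, as the Output Format section suggests:

```python
# Print one line per predicted segment: start, end, and label
for seg in result:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['label']}")
```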
Input: Tensor or NumPy Array
Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download the model
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Set up the environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure the device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate a dummy audio input: 60 seconds at a 24,000 Hz sampling rate
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```
⚠️ Note: The expected sampling rate for input audio is 24,000 Hz.
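If your source audio is stored at a different rate, you can resample it while loading. Below is a minimal sketch using librosa, which is not used by the examples above and would need to be installed separately:

```python
import librosa

# Load mono audio resampled to the expected 24,000 Hz
audio, sr = librosa.load("path/to/audio/file.wav", sr=24000, mono=True)

result = songformer(audio)
```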
Output Format
The model returns a structured list of segment predictions, with each entry containing timing and label information:
```json
[
  {
    "start": 0.0,     // Start time of the segment (in seconds)
    "end": 15.2,      // End time of the segment (in seconds)
    "label": "verse"  // Predicted segment label
  },
  ...
]
```
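Segments in this format are straightforward to persist for downstream tools. As an illustrative sketch (not part of the official API), the snippet below writes one tab-separated row per segment, assuming `result` holds the list shown above:

```python
import csv

# Write segments as start<TAB>end<TAB>label, one row per segment
with open("segments.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for seg in result:
        writer.writerow([f"{seg['start']:.2f}", f"{seg['end']:.2f}", seg["label"]])
```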
🔧 Notes
- The initialization logic of MusicFM has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.
📖 Citation
If you use SongFormer in your research or application, please cite our work:
```bibtex
@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}
```