CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning

πŸ“‘ Read the full paper (to be presented at ISMIR 2025)

CultureMERT-TA-95M is a 95M-parameter music foundation model adapted to diverse musical cultures through task arithmetic. Instead of direct continual pre-training on a multi-cultural mixture, as in CultureMERT-95M, this model merges multiple single-culture adapted variants of MERT-v1-95M, each continually pre-trained via our two-stage strategy on a distinct musical tradition:

| Dataset | Music Tradition | Hours Used |
|---|---|---|
| Lyra | Greek traditional/folk | 50h |
| Turkish-makam | Turkish/Ottoman classical | 200h |
| Hindustani | North Indian classical | 200h |
| Carnatic | South Indian classical | 200h |

🧪 The final model was merged using a scaling factor of λ = 0.2, which yielded the best overall performance across all task arithmetic variants evaluated.

πŸ”€ This model serves as an alternative to CultureMERT-95M. It merges culturally specialized models in weight space via task arithmetic to form a unified multi-cultural model. Each single-culture adapted model is obtained using the same two-stage continual pre-training strategy as CultureMERT-95M, applied separately to each musical tradition prior to merging.
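
For illustration, the task-arithmetic merge amounts to adding a scaled sum of culture-specific task vectors (adapted weights minus base weights) to the MERT-v1-95M weights. The sketch below shows this in PyTorch; the single-culture checkpoint names are hypothetical placeholders, and this is not the exact merging script used to produce this release.

# Minimal sketch of weight-space task arithmetic (illustrative; checkpoint names are placeholders)
from transformers import AutoModel

base_name = "m-a-p/MERT-v1-95M"
adapted_names = ["lyra-adapted", "makam-adapted", "hindustani-adapted", "carnatic-adapted"]  # hypothetical
lam = 0.2  # scaling factor reported above

base_model = AutoModel.from_pretrained(base_name, trust_remote_code=True)
base_state = base_model.state_dict()
merged_state = {k: v.clone() for k, v in base_state.items()}

for name in adapted_names:
    adapted_state = AutoModel.from_pretrained(name, trust_remote_code=True).state_dict()
    for k, v in base_state.items():
        if v.is_floating_point():
            # Task vector = adapted weights minus base weights, scaled by lambda and accumulated
            merged_state[k] += lam * (adapted_state[k] - v)

# base_model now holds the merged multi-cultural weights
base_model.load_state_dict(merged_state)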


πŸ“Š Evaluation

We follow the same evaluation protocol as CultureMERT-95M and report results in comparison to both it and MERT-v1-95M:

ROC-AUC / mAP

| Model | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | Avg. |
|---|---|---|---|---|---|---|---|
| MERT-v1-95M | 83.2% / 53.3% | 82.4% / 52.9% | 74.9% / 39.7% | 85.7% / 56.5% | 90.7% / 48.1% | 89.6% / 35.9% | 66.1% |
| CultureMERT-95M | 89.6% / 60.6% | 88.2% / 63.5% | 79.2% / 43.1% | 86.9% / 56.7% | 90.7% / 48.1% | 89.4% / 35.9% | 69.3% |
| CultureMERT-TA-95M | 89.0% / 61.0% | 87.5% / 59.3% | 79.1% / 43.3% | 87.3% / 57.3% | 90.8% / 49.1% | 89.6% / 36.4% | 69.1% |

Micro-F1 / Macro-F1

| Model | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | Avg. |
|---|---|---|---|---|---|---|---|
| MERT-v1-95M | 73.0% / 38.9% | 71.1% / 33.2% | 80.1% / 30.0% | 72.4% / 42.6% | 57.0% / 36.9% | 35.7% / 21.2% | 49.3% |
| CultureMERT-95M | 77.4% / 45.8% | 77.8% / 50.4% | 82.7% / 32.5% | 73.1% / 43.1% | 58.3% / 36.6% | 35.6% / 22.9% | 52.9% |
| CultureMERT-TA-95M | 76.9% / 45.4% | 74.2% / 45.0% | 82.5% / 32.1% | 73.0% / 45.3% | 59.1% / 38.2% | 35.7% / 21.5% | 52.4% |

πŸ“ˆ CultureMERT-TA-95M performs comparably to CultureMERT-95M on the non-Western datasets, while surpassing it on Lyra and on the Western benchmarks. It also outperforms MERT-v1-95M on the Western tasks (MTAT and FMA-medium) by an average margin of +0.7% across all metrics.
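
The scores above come from probing on frozen representations with standard multi-label tagging metrics. As a rough, hypothetical illustration of how such metrics can be computed with scikit-learn (not the exact evaluation code; the embedding and label files are placeholders):

# Hypothetical linear probe over frozen, mean-pooled embeddings (file names are placeholders)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

X_train, Y_train = np.load("train_embeddings.npy"), np.load("train_labels.npy")  # [N, 768], [N, num_tags]
X_test, Y_test = np.load("test_embeddings.npy"), np.load("test_labels.npy")

probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, Y_train)

scores = probe.predict_proba(X_test)   # per-tag probabilities
preds = (scores >= 0.5).astype(int)    # hard decisions for F1

print("ROC-AUC :", roc_auc_score(Y_test, scores, average="macro"))
print("mAP     :", average_precision_score(Y_test, scores, average="macro"))
print("Micro-F1:", f1_score(Y_test, preds, average="micro"))
print("Macro-F1:", f1_score(Y_test, preds, average="macro"))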


πŸ”§ Model Usage

from transformers import Wav2Vec2FeatureExtractor, AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load model weights and preprocessor config
model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)

# Load example audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True).sort("id")
audio_array = dataset[0]["audio"]["array"]
sampling_rate = dataset.features["audio"].sampling_rate

# Resample if needed
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'Setting sample rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# Audio file is decoded on the fly
if resampler is None:
    input_audio = audio_array
else:
    input_audio = resampler(torch.from_numpy(audio_array).to(dtype=resampler.kernel.dtype))

# Extract hidden states
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Representations: 13 layers (CNN feature extractor + 12 Transformer)
# NOTE: each layer performs differently in different downstream tasks - you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [13 layers, Time steps, 768 feature_dim]

# For utterance-level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [13, 768]

# You can even use a learnable weighted average representation over all layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [768]
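
As a follow-up, the aggregated embedding can be fed into a small task-specific head. The snippet below is only an illustrative sketch: num_tags is a placeholder and the head is untrained.

# Illustrative (untrained) downstream head on top of the aggregated embedding; num_tags is a placeholder
num_tags = 10
classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, num_tags),
)
logits = classifier(weighted_avg_hidden_states.unsqueeze(0))
print(logits.shape)  # [1, num_tags]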

Ethical Considerations

This model is released under a non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in MIR, its training data and pretraining paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities or dataset curators.

πŸ“š Citation

@misc{kanatas2025culturemertcontinualpretrainingcrosscultural,
      title={CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning}, 
      author={Angelos-Nikolaos Kanatas and Charilaos Papaioannou and Alexandros Potamianos},
      year={2025},
      eprint={2506.17818},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.17818}, 
}
