---
license: apache-2.0
base_model:
- declare-lab/mustango
pipeline_tag: text-to-audio
tags:
- music
- audio
- music-generation
- peft
---

### Exploring Adapter Design Tradeoffs for Low Resource Music Generation
[Code](https://github.com/atharva20038/ACMMM_Adapters) | [Models](https://huggingface.co/collections/athi180202/peft-adaptations-of-music-generation-models-684ba077a2a44999bb6cb175) | [Paper](https://arxiv.org/abs/2506.21298)

This repository contains the code for the paper "Exploring Adapter Design Tradeoffs for Low Resource Music Generation".

Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is computationally expensive, often requiring updates to billions of parameters and therefore significant hardware resources.
Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance.
However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which combinations produce optimal adapters, and why, for a given low-resource music genre.
In this paper, we address this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music.

## Datasets

The [CompMusic - Turkish Makam](https://compmusic.upf.edu/datasets) dataset contains 405 hours of Turkish Makam data.

The [CompMusic - Hindustani Classical](https://compmusic.upf.edu/datasets) dataset contains 305 hours of annotated Hindustani Classical data.

The Hindustani Classical dataset includes 21 different instrument types, such as the Pakhavaj, Zither, Sarangi, Ghatam, Harmonium, and Santoor, along with vocals.

The Turkish Makam dataset features 42 makam-specific instruments, such as Oud, Tanbur, Ney, Davul, Clarinet, Kös, Kudüm, Yaylı Tanbur, Tef, Kanun, Zurna, Bendir, Darbuka, Classical Kemençe, Rebab, Çevgen, and vocals. It encompasses 100 different makams and 62 distinct usuls.

## Adapter Positioning

<div align="center">
<img src="img/Architecture-1.png" width="900"/>
</div>

### Mustango
In Mustango, a Bottleneck Residual Adapter with convolution layers is integrated into the up-sampling, middle, and down-sampling blocks of the UNet, positioned just after the cross-attention block. This design facilitates cultural adaptation while preserving computational efficiency. The adapters reduce the channel dimension by a factor of 8, using a kernel size of 1 and a GELU activation after the down-projection layers to introduce non-linearity.
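
The adapter described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the repository's implementation: the class name `ConvBottleneckAdapter`, the list-of-lists feature layout, the random initialisation scale, and the zero-initialised up-projection are all assumptions made for the example. It relies on the fact that a convolution with kernel size 1 applies the same channel-mixing linear map at every position.

```python
import math
import random


def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), where Phi is the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def pointwise_conv(x, weight, bias):
    """Kernel-size-1 convolution: one linear map over channels, shared by all positions.

    x: list of positions, each a list of C_in channel values.
    weight: C_out rows of C_in values; bias: C_out values.
    """
    return [
        [sum(w * c for w, c in zip(row, pos)) + b for row, b in zip(weight, bias)]
        for pos in x
    ]


class ConvBottleneckAdapter:
    """Hypothetical bottleneck residual adapter: down-project, GELU, up-project, add input."""

    def __init__(self, channels: int, reduction: int = 8, seed: int = 0):
        rng = random.Random(seed)
        hidden = channels // reduction  # channel dimension reduced by a factor of 8
        self.down_w = [[rng.gauss(0.0, 0.02) for _ in range(channels)] for _ in range(hidden)]
        self.down_b = [0.0] * hidden
        # Assumption: zero-initialised up-projection, so the adapter starts as an identity map.
        self.up_w = [[0.0] * hidden for _ in range(channels)]
        self.up_b = [0.0] * channels

    def num_parameters(self) -> int:
        hidden, channels = len(self.down_w), len(self.up_w)
        return hidden * channels + hidden + channels * hidden + channels

    def __call__(self, x):
        h = pointwise_conv(x, self.down_w, self.down_b)   # C -> C / reduction
        h = [[gelu(v) for v in pos] for pos in h]         # non-linearity after down-projection
        h = pointwise_conv(h, self.up_w, self.up_b)       # C / reduction -> C
        return [[a + b for a, b in zip(px, ph)] for px, ph in zip(x, h)]  # residual connection


# Toy input: 4 positions with 64 channels each (sizes chosen only for illustration).
adapter = ConvBottleneckAdapter(channels=64)
x = [[float(i + j) for j in range(64)] for i in range(4)]
y = adapter(x)
```

With the zero-initialised up-projection assumed here, `y` equals `x` before any training, so inserting the adapter does not disturb the pretrained UNet at the start of fine-tuning.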

### MusicGen
In MusicGen, after thorough experimentation with other placements, we integrate a Linear Bottleneck Residual Adapter after the transformer decoder, adding 2 million parameters to the model.

Both models have ~2 billion parameters in total, so the adapter (2M parameters) accounts for only about 0.1% of the total size.
For both models, training took around 10 hours on two RTX A6000 GPUs. The adapter block was fine-tuned using the AdamW optimizer with an MSE (reconstruction) loss.
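
The parameter budget quoted above can be checked with simple arithmetic. The sketch below counts the weights and biases of a linear bottleneck adapter and expresses them relative to a 2-billion-parameter base model. The function names and the sizes `d_model=2048` and `bottleneck=512` are hypothetical, chosen only because they land near 2M parameters; they are not taken from the paper.

```python
def linear_bottleneck_params(d_model: int, bottleneck: int) -> int:
    """Weights and biases of a down-projection (d_model -> bottleneck)
    followed by an up-projection (bottleneck -> d_model)."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def adapter_fraction(adapter_params: int, base_params: int) -> float:
    """Adapter size relative to the base model's parameter count."""
    return adapter_params / base_params


# Figures stated in this README: ~2M adapter parameters, ~2B base parameters.
stated = adapter_fraction(2_000_000, 2_000_000_000)
print(f"stated adapter share: {stated:.2%}")

# Hypothetical adapter sizes (an assumption for illustration only).
example = linear_bottleneck_params(d_model=2048, bottleneck=512)
print(f"example adapter parameters: {example:,}")
```

Under these assumed sizes the adapter has roughly 2.1M parameters, which is consistent with the ~0.1% share stated above.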

## Evaluations
### **Objective Evaluation Metrics for Music Models**
<div align="center">
<img src="img/fad_fd_image-1.png" width="900"/>
</div>

For Mustango, the objective evaluation results are also available in the following Google Sheet: [Spreadsheet](https://docs.google.com/spreadsheets/d/11aHVjt8zeHyMqmIBIdV5b4pvlu8gc83510HD0nwBrjo/edit?gid=0#gid=0).

### **Human Evaluation**
Hindustani Classical - Subjective Evaluation Results
<div align="center">
<img src="img/hindustani_quality (1).png" width="900"/>
</div>

Turkish Makam - Subjective Evaluation Results
<div align="center">
<img src="img/makam (1).png" width="900"/>
</div>

## Citation
Please consider citing the following article if you found our work useful:
```
@misc{mehta2025exploringadapterdesigntradeoffs,
      title={Exploring Adapter Design Tradeoffs for Low Resource Music Generation},
      author={Atharva Mehta and Shivam Chauhan and Monojit Choudhury},
      year={2025},
      eprint={2506.21298},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.21298},
}
```