tags:
- audio
- music-generation
- peft
---
### Exploring Adapter Design Tradeoffs for Low Resource Music Generation

[Code](https://github.com/atharva20038/ACMMM_Adapters/edit/main) | [Models](https://huggingface.co/collections/athi180202/peft-adaptations-of-music-generation-models-684ba077a2a44999bb6cb175) | [Paper](https://arxiv.org/abs/2506.21298)

This repository contains our code for the paper "Exploring Adapter Design Tradeoffs for Low Resource Music Generation".

Fine-tuning large-scale music generation models such as MusicGen and Mustango is computationally expensive, often requiring updates to billions of parameters and, therefore, significant hardware resources.

Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with few trainable parameters while preserving model performance.

However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which combinations produce optimal adapters, and why, for a given low-resource music genre.

In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music.

## Datasets

The [CompMusic - Turkish Makam](https://compmusic.upf.edu/datasets) dataset contains 405 hours of Turkish Makam and Hindustani Classical data.

The [CompMusic - Hindustani Classical](https://compmusic.upf.edu/datasets) dataset contains 305 hours of annotated Hindustani Classical data.

The Hindustani Classical dataset includes 21 different instrument types, such as the Pakhavaj, Zither, Sarangi, Ghatam, Harmonium, and Santoor, along with vocals.

The Turkish Makam dataset features 42 makam-specific instruments, such as the Oud, Tanbur, Ney, Davul, Clarinet, Kös, Kudüm, Yaylı Tanbur, Tef, Kanun, Zurna, Bendir, Darbuka, Classical Kemençe, Rebab, and Çevgen, along with vocals. It encompasses 100 different makams and 62 distinct usuls.

## Adapter Positioning

<div align="center">
<img src="img/Architecture-1.png" width="900"/>
</div>

### Mustango

In Mustango, a Bottleneck Residual Adapter with convolution layers is integrated into the up-sampling, middle, and down-sampling blocks of the UNet, positioned just after the cross-attention block. This design facilitates cultural adaptation while preserving computational efficiency. The adapters reduce channel dimensions by a factor of 8, using a kernel size of 1 and a GeLU activation after the down-projection layers to introduce non-linearity.
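
A minimal PyTorch sketch of this kind of convolutional bottleneck residual adapter (illustrative only; the channel counts, tensor shapes, and class name below are placeholders, not the exact implementation used in the paper):

```python
import torch
import torch.nn as nn

class ConvBottleneckAdapter(nn.Module):
    """Bottleneck residual adapter built from 1x1 convolutions.

    Channels are reduced by `reduction` (8 in the setup described above),
    a GeLU follows the down-projection, and the input is added back
    through a residual connection.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.down = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path: the base UNet activations pass through unchanged,
        # and the adapter only learns a small correction on top of them.
        return x + self.up(self.act(self.down(x)))

# Example: adapt a feature map coming out of a cross-attention block.
features = torch.randn(2, 320, 32, 32)   # (batch, channels, h, w) - placeholder shape
adapter = ConvBottleneckAdapter(channels=320)
out = adapter(features)                  # same shape as the input
```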

### MusicGen

In MusicGen, we enhance the model with an additional 2 million parameters by integrating a Linear Bottleneck Residual Adapter after the transformer decoder, a placement chosen after thorough experimentation with other options.
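
A corresponding sketch of a linear bottleneck residual adapter applied to decoder hidden states (again illustrative; the hidden size and reduction factor are placeholders, and the real parameter count depends on the model's hidden dimension):

```python
import torch
import torch.nn as nn

class LinearBottleneckAdapter(nn.Module):
    """Linear down-projection -> GeLU -> up-projection, with a residual add."""

    def __init__(self, hidden_size: int, reduction: int = 8):
        super().__init__()
        bottleneck = max(hidden_size // reduction, 1)
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, sequence_length, hidden_size) from the decoder
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example with placeholder shapes.
hidden_states = torch.randn(2, 50, 1024)
adapted = LinearBottleneckAdapter(hidden_size=1024)(hidden_states)
```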

The total parameter count of both models is ~2 billion, making the adapter only about 0.1% of the total size (~2M parameters).

For both models, we trained on two RTX A6000 GPUs for around 10 hours. Only the adapter block was fine-tuned, using the AdamW optimizer with an MSE (reconstruction) loss.
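
A minimal sketch of this training setup, assuming a generic pretrained backbone; the module, variable names, and shapes below are stand-ins, not the actual training script:

```python
import torch
import torch.nn as nn

# Placeholders: `base_model` stands in for the pretrained MusicGen/Mustango
# backbone and `adapter` for the inserted bottleneck module.
base_model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
adapter = nn.Sequential(nn.Linear(64, 8), nn.GELU(), nn.Linear(8, 64))

# Freeze the backbone so only the adapter receives gradient updates.
for param in base_model.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
criterion = nn.MSELoss()   # reconstruction loss, as described above

inputs = torch.randn(2, 10, 64)    # dummy batch: (batch, time, features)
targets = torch.randn(2, 10, 64)   # dummy reconstruction targets

optimizer.zero_grad()
hidden = base_model(inputs)               # frozen forward pass
prediction = hidden + adapter(hidden)     # residual adapter on top
loss = criterion(prediction, targets)
loss.backward()
optimizer.step()
```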

## Evaluations

### **Objective Evaluation Metrics for Music Models**

<div align="center">
<img src="img/fad_fd_image-1.png" width="900"/>
</div>

For Mustango, the objective evaluation results are also available in the following Google Sheet: [Spreadsheet](https://docs.google.com/spreadsheets/d/11aHVjt8zeHyMqmIBIdV5b4pvlu8gc83510HD0nwBrjo/edit?gid=0#gid=0).
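
For reference, FAD/FD-style metrics compare Gaussian fits of embedding distributions from reference and generated audio. A minimal sketch of that Fréchet distance computation, assuming embeddings have already been extracted with a suitable audio encoder (the function name and array shapes are illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two embedding sets.

    ref_emb, gen_emb: arrays of shape (num_clips, embedding_dim), e.g. from
    a pretrained audio encoder (which encoder is used depends on the metric).
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # Matrix square root of the covariance product; drop numerical imaginary noise.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    covmean = covmean.real if np.iscomplexobj(covmean) else covmean

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```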

### **Human Evaluation**

Hindustani Classical - Subjective Evaluation Results
<div align="center">
<img src="img/hindustani_quality (1).png" width="900"/>
</div>

Turkish Makam - Subjective Evaluation Results
<div align="center">
<img src="img/makam (1).png" width="900"/>
</div>

## Citation

Please consider citing the following article if you found our work useful:

```
@misc{
}
```