
SAO Instrumental Finetune

Dataset

The model was trained on a custom synthesized audio dataset, created as follows:

  1. Getting MIDI files: We started with The Lakh MIDI Dataset v0.1, specifically using its "Clean MIDI subset." We removed duplicate songs and split the MIDI files into individual tracks using pretty_midi.
  2. Rendering the MIDI files: Using pedalboard, we loaded VST3 instruments in Python, and sent MIDI events to each instrument to generate a WAV file for each track. The rendered tracks varied in volume without any musical criterion, so we normalized them using pyloudnorm. To combine all instrument audio files into a single track, we used a simple average across all tracks.
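The mixdown described in step 2 can be sketched as follows. This is a minimal stand-in, assuming each rendered track is an equal-length list of float samples; it uses simple peak normalization in place of pyloudnorm's LUFS-based loudness normalization, and both helper functions are hypothetical, not the actual pipeline scripts.

```python
def peak_normalize(track, target=0.9):
    """Scale a track so its largest absolute sample hits `target`.
    (Stand-in for pyloudnorm's LUFS-based loudness normalization.)"""
    peak = max(abs(s) for s in track)
    if peak == 0.0:
        return list(track)
    return [s * target / peak for s in track]

def mix_tracks(tracks):
    """Combine instrument tracks into a single mix via a simple
    average across all tracks, as described above."""
    normalized = [peak_normalize(t) for t in tracks]
    n = len(normalized)
    return [sum(samples) / n for samples in zip(*normalized)]

guitar = [0.5, -0.5, 0.25, 0.0]
drums = [0.1, 0.1, -0.1, 0.2]
mix = mix_tracks([guitar, drums])
print(len(mix))  # one sample per input frame -> 4
```

In the real pipeline the per-track audio comes from pedalboard-hosted VST3 instruments rather than hand-written sample lists.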
  3. Extracting metadata: Each MIDI file is named by song and artist, and from our scripts, we already knew the instruments in use. However, to create descriptive prompts, more data was needed. Using the final mix of each MIDI file, we ran deeprhythm to predict the tempo and essentia's key and scale detector to identify the key and scale. Then, with the song title and artist, we used Spotify’s API to fetch additional metadata, such as energy, acousticness, and instrumentalness. We also retrieved a list of "Sections" from Spotify to split songs (usually 2-4 minutes long) into shorter 30-second clips, ensuring musical coherence and avoiding cuts within phrases. We consulted the LastFM API for user-assigned tags, which typically provided genre information and additional descriptors like "melancholic," "warm," or "gentle," essential for characterizing the mood and style beyond what could be inferred from audio or MIDI files alone. The final result was a JSON file for each audio, which looked like this:
{
  "Song": "Here Comes The Sun - Remastered 2009",
  "Artist": "The Beatles",
  "Tags": [
    "sunshine pop",
    "rock",
    "60s",
    // ...
  ],
  "Instruments": [
    "Acoustic Guitar",
    "Drums"
  ],
  "duration_ms": 185733,
  "acousticness": 0.9339,
  "energy": 0.54,
  "key": "A",
  "mode": "Major",
  "tempo": 128,
  "sections": [
    {
      "start": 0.0,
      "duration": 16.22566
    },
    // ...
  ]
}
  4. Generating prompts: With all metadata gathered, we generated prompts. Simply concatenating the fields would have produced fixed, overly rigid prompts, detracting from the model’s flexibility to understand natural language. To avoid this, we used Llama 3.1 to transform the JSON metadata into more natural, human-like prompts. Both the system and user prompts were dynamically generated with randomized attribute order to avoid any fixed structure, and we provided few-shot examples to improve output quality. Here’s an example of a generated prompt:

"A sunshine pop/rock song of the 60s, driven by acoustic guitar and drums, set in the key of A major, with a lively tempo of 128 BPM."

Since all prompts contain similar metadata, we recommend always including the genre, a few descriptive adjectives, and the instruments you want featured, using the MIDI instrument names. The model wasn’t trained with every MIDI instrument, so some will perform better than others. You can also specify tempo, key, and scale: the model responds well to tempo cues, but is less responsive to key and scale adjustments.

Using this pipeline, we created two datasets: a monophonic one with ~10 minutes of each instrument playing solo and a polyphonic one with multiple instruments.

After creating 4 hours of audio, we trained our first model to test the training pipeline, check resources, and assess the impact of finetuning. We trained it for 1 hour on an A100 GPU (40 GB) with batch size = 8, completing 2,000 steps in 36 epochs. This model is saved as first_training_test.ckpt.

The training test was successful; even with just 4 hours of training data and 1 hour of training, listening evaluations indicated that the generations adapted to our dataset. We then generated a few more hours of audio using the same pipeline. However, the synthesized MIDI sounds often had an artificial quality due to the nature of the instruments, note playstyles, mixing, etc. We observed the model starting to generate this artificial sound, which we wanted to avoid. To address this, we created a third dataset using non-copyrighted YouTube content available for commercial use, manually tagging the metadata. The final dataset (monophonic + polyphonic + YouTube) is outlined below:

|                  | Monophonic dataset | Polyphonic dataset | YouTube dataset | Total           |
|------------------|--------------------|--------------------|-----------------|-----------------|
| # audios         | 298                | 399                | 326             | 1023            |
| Average duration | 28.4 sec.          | 32.4 sec.          | 33.6 sec.       | -               |
| Total duration   | 141 min.           | 215 min.           | 182 min.        | 538 min. (~9 h) |

Training

Using all three datasets, we conducted the main training, allowing it to run for 5 hours on an A100 GPU with batch size = 16. In those 5 hours, we completed 4,000 steps over 63 epochs.
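As a quick consistency check on these numbers, assuming each epoch visits every one of the 1,023 clips from the table above exactly once (the real loader's handling of partial batches and clip chunking may differ slightly):

```python
import math

num_audios = 1023   # total clips across the three datasets
batch_size = 16
epochs = 63

# One optimizer step per batch; the last batch of each epoch is partial.
steps_per_epoch = math.ceil(num_audios / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 64 steps/epoch, 4032 total
```

4,032 estimated steps is close to the reported 4,000, so the step, batch-size, and epoch counts are mutually consistent.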

Results

Listening test

Our first evaluation was an informal listening comparison between Stable Audio Open (SAO) and our Instrumental Finetune using new prompts following the recommendations stated above. The results were promising: the generated audio better reflected the specified instruments, genre, and overall feel. Here are a few examples across different genres:

Jazz

"An upbeat Jazz piece featuring Swing Drums, Trumpet, and Piano. In the key of B Major at 120 BPM, this track brings a lively energy with playful trumpet melodies and syncopated rhythms that invite listeners to tap their feet."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

"A smooth Jazz track featuring an Acoustic Piano melody, Upright Bass, and soft Jazz Drums with Cymbals. Set in C Minor at 90 BPM, the track evokes a late-night lounge feel with delicate piano improvisation."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

"A smooth Jazz track featuring a mellow Saxophone, soft Jazz Piano, and subtle Upright Bass. Set in G# Major at 90 BPM, this composition evokes a cozy lounge vibe with rich harmonies and a laid-back rhythm that invites relaxation."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

Rock

"A high-energy Rock track featuring Overdriven Electric Guitar, Punchy Drums, and Plucked Bass. Set in A# Major at 140 BPM, this piece showcases a catchy guitar riff and an infectious groove, perfect for driving an adrenaline-fueled atmosphere."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

"A gritty Hard Rock track featuring Overdriven Guitar riffs, Punchy Drums, and Electric Bass. Set in C Minor at 135 BPM, the track delivers an energetic and raw atmosphere, highlighted by a soaring guitar solo."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

"A moody Alternative Rock piece with clean electric guitar and a driving bassline. Set in D Minor at 118 BPM, this track captures a melancholic yet intense vibe, with atmospheric synths and echoing guitar riffs."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

Pop

"A catchy pop track in D Major at 120 BPM featuring bright Synth Chords, punchy Drums, and a bouncy Electric Bassline. The infectious melody and upbeat rhythm create a feel-good vibe."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

"A catchy Synth Pop track featuring bright synth leads, punchy drums, and a driving bassline. In D Minor at 120 BPM, the upbeat rhythm and infectious melody create an energetic, danceable sound."

*Generated audio: Stable Audio Open (SAO) vs. SAO Instrumental Finetune*

Tempo analysis

To evaluate whether the model improved at following the requested tempo, we conducted an experiment: we created two prompts for every tempo between 75 and 160 BPM (the range of most music), one for a rock track and one for a jazz track. We used two genres to see whether tempo adaptation depended on genre, given their distinct tempo distributions. After generating the tracks and using deeprhythm to predict their tempos, we calculated Accuracy 1 (percentage within 2% of the expected tempo) and Accuracy 2 (percentage within 2% of the expected tempo scaled by any ratio of 1x, 2x, (1/2)x, 3x, or (1/3)x). Accuracy 2 is common in tempo-prediction models, as it accounts for differences in metrical interpretation. Results are as follows:

|            | Stable Audio Open (SAO) | SAO Instrumental Finetune |
|------------|-------------------------|---------------------------|
| Accuracy 1 | 0.747                   | 0.876                     |
| Accuracy 2 | 0.776                   | 0.953                     |

Results showed a notable improvement in tempo accuracy with our fine-tune, which closely followed the specified tempos. We also calculated accuracies for each genre separately and found consistent results, indicating no strong genre dependency in tempo adherence.
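The two tempo-accuracy metrics can be sketched as below, a minimal reimplementation from their descriptions above (`tempo_accuracy` is a hypothetical helper, not the authors' evaluation code):

```python
def within_2pct(pred, target):
    """True if the predicted tempo is within 2% of the target tempo."""
    return abs(pred - target) <= 0.02 * target

def tempo_accuracy(pairs, ratios=(1.0,)):
    """Fraction of (predicted, requested) tempo pairs that fall within 2%
    of the requested tempo scaled by any of the given ratios."""
    hits = sum(
        any(within_2pct(pred, target * r) for r in ratios)
        for pred, target in pairs
    )
    return hits / len(pairs)

# Toy predictions: exact, halved (octave error), and tripled tempo.
pairs = [(120.5, 120), (61.0, 120), (240.1, 80)]
acc1 = tempo_accuracy(pairs)                                     # exact tempo only
acc2 = tempo_accuracy(pairs, ratios=(1.0, 2.0, 0.5, 3.0, 1 / 3))  # metrical-level errors allowed
print(acc1, acc2)
```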

Key and scale analysis

Similar to tempo, we tested key and scale by generating eight prompts for each combination of {C, C#, ..., A#, B} × {Minor, Major}, spread across rock, jazz, hip hop, and pop. We measured accuracy for both models: 0.21 for SAO and 0.26 for our finetune. While an improvement, it remains a low score. On key alone, both models performed similarly, but when our model matched the correct key, it was more likely to also match the correct scale. The confusion matrix showed our model to be more consistent across keys: of the 16 prompts requesting the key of B, SAO produced only 1 generation in B, while our model produced 6.
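The key/scale scoring can be sketched as below: a minimal, hypothetical scorer that counts exact key+scale matches and tallies how often each key is actually generated (the per-key tallies are what the confusion-matrix analysis looks at). Enharmonic spellings are assumed to be normalized beforehand.

```python
from collections import Counter

def key_scale_accuracy(results):
    """results: list of ((requested_key, requested_scale),
                         (detected_key, detected_scale)) pairs."""
    correct = sum(req == det for req, det in results)
    return correct / len(results)

def generated_key_counts(results):
    """How many generations landed in each key, regardless of request."""
    return Counter(det_key for _, (det_key, _) in results)

results = [
    (("B", "Major"), ("B", "Major")),  # key and scale correct
    (("B", "Minor"), ("B", "Major")),  # key correct, scale wrong
    (("C", "Major"), ("G", "Major")),  # key wrong
]
print(key_scale_accuracy(results))         # one exact match out of three
print(generated_key_counts(results)["B"])  # two generations landed in B
```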

Genre dependency was present in key and scale accuracy, e.g., SAO scored 0.125 in hip hop compared to our model’s 0.21, whereas in rock, SAO scored 0.27 and ours 0.31.

The model's limited performance in key and scale accuracy is an expected outcome, as it relies on latent audio representations rather than explicit musical data, like note sequences from MIDI. Unlike models trained directly on MIDI files—which contain clear, structured information on pitch, timing, and other musical elements—this diffusion model must infer these details from complex audio representations. As a result, it doesn’t have built-in access to key, scale, or note information, requiring it to learn music-theory-related patterns implicitly from the data. Additionally, because the dataset includes multiple instruments playing simultaneously, the model faces the challenge of not only learning individual notes but also understanding the harmonic relationships between instruments. Even a minor deviation in the pitch of one instrument can disrupt the overall perception of harmony, resulting in the entire composition sounding off-key or discordant. Thus, to achieve consistent accuracy in key and scale, the model would need to learn both the underlying musical theory and the intricate ways in which instruments interact in a harmonious structure—tasks that are highly challenging for models based solely on audio data rather than symbolic representations like MIDI.

Memorization analysis

Finally, we generated audio using prompts from the training data and compared it to both the original training audio and SAO-generated audio. This tested for memorization and adaptation to the training data. We found no evidence of memorization, and our model demonstrated a strong stylistic shift toward the training data.
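The card does not detail how the comparison was performed; one common memorization screen (not necessarily the method used here) is cosine similarity between embeddings of generated and training audio, flagging near-duplicates above a threshold. A minimal sketch over generic feature vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def flag_memorized(generated, training_embeddings, threshold=0.95):
    """Return indices of training clips suspiciously close to a generation."""
    return [
        i for i, emb in enumerate(training_embeddings)
        if cosine_similarity(generated, emb) >= threshold
    ]

gen = [0.1, 0.9, 0.3]
train = [[0.1, 0.9, 0.3], [0.9, 0.1, 0.2]]
print(flag_memorized(gen, train))  # identical embedding -> [0]
```

The embedding model and threshold are assumptions; in practice the threshold would be calibrated against similarities between unrelated clips.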

Hard Rock

"A fun metal groove with syncopated rhythm featuring distortion guitar, plucked electric bass, and drums in D Minor at 96 BPM. Dive into the edgy soundscape with an intro that sets the stage for a high-energy and intense listening experience."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

"A thunderous djent guitars collide with a dark, loud, and heavy mix in this hard rock track. Featuring distortion guitar, plucked electric bass, and drums, the intense sound is elevated by a higher guitar melody solo. Set in A Major, the track pulses at 149 BPM, delivering a powerful and gripping listening experience."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

Rock / Blues

"A classic Rock Blues track featuring an overdriven guitar and drums, with a blues scale and pentatonic-driven sound. The overdriven guitar delivers a soulful solo improvisation over the energetic drums, creating an electrifying atmosphere. In the key of A Major and set at 117 BPM, this piece exudes a raw and passionate feel, perfectly capturing the essence of blues rock."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

"A lively instrumental cover featuring Electric Clean Guitar, Plucked Electric Bass, and Drums. With fast strumming rhythm guitar and a fun walking bassline, this track captures the classic rock essence with a touch of British flair. Reminiscent of 60s pop rock and oldies, set in E Major at 78 BPM."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

Funk

"A groovy disco track featuring fast funky rhythmic guitars, strings with a violin melody intro, and a brass groove section. Accompanied by drums and plucked electric bass, the Electric Clean Guitar, Brass, Drums, Plucked Electric Bass create a vibrant sound in the key of D Minor at 115 BPM. The 80s influence is evident in this energetic and danceable composition."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

"A funky fast strumming guitar sets the groove, accompanied by drums, plucked electric bass, and guitar licks fills. In the key of C Major with a tempo of 120 BPM, this Funk Blues track exudes a lively and rhythmic vibe."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

Indie

"A dreamy synth melody accompanied by a soothing combination of Synth Pad, Synth Lead, Synth Bass, and Drums creates a relaxing and atmospheric sound. With elements of big reverb, this instrumental cover in E Minor at 129 BPM exudes a chill vibe typical of synth pop, lo-fi, indie, indie pop, psychedelic pop, and dream pop genres."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

"A melancholic Indie track in the key of G Minor with a tempo of 142 BPM featuring Electric Clean Guitar, Plucked Electric Bass, and Drums. With elements of alternative rock and post-punk, this music showcases guitars melodies and rhythm."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

Post Punk

"A post-punk track featuring an electric clean guitar, plucked electric bass, and drums. The leading guitar melody with post-punk pedals and electronic drums create a captivating atmosphere. Follow the bass line as it guides you through the melancholic key of B Minor at a tempo of 80 BPM."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

"A coldish Indie track in G Major at 85 BPM featuring electric clean guitar, plucked electric bass, and drums. With elements of post punk riff and alternative, this song captures the essence of indie rock."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

Lo-fi / Hip Hop

"A dreamy Lo-fi track featuring warm electric piano chords and a chill hop beat, with a bright electric piano melody at the end. Set in the key of E Major at 89 BPM, the soothing sounds of an Electric Grand Piano and Drums create a relaxed and atmospheric vibe."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

"A laid-back Hip Hop track with a dirty piano loop, featuring drums, acoustic grand piano, and plucked electric bass. Set in F Major at 78 BPM, this beat-driven piece showcases a prominent bassline that sets the groove."

*Generated audio: original training audio · Stable Audio Open (SAO) · SAO Instrumental Finetune*

Model tree for santifiorino/SAO-Instrumental-Finetune
