ARIA - Artistic Rendering of Images into Audio
ARIA is a multimodal AI model that generates MIDI music based on the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from images, then generates emotionally appropriate music using conditional MIDI generation.
Model Description
- Developed by: Vincent Amato
- Model type: Multimodal (Image-to-MIDI) Generation
- Language(s): English
- License: MIT
- Parent Model: Uses CLIP for image encoding and midi-emotion for music generation
- Repository: GitHub
Model Architecture
ARIA consists of two main components:
- A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
- A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values
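A minimal sketch of this two-stage design is shown below. The EmotionHead module, the 512-dimensional feature size, and the random placeholder features are illustrative assumptions rather than the repository's actual API.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the two-stage pipeline. EmotionHead, the 512-d feature
# size, and the placeholder image features are assumptions, not the repo's API.

class EmotionHead(nn.Module):
    """Regression head mapping CLIP image features to (valence, arousal)."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feature_dim, 2)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # tanh keeps the outputs in [-1, 1], a common range for valence/arousal.
        return torch.tanh(self.proj(image_features))

# Stage 1: a CLIP image encoder (not shown) produces a feature vector per image.
image_features = torch.randn(1, 512)  # stand-in for clip_model.encode_image(...)
valence, arousal = EmotionHead()(image_features).squeeze(0)

# Stage 2: the midi-emotion transformer generates MIDI tokens conditioned on
# these two values, using one of the conditioning modes described below.
print(f"valence={valence.item():.2f}, arousal={arousal.item():.2f}")
```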
The model offers three different conditioning modes:
continuous_concat (Recommended)
Creates a single condition vector from the valence and arousal values, repeats it across the sequence, and concatenates it with every music token embedding. This gives the emotion information global influence throughout the entire generation process, allowing the transformer to access emotional context at every timestep. In the midi-emotion evaluation, this method achieved the best performance in both note prediction accuracy and emotional coherence.
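A rough sketch of the tensor manipulation involved; the dimensions and the linear projection are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn as nn

# Rough sketch of continuous_concat conditioning; all dimensions are illustrative.
batch, seq_len, d_model, d_emotion = 1, 8, 512, 32

token_emb = torch.randn(batch, seq_len, d_model)   # music token embeddings
valence, arousal = 0.5, -0.2                       # predicted emotion values

# Project the 2-D emotion into a condition vector, repeat it across the
# sequence, and concatenate it with every token embedding.
emotion_proj = nn.Linear(2, d_emotion)
emotion_vec = emotion_proj(torch.tensor([[valence, arousal]]))           # (1, d_emotion)
emotion_seq = emotion_vec.unsqueeze(1).expand(batch, seq_len, d_emotion)

conditioned = torch.cat([token_emb, emotion_seq], dim=-1)  # (1, 8, d_model + d_emotion)
```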
continuous_token
Converts each emotion value (valence and arousal) into its own condition vector with the same dimensionality as the music token embeddings, then concatenates them along the sequence dimension. The emotion vectors are inserted at the beginning of the input sequence during generation. This treats emotions much like ordinary music tokens, but their influence can fade as the sequence grows longer.
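A comparable sketch for this mode, again with illustrative dimensions and projections:

```python
import torch
import torch.nn as nn

# Rough sketch of continuous_token conditioning; all dimensions are illustrative.
batch, seq_len, d_model = 1, 8, 512

token_emb = torch.randn(batch, seq_len, d_model)   # music token embeddings
valence, arousal = 0.5, -0.2

# Each emotion value gets its own d_model-sized condition vector...
valence_vec = nn.Linear(1, d_model)(torch.tensor([[valence]]))   # (1, d_model)
arousal_vec = nn.Linear(1, d_model)(torch.tensor([[arousal]]))   # (1, d_model)
condition = torch.stack([valence_vec, arousal_vec], dim=1)       # (1, 2, d_model)

# ...and is prepended along the sequence dimension, like two extra "tokens".
conditioned = torch.cat([condition, token_emb], dim=1)           # (1, 10, d_model)
```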
discrete_token
Quantizes the continuous emotion values into 5 discrete bins (very low, low, moderate, high, very high) and converts them into control tokens, which are placed before the music tokens in the sequence. While control tokens are a state-of-the-art approach in conditional text generation, this method suffers from information loss due to binning and can lose emotional context during longer generations once the control tokens are truncated out of the context window.
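A small sketch of the quantization step; the bin edges and token spellings here are assumptions for illustration only.

```python
import numpy as np

# Rough sketch of discrete_token conditioning; bin edges and token names are illustrative.
BINS = ["very_low", "low", "moderate", "high", "very_high"]

def to_control_token(name: str, value: float) -> str:
    """Quantize a continuous emotion value in [-1, 1] into one of 5 bins."""
    edges = np.linspace(-1, 1, len(BINS) + 1)[1:-1]   # interior bin edges
    idx = int(np.digitize(value, edges))
    return f"<{name}_{BINS[idx]}>"

valence, arousal = 0.5, -0.2
control_tokens = [to_control_token("valence", valence), to_control_token("arousal", arousal)]

# Control tokens precede the music tokens in the generated sequence:
sequence = control_tokens + ["<bar>", "<note_on_60>", "..."]
print(sequence)  # ['<valence_high>', '<arousal_moderate>', '<bar>', '<note_on_60>', '...']
```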
Usage
The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:
- model.pt: The trained model weights
- mappings.pt: Token mappings for MIDI generation
- model_config.pt: Model configuration

Additionally, image_encoder.pt contains the CLIP-based image emotion encoder.
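Assuming these checkpoints are standard PyTorch objects written with torch.save, and that each conditioning variant lives in its own directory (both assumptions; adapt paths to the actual repository layout), loading them might look like:

```python
import torch

# Hedged example: paths and checkpoint contents are assumptions, not the repo's API.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_config = torch.load("continuous_concat/model_config.pt", map_location=device)
mappings = torch.load("continuous_concat/mappings.pt", map_location=device)
weights = torch.load("continuous_concat/model.pt", map_location=device)
image_encoder = torch.load("image_encoder.pt", map_location=device)

print(model_config)    # architecture hyperparameters for this variant
print(type(mappings))  # token <-> index mappings used for MIDI (de)tokenization
```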
Intended Use
This model is designed for:
- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between visual and musical domains
- Creative applications in art and music generation
Limitations
- Music generation quality depends on how accurately the image encoder interprets the emotional content of the input image
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the two-dimensional valence-arousal space
Training Data
The model combines:
- Image encoder: Fine-tuned on a curated dataset of artwork with emotional annotations
- MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project
Attribution
This project builds upon:
- midi-emotion by Serkan Sulun et al. (GitHub)
- Paper: "Symbolic music generation conditioned on continuous-valued emotions" (IEEE Access)
- Citation: S. Sulun, M. E. P. Davies and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022
- CLIP by OpenAI for the base image encoder architecture
License
This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.