ARIA - Artistic Rendering of Images into Audio
ARIA is a multimodal AI model that generates MIDI music based on the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from images, then generates emotionally appropriate music using conditional MIDI generation.
Model Description
- Developed by: Vincent Amato
- Model type: Multimodal (Image-to-MIDI) Generation
- Language(s): English
- License: MIT
- Parent Model: Uses CLIP for image encoding and midi-emotion for music generation
- Repository: GitHub
Model Architecture
ARIA consists of two main components:
- A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
- A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values
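A minimal sketch of this two-stage design is shown below. The EmotionHead module, the 512-dimensional feature size, and the random placeholder features are illustrative assumptions rather than the repository's actual API.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the two-stage pipeline. EmotionHead, the 512-d feature
# size, and the placeholder image features are assumptions, not the repo's API.

class EmotionHead(nn.Module):
    """Regression head mapping CLIP image features to (valence, arousal)."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feature_dim, 2)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # tanh keeps the outputs in [-1, 1], a common range for valence/arousal.
        return torch.tanh(self.proj(image_features))

# Stage 1: a CLIP image encoder (not shown) produces a feature vector per image.
image_features = torch.randn(1, 512)  # stand-in for clip_model.encode_image(...)
valence, arousal = EmotionHead()(image_features).squeeze(0)

# Stage 2: the midi-emotion transformer generates MIDI tokens conditioned on
# these two values, using one of the conditioning modes described below.
print(f"valence={valence.item():.2f}, arousal={arousal.item():.2f}")
```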
The model offers three different conditioning modes:
continuous_concat (Recommended)
Creates a single condition vector from the valence and arousal values, repeats it across the sequence, and concatenates it with every music token embedding. This gives the emotion information global influence throughout the entire generation process, allowing the transformer to access emotional context at every timestep. In the midi-emotion evaluation, this method achieved the best performance in both note prediction accuracy and emotional coherence.
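A rough sketch of the tensor manipulation involved; the dimensions and the linear projection are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn as nn

# Rough sketch of continuous_concat conditioning; all dimensions are illustrative.
batch, seq_len, d_model, d_emotion = 1, 8, 512, 32

token_emb = torch.randn(batch, seq_len, d_model)   # music token embeddings
valence, arousal = 0.5, -0.2                       # predicted emotion values

# Project the 2-D emotion into a condition vector, repeat it across the
# sequence, and concatenate it with every token embedding.
emotion_proj = nn.Linear(2, d_emotion)
emotion_vec = emotion_proj(torch.tensor([[valence, arousal]]))           # (1, d_emotion)
emotion_seq = emotion_vec.unsqueeze(1).expand(batch, seq_len, d_emotion)

conditioned = torch.cat([token_emb, emotion_seq], dim=-1)  # (1, 8, d_model + d_emotion)
```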
continuous_token
Converts each emotion value (valence and arousal) into its own condition vector with the same dimensionality as the music token embeddings, then concatenates them along the sequence dimension. The emotion vectors are inserted at the beginning of the input sequence during generation. This treats emotions much like ordinary music tokens, but their influence can fade as the sequence grows longer.
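A comparable sketch for this mode, again with illustrative dimensions and projections:

```python
import torch
import torch.nn as nn

# Rough sketch of continuous_token conditioning; all dimensions are illustrative.
batch, seq_len, d_model = 1, 8, 512

token_emb = torch.randn(batch, seq_len, d_model)   # music token embeddings
valence, arousal = 0.5, -0.2

# Each emotion value gets its own d_model-sized condition vector...
valence_vec = nn.Linear(1, d_model)(torch.tensor([[valence]]))   # (1, d_model)
arousal_vec = nn.Linear(1, d_model)(torch.tensor([[arousal]]))   # (1, d_model)
condition = torch.stack([valence_vec, arousal_vec], dim=1)       # (1, 2, d_model)

# ...and is prepended along the sequence dimension, like two extra "tokens".
conditioned = torch.cat([condition, token_emb], dim=1)           # (1, 10, d_model)
```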
discrete_token
Quantizes the continuous emotion values into 5 discrete bins (very low, low, moderate, high, very high) and converts them into control tokens, which are placed before the music tokens in the sequence. While control tokens are a state-of-the-art approach in conditional text generation, this method suffers from information loss due to binning and can lose emotional context during longer generations once the control tokens are truncated out of the context window.
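A small sketch of the quantization step; the bin edges and token spellings here are assumptions for illustration only.

```python
import numpy as np

# Rough sketch of discrete_token conditioning; bin edges and token names are illustrative.
BINS = ["very_low", "low", "moderate", "high", "very_high"]

def to_control_token(name: str, value: float) -> str:
    """Quantize a continuous emotion value in [-1, 1] into one of 5 bins."""
    edges = np.linspace(-1, 1, len(BINS) + 1)[1:-1]   # interior bin edges
    idx = int(np.digitize(value, edges))
    return f"<{name}_{BINS[idx]}>"

valence, arousal = 0.5, -0.2
control_tokens = [to_control_token("valence", valence), to_control_token("arousal", arousal)]

# Control tokens precede the music tokens in the generated sequence:
sequence = control_tokens + ["<bar>", "<note_on_60>", "..."]
print(sequence)  # ['<valence_high>', '<arousal_moderate>', '<bar>', '<note_on_60>', '...']
```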
Usage
The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:
- model.pt: The trained model weights
- mappings.pt: Token mappings for MIDI generation
- model_config.pt: Model configuration

Additionally, image_encoder.pt contains the CLIP-based image emotion encoder.
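Assuming these checkpoints are standard PyTorch objects written with torch.save, and that each conditioning variant lives in its own directory (both assumptions; adapt paths to the actual repository layout), loading them might look like:

```python
import torch

# Hedged example: paths and checkpoint contents are assumptions, not the repo's API.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_config = torch.load("continuous_concat/model_config.pt", map_location=device)
mappings = torch.load("continuous_concat/mappings.pt", map_location=device)
weights = torch.load("continuous_concat/model.pt", map_location=device)
image_encoder = torch.load("image_encoder.pt", map_location=device)

print(model_config)    # architecture hyperparameters for this variant
print(type(mappings))  # token <-> index mappings used for MIDI (de)tokenization
```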
Intended Use
This model is designed for:
- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between visual and musical domains
- Creative applications in art and music generation
Limitations
- Music generation quality depends on how accurately the image encoder interprets the emotional content of the input image
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the two-dimensional valence-arousal space
Training Data
The model combines:
- Image encoder: Fine-tuned on a curated dataset of artwork with emotional annotations
- MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project
Attribution
This project builds upon:
- midi-emotion by Serkan Sulun et al. (GitHub)
- Paper: "Symbolic music generation conditioned on continuous-valued emotions" (IEEE Access)
- Citation: S. Sulun, M. E. P. Davies and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022
- CLIP by OpenAI for the base image encoder architecture
License
This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.