# AudioCraft conditioning modules
AudioCraft provides a
[modular implementation of conditioning modules](../audiocraft/modules/conditioners.py)
that can be used with the language model to condition the generation.
The codebase is designed so that the set of supported modules can be easily
extended, making it straightforward to develop new ways of controlling the generation.
## Conditioning methods

For now, we support 3 main types of conditioning within AudioCraft:
* Text-based conditioning methods
* Waveform-based conditioning methods
* Joint embedding conditioning methods for text and audio projected in a shared latent space.

The language model relies on 2 core components to process conditioning information:
* The `ConditionProvider` class, which maps metadata to processed conditions using
all the conditioners defined for the given task.
* The `ConditionFuser` class, which takes the preprocessed conditions and fuses the
conditioning embeddings with the language model inputs following a given fusing strategy.

Different conditioners (for text, waveform, joint embeddings...) are provided as torch
modules in AudioCraft and are used internally in the language model to process the
conditioning signals and feed them to the language model.
## Core concepts
### Conditioners
The `BaseConditioner` torch module is the base implementation for all conditioners in AudioCraft.

Each conditioner is expected to implement 2 methods:
* The `tokenize` method, a preprocessing step that contains all processing
that can lead to synchronization points (e.g. BPE tokenization with transfer to the GPU).
The output of the `tokenize` method is then fed to the `forward` method.
* The `forward` method, which takes the output of the `tokenize` method and contains the core computation
to obtain the conditioning embedding along with a mask indicating valid indices (e.g. to mask out padding tokens).
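
The two-step contract described above can be illustrated with a minimal, framework-free sketch. `ToyLUTConditioner` is a hypothetical class written for this example, not AudioCraft code: plain Python lists stand in for tensors, and the whitespace tokenizer and reserved padding index are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Tokenized = Tuple[List[List[int]], List[List[int]]]


class ToyLUTConditioner:
    """Hypothetical stand-in for a conditioner (not the AudioCraft implementation)."""

    def __init__(self, vocab: List[str], dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Assumption: index 0 is reserved for padding, real tokens start at 1.
        self.token_to_id: Dict[str, int] = {tok: i + 1 for i, tok in enumerate(vocab)}
        self.table = [[0.0] * dim] + [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab
        ]

    def tokenize(self, texts: List[str]) -> Tokenized:
        """Preprocessing step: whitespace split, id lookup, padding to a common length."""
        ids = [[self.token_to_id[word] for word in text.split()] for text in texts]
        max_len = max(len(seq) for seq in ids)
        mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
        padded = [seq + [0] * (max_len - len(seq)) for seq in ids]
        return padded, mask

    def forward(self, tokenized: Tokenized):
        """Core computation: embedding lookup, returned together with the validity mask."""
        padded, mask = tokenized
        embedding = [[list(self.table[i]) for i in seq] for seq in padded]
        return embedding, mask


cond = ToyLUTConditioner(vocab=["rock", "jazz", "slow", "fast"], dim=4)
tokens = cond.tokenize(["slow jazz", "rock"])
embedding, mask = cond.forward(tokens)
```

Here `mask` marks the second position of the second item as padding, which is the kind of information a downstream consumer needs in order to ignore padded steps.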
### ConditionProvider
The `ConditionProvider` prepares and provides conditions given a dictionary of conditioners.

Conditioners are specified as a dictionary mapping each attribute to the conditioner
that provides the processing logic for that attribute.

Similarly to the conditioners, the condition provider works in two steps to avoid synchronization points:
* A `tokenize` method that takes a list of conditioning attributes for the batch
and runs all tokenize steps for the set of conditioners.
* A `forward` method that takes the output of the tokenize step and runs all the forward steps
for the set of conditioners.

The conditioning attributes are passed as a list of `ConditioningAttributes`,
which are described in the section on `SegmentWithAttributes` and `ConditioningAttributes` below.
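
As a rough illustration of this two-step dispatch, the sketch below maps attribute names to toy conditioners and runs all tokenize steps before any forward step. `LengthConditioner` and `ToyConditionProvider` are hypothetical names, plain dictionaries stand in for the conditioning attributes, and the "embedding" is deliberately trivial; none of this mirrors the actual AudioCraft signatures.

```python
from typing import Any, Dict, List, Optional, Tuple


class LengthConditioner:
    """Hypothetical toy conditioner: the 'embedding' of a text is its word count."""

    def tokenize(self, values: List[Optional[str]]) -> List[List[str]]:
        # Missing values (None) become empty token lists.
        return [v.split() if v is not None else [] for v in values]

    def forward(self, tokens: List[List[str]]) -> Tuple[List[List[float]], List[int]]:
        embedding = [[float(len(t))] for t in tokens]
        mask = [1 if t else 0 for t in tokens]  # 0 marks a missing condition
        return embedding, mask


class ToyConditionProvider:
    """Hypothetical provider: one conditioner per attribute name."""

    def __init__(self, conditioners: Dict[str, Any]) -> None:
        self.conditioners = conditioners

    def tokenize(self, batch: List[Dict[str, Optional[str]]]) -> Dict[str, Any]:
        # Step 1: run every conditioner's tokenize on its attribute across the batch.
        return {
            name: conditioner.tokenize([item.get(name) for item in batch])
            for name, conditioner in self.conditioners.items()
        }

    def forward(self, tokenized: Dict[str, Any]) -> Dict[str, Any]:
        # Step 2: run every conditioner's forward on the tokenized output.
        return {
            name: self.conditioners[name].forward(tokens)
            for name, tokens in tokenized.items()
        }


provider = ToyConditionProvider(
    {"description": LengthConditioner(), "genre": LengthConditioner()}
)
batch = [
    {"description": "a calm piano piece", "genre": "classical"},
    {"description": "upbeat drums", "genre": None},
]
conditions = provider.forward(provider.tokenize(batch))
```

The result is a dictionary keyed by attribute name, each entry holding an embedding and its mask.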
### ConditionFuser
Once all conditioning signals have been extracted and processed by the `ConditionProvider`
as dense embeddings, they still need to be passed to the language model along with the original
language model inputs.

The `ConditionFuser` specifically handles the logic to combine the different conditions
with the actual model input, supporting different strategies to combine them.

One can therefore define different strategies to combine or fuse the condition to the input, in particular:
* Prepending the conditioning signal to the input with the `prepend` strategy,
* Summing the conditioning signal to the input with the `sum` strategy,
* Combining the conditioning relying on a cross-attention mechanism with the `cross` strategy,
* Using input interpolation with the `input_interpolate` strategy.
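
To make the difference between strategies concrete, here is a minimal sketch of `prepend`, `sum` and `cross` on sequences represented as Python lists of shape `[time][dim]`. The `fuse` function, its arguments and the broadcasting rule for `sum` are assumptions made for this example rather than the `ConditionFuser` API, and `input_interpolate` is left out.

```python
from typing import Dict, List, Optional, Tuple

Sequence = List[List[float]]  # [time][dim]


def fuse(
    x: Sequence,
    conditions: Dict[str, Sequence],
    strategy: Dict[str, str],
) -> Tuple[Sequence, Optional[Sequence]]:
    """Hypothetical fuser: returns the fused input and the cross-attention source."""
    cross: Optional[Sequence] = None
    for name, cond in conditions.items():
        op = strategy[name]
        if op == "prepend":
            # The condition becomes extra time steps placed before the input.
            x = [list(v) for v in cond] + x
        elif op == "sum":
            # Assumption: a single-step condition is broadcast over time.
            if len(cond) == 1:
                cond = cond * len(x)
            if len(cond) != len(x):
                raise ValueError(f"cannot sum condition {name!r}: length mismatch")
            x = [[a + b for a, b in zip(xv, cv)] for xv, cv in zip(x, cond)]
        elif op == "cross":
            # The input is left untouched; the condition is handed to cross-attention.
            cross = (cross or []) + [list(v) for v in cond]
        else:
            raise ValueError(f"unknown fusing strategy: {op}")
    return x, cross


x = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
conditions = {
    "genre": [[0.5, 0.5]],
    "description": [[9.0, 9.0], [8.0, 8.0]],
    "melody": [[7.0, 7.0]],
}
strategy = {"genre": "sum", "description": "prepend", "melody": "cross"}
fused, cross = fuse(x, conditions, strategy)
```

Note that `prepend` lengthens the input sequence while `sum` and `cross` keep its length, which is one practical reason to choose between them.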
### SegmentWithAttributes and ConditioningAttributes: From metadata to conditions
The `ConditioningAttributes` dataclass is the base class for metadata
containing all attributes used for conditioning the language model.

It currently supports the following types of attributes:
* Text conditioning attributes: Dictionary of textual attributes used for text-conditioning.
* Wav conditioning attributes: Dictionary of waveform attributes used for waveform-based
conditioning such as the chroma conditioning.
* JointEmbed conditioning attributes: Dictionary of text and waveform attributes
that are expected to be represented in a shared latent space.

These are the attributes that the different conditioners process.
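
The three attribute types can be pictured with a simplified dataclass. `ToyConditioningAttributes` is a hypothetical stand-in written for this example: the field names, the use of a list of floats in place of a waveform tensor, and the attribute keys are illustrative assumptions, not the real class definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyConditioningAttributes:
    """Hypothetical simplification of a conditioning-attributes container."""

    # Textual attributes, e.g. a free-form description; None means "not provided".
    text: Dict[str, Optional[str]] = field(default_factory=dict)
    # Waveform attributes; raw samples stand in for a waveform tensor here.
    wav: Dict[str, List[float]] = field(default_factory=dict)
    # Text/audio pairs meant to live in a shared latent space.
    joint_embed: Dict[str, Dict[str, object]] = field(default_factory=dict)

    @property
    def attributes(self) -> Dict[str, List[str]]:
        """Names of the attributes available for each conditioning type."""
        return {
            "text": list(self.text),
            "wav": list(self.wav),
            "joint_embed": list(self.joint_embed),
        }


attrs = ToyConditioningAttributes(
    text={"description": "warm lo-fi beat", "genre": None},
    wav={"melody": [0.0, 0.1, -0.1]},  # illustrative key and samples
)
names = attrs.attributes
```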
`ConditioningAttributes` are extracted from metadata loaded alongside the audio in the datasets,
provided that the metadata used by the dataset implements the `SegmentWithAttributes` abstraction.

All metadata-enabled datasets used for conditioning in AudioCraft inherit from
the [`audiocraft.data.info_audio_dataset.InfoAudioDataset`](../audiocraft/data/info_audio_dataset.py) class,
and the corresponding metadata inherits from and implements the `SegmentWithAttributes` abstraction.
Refer to the [`audiocraft.data.music_dataset.MusicAudioDataset`](../audiocraft/data/music_dataset.py)
class as an example.
## Available conditioners
### Text conditioners
All text conditioners are expected to inherit from the `TextConditioner` class.

AudioCraft currently provides two text conditioners:
* The `LUTConditioner`, which relies on a look-up table of embeddings learned at train time
and uses either no tokenizer or a spaCy tokenizer. This conditioner is particularly
useful for simple experiments and categorical labels.
* The `T5Conditioner`, which relies on a
[pre-trained T5 model](https://huggingface.co/docs/transformers/model_doc/t5),
frozen or fine-tuned at train time, to extract the text embeddings.
### Waveform conditioners
All waveform conditioners are expected to inherit from the `WaveformConditioner` class and
consist of a conditioning method that takes a waveform as input. The waveform conditioner
must implement the logic to extract the embedding from the waveform and define the downsampling
factor from the waveform to the resulting embedding.

The `ChromaStemConditioner` is a waveform conditioner for the chroma features
conditioning used by MusicGen. It takes a given waveform, extracts the stems relevant for melody
(namely all stems other than drums and bass) using a
[pre-trained Demucs model](https://github.com/facebookresearch/demucs),
and then extracts the chromagram bins from the remaining mix of stems.
### Joint embeddings conditioners
Finally, we provide support for conditioning based on joint text and audio embeddings through
the `JointEmbeddingConditioner` class and the `CLAPEmbeddingConditioner`, which implements such
a conditioning method relying on a [pretrained CLAP model](https://github.com/LAION-AI/CLAP).
## Classifier Free Guidance
We provide a Classifier Free Guidance implementation in AudioCraft. With the classifier free
guidance dropout, all attributes are dropped with the same probability.
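
A minimal sketch of this kind of dropout is shown below. It assumes that a single random draw decides whether every attribute is dropped at once and that a dropped attribute is represented by `None`; `cfg_dropout` and its arguments are hypothetical names, not the AudioCraft implementation.

```python
import random
from typing import Dict, Optional


def cfg_dropout(
    attributes: Dict[str, Optional[str]],
    p: float,
    rng: random.Random,
) -> Dict[str, Optional[str]]:
    """With probability p, null out every attribute together; otherwise keep them all."""
    if rng.random() < p:
        return {name: None for name in attributes}
    return dict(attributes)


rng = random.Random(0)
sample = {"description": "warm lo-fi beat", "genre": "hip-hop"}
dropped = cfg_dropout(sample, p=1.0, rng=rng)  # always drops everything
kept = cfg_dropout(sample, p=0.0, rng=rng)     # never drops anything
```

Training with such fully unconditioned examples is what lets the model produce both conditional and unconditional predictions at inference time.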
## Attribute Dropout
We further provide an attribute dropout strategy. Unlike the classifier free guidance dropout,
the attribute dropout drops specific attributes, each with a defined probability, so that the model
learns not to expect all conditioning signals to be provided at once.
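
The per-attribute behaviour can be sketched as follows. The function name, the use of `None` for a dropped attribute, and the default probability of 0 for unlisted attributes are assumptions made for this example.

```python
import random
from typing import Dict, Optional


def attribute_dropout(
    attributes: Dict[str, Optional[str]],
    p: Dict[str, float],
    rng: random.Random,
) -> Dict[str, Optional[str]]:
    """Drop each attribute independently, using its own probability from `p`."""
    return {
        name: (None if rng.random() < p.get(name, 0.0) else value)
        for name, value in attributes.items()
    }


rng = random.Random(0)
sample = {"description": "warm lo-fi beat", "genre": "hip-hop", "mood": "calm"}
# "genre" is always dropped, "description" never; "mood" is unlisted, so it is kept.
result = attribute_dropout(sample, p={"genre": 1.0, "description": 0.0}, rng=rng)
```

Because each attribute gets its own draw, a sample can end up with any subset of its conditioning signals, which is not possible when all attributes are dropped together.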
## Faster computation of conditions
Conditioners that require heavy computation on the waveform can be cached, in particular
the `ChromaStemConditioner` and the `CLAPEmbeddingConditioner`. You just need to provide the
`cache_path` parameter to them. We recommend running dummy jobs to fill up the cache quickly.
An example is provided in the [musicgen.musicgen_melody_32khz grid](../audiocraft/grids/musicgen/musicgen_melody_32khz.py).