
Update model_index.json to use GPT2LMHeadModel rather than GPT2Model

#5
by Mekadrom - opened

With the latest transformers and diffusers libraries, this model does not work when loaded via AudioLDM2Pipeline.from_pretrained('cvssp/audioldm2'); it fails with the following error messages:

Expected types for language_model: (<class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>,), got <class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>.
Traceback (most recent call last):
  File "/home/dev/projects/megatransformer/inference/audio_generate_gr.py", line 48, in generate_audio
    audio = pipe(
  File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py", line 995, in __call__
    prompt_embeds, attention_mask, generated_prompt_embeds = self.encode_prompt(
  File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py", line 518, in encode_prompt
    generated_prompt_embeds = self.generate_language_model(
  File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py", line 316, in generate_language_model
    model_kwargs = self.language_model._get_initial_cache_position(inputs_embeds, model_kwargs)
  File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
    raise AttributeError(
AttributeError: 'GPT2Model' object has no attribute '_get_initial_cache_position'
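
The first warning points at the root cause: model_index.json maps language_model to GPT2Model, while the pipeline type-checks for GPT2LMHeadModel. The missing _get_initial_cache_position helper is provided by transformers' GenerationMixin, which GPT2LMHeadModel inherits but the bare GPT2Model backbone does not, hence the AttributeError. In sketch form, the change this PR makes (assuming the usual (library, class) pair format of model_index.json entries):

  "language_model": [
    "transformers",
-   "GPT2Model"
+   "GPT2LMHeadModel"
  ],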

After switching language_model from GPT2Model to GPT2LMHeadModel, the following minimal reproduction script works:

from diffusers import AudioLDM2Pipeline

import torch
import torchaudio
import numpy as np
import os

os.makedirs(os.path.join('inference', 'generated'), exist_ok=True)

model_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipe.to(device)

def generate_audio(
    prompt,
    negative_prompt="",
    num_inference_steps=200,
    audio_length_in_s=5.0,
    guidance_scale=3.5,
    seed=None
):
    """
    Generate audio from text prompt using AudioLDM 2

    Args:
        prompt: Text description of the desired audio
        negative_prompt: Text describing what to avoid in generation
        num_inference_steps: Number of denoising steps (higher = better quality, slower generation)
        audio_length_in_s: Duration of generated audio in seconds
        guidance_scale: How closely to follow the prompt (higher = more faithful but less diverse)
        seed: Random seed for reproducibility

    Returns:
        Path to generated audio file and raw waveform
    """
    try:
        if seed is not None:
            generator = torch.Generator(device=device).manual_seed(int(seed))
        else:
            generator = None

        audio = pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_inference_steps,
            audio_length_in_s=audio_length_in_s,
            guidance_scale=guidance_scale,
            generator=generator,
        ).audios[0]

        sample_rate = 16000  # AudioLDM 2 generates 16 kHz audio

        filename = os.path.join('inference', 'generated', f"{prompt[:20].replace(' ', '_')}_{seed if seed is not None else 'random'}.wav")
        audio_normalized = audio / np.max(np.abs(audio))  # peak-normalize to [-1, 1]
        torchaudio.save(
            filename,
            torch.tensor(audio_normalized).unsqueeze(0),
            sample_rate
        )
        return filename, audio
    except Exception as e:
        raise RuntimeError(f"Error generating audio: {e}") from e

if __name__ == '__main__':
    generate_audio("A man speaking clearly about the weather forecast", "", 200, 5.0, 3.5, 42)

If this change is undesirable for any reason, please let me know. Here is my environment info (via diffusers-cli env) for reproducibility, if needed:

- πŸ€— Diffusers version: 0.34.0.dev0 (built from source)
- Platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.52.0.dev0 (built from source)
- Accelerate version: 1.5.1
- PEFT version: 0.14.0
- Bitsandbytes version: 0.45.2
- Safetensors version: 0.5.3
- xFormers version: 0.0.29.post2
- Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no

This PR is a duplicate of a previously closed PR, reopened to fix a typo in the commit message (Hugging Face doesn't allow force pushes on PR branches created from the web interface).

