Update model_index.json to use GPT2LMHeadModel rather than GPT2Model
#5 opened by Mekadrom
On the latest transformers and diffusers libraries, this model does not work when loaded and invoked via AudioLDM2Pipeline.from_pretrained('cvssp/audioldm2'); it fails with the following error messages:
Expected types for language_model: (<class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>,), got <class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>.
Traceback (most recent call last):
File "/home/dev/projects/megatransformer/inference/audio_generate_gr.py", line 48, in generate_audio
audio = pipe(
File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py", line 995, in __call__
prompt_embeds, attention_mask, generated_prompt_embeds = self.encode_prompt(
File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py", line 518, in encode_prompt
generated_prompt_embeds = self.generate_language_model(
File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py", line 316, in generate_language_model
model_kwargs = self.language_model._get_initial_cache_position(inputs_embeds, model_kwargs)
File "/home/dev/projects/megatransformer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
raise AttributeError(
AttributeError: 'GPT2Model' object has no attribute '_get_initial_cache_position'
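The missing _get_initial_cache_position helper comes from transformers' GenerationMixin, which GPT2LMHeadModel provides but the bare GPT2Model does not, so the pipeline needs the LM-head class to run generation. As a sketch of the proposed change (assuming the repo's model_index.json uses the usual [library, class] entry format), the language_model entry would change from:

    "language_model": ["transformers", "GPT2Model"],

to:

    "language_model": ["transformers", "GPT2LMHeadModel"],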
After switching the language_model class from GPT2Model to GPT2LMHeadModel, the following minimal reproduction script works:
from diffusers import AudioLDM2Pipeline
import torch
import torchaudio
import numpy as np
import os
os.makedirs(os.path.join('inference', 'generated'), exist_ok=True)
model_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(model_id)
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipe.to(device)
def generate_audio(
    prompt,
    negative_prompt="",
    num_inference_steps=200,
    audio_length_in_s=5.0,
    guidance_scale=3.5,
    seed=None
):
    """
    Generate audio from a text prompt using AudioLDM 2.

    Args:
        prompt: Text description of the desired audio
        negative_prompt: Text describing what to avoid in generation
        num_inference_steps: Number of denoising steps (higher = better quality, slower generation)
        audio_length_in_s: Duration of the generated audio in seconds
        guidance_scale: How closely to follow the prompt (higher = more faithful but less diverse)
        seed: Random seed for reproducibility

    Returns:
        Path to the generated audio file and the normalized waveform
    """
    try:
        if seed is not None:
            generator = torch.Generator(device=device).manual_seed(int(seed))
        else:
            generator = None
        audio = pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_inference_steps,
            audio_length_in_s=audio_length_in_s,
            guidance_scale=guidance_scale,
            generator=generator,
        ).audios[0]
        sample_rate = 16000
        filename = os.path.join('inference', 'generated', f"{prompt[:20].replace(' ', '_')}_{seed if seed is not None else 'random'}.wav")
        audio_normalized = audio / np.max(np.abs(audio))
        torchaudio.save(
            filename,
            torch.tensor(audio_normalized).unsqueeze(0),
            sample_rate
        )
        # Return the saved path and the normalized waveform, as documented above.
        return filename, audio_normalized
    except Exception as e:
        raise RuntimeError(f"Error generating audio: {str(e)}") from e

if __name__ == '__main__':
    generate_audio("A man speaking clearly about the weather forecast", "", 200, 5.0, 3.5, 42)
If this change is undesirable for any reason, please let me know. Here is my environment info (via diffusers-cli env) for reproducibility, if needed:
- 🤗 Diffusers version: 0.34.0.dev0 (built from source)
- Platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.52.0.dev0 (built from source)
- Accelerate version: 1.5.1
- PEFT version: 0.14.0
- Bitsandbytes version: 0.45.2
- Safetensors version: 0.5.3
- xFormers version: 0.0.29.post2
- Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
This PR is a duplicate of the previously closed one, re-opened to fix a typo in the commit message (Hugging Face doesn't allow force pushes on PR branches created from the web interface).