AudioLLM

AudioLLM is a multimodal model that combines Whisper's audio encoder with LLaMA's text generation, allowing it to take audio as input and respond with text.

Model Details

  • Model Type: AudioLLM
  • Base LLM Model: meta-llama/Llama-3.2-3B-Instruct
  • Audio Encoder: openai/whisper-large-v3-turbo
  • License: MIT
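
The model card does not describe how the two components are connected. A common pattern for encoder-plus-LLM models, sketched below, is to project the audio encoder's hidden states into the LLM's embedding space; the projection layer and helper function here are illustrative assumptions, not the confirmed implementation of this repository.

import torch
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM

# Load the two published components named above.
audio_encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").encoder
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Hypothetical bridge: map Whisper hidden states to LLaMA's embedding width.
projector = nn.Linear(audio_encoder.config.d_model, llm.config.hidden_size)

def embed_audio(input_features: torch.Tensor) -> torch.Tensor:
    # input_features: log-mel features produced by a WhisperProcessor
    audio_states = audio_encoder(input_features).last_hidden_state
    return projector(audio_states)  # shape: (batch, frames, llm hidden size)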

Usage

This model integrates with the standard Hugging Face Pipeline API:

import torch
from transformers import pipeline

# Load the pipeline
audio_llm = pipeline(
    "text-generation",
    model="cdreetz/audio-llama-hf",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Process audio file
result = audio_llm("path/to/audio.wav")
print(result[0]["generated_text"])

# Process audio with custom prompt
result = audio_llm(("path/to/audio.wav", "Describe the music in this audio:"))
print(result[0]["generated_text"])

# Text-only generation
result = audio_llm("Write a poem about sound:")
print(result[0]["generated_text"])
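
The pipeline should also accept the usual text-generation keyword arguments. The sketch below assumes this custom pipeline forwards them to the underlying generate() call the way a standard text-generation pipeline does; the prompt and values are illustrative.

# Reuses the audio_llm pipeline created above; generation kwargs are assumed
# to be forwarded to generate().
result = audio_llm(
    ("path/to/audio.wav", "Summarize what is said in this clip:"),
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])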

See example.py for more advanced usage examples.

Limitations

  • Audio input is limited to 30 seconds per clip; longer recordings need to be truncated or split first (see the sketch below this list)
  • Noisy or low-quality audio may degrade the quality of the generated text
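
For clips longer than 30 seconds, one option is to truncate the audio before passing it to the pipeline. A minimal sketch using librosa and soundfile (assumed to be installed; they are not stated requirements of this model):

import librosa
import soundfile as sf

# Keep only the first 30 seconds and resample to 16 kHz (Whisper's expected rate).
audio, sr = librosa.load("path/to/long_audio.wav", sr=16000, duration=30.0)
sf.write("path/to/audio_30s.wav", audio, sr)

# Pass the truncated file to the pipeline from the Usage section.
result = audio_llm("path/to/audio_30s.wav")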

Credits

This model combines OpenAI's Whisper architecture for audio understanding with Meta's LLaMA for text generation.
