# AudioLLM
AudioLLM is a multimodal model that pairs Whisper's audio encoder with LLaMA's text generation, so it can understand audio input and respond in text.
## Model Details
- Model Type: AudioLLM
- Base LLM Model: meta-llama/Llama-3.2-3B-Instruct
- Audio Encoder: openai/whisper-large-v3-turbo
- License: MIT
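The exact bridging mechanism between the two base models is not documented in this card, but a common pattern for encoder-plus-LLM architectures of this kind is to project the Whisper encoder's hidden states into the LLM's embedding space and prepend them to the text token embeddings. The sketch below illustrates that idea; the `AudioProjector` class, the single linear layer, and the dimensions (1280 for whisper-large-v3-turbo, 3072 for Llama-3.2-3B, 1500 encoder frames per 30-second window) are illustrative assumptions, not the model's confirmed implementation:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Hypothetical bridge: maps Whisper encoder states into LLaMA's embedding space."""
    def __init__(self, d_audio: int = 1280, d_text: int = 3072):
        super().__init__()
        self.proj = nn.Linear(d_audio, d_text)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, n_frames, d_audio) from the Whisper encoder
        return self.proj(audio_states)

projector = AudioProjector()
encoder_out = torch.randn(1, 1500, 1280)   # Whisper-large emits ~1500 frames per 30 s window
audio_embeds = projector(encoder_out)      # (1, 1500, 3072), ready to prepend to text embeddings
print(audio_embeds.shape)
```

A single linear layer is the simplest possible bridge; implementations in this family often use a small MLP instead, or downsample the audio frames before projection to shorten the LLM's input sequence.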
## Usage
This model integrates with the standard Hugging Face Pipeline API:
```python
import torch
from transformers import pipeline

# Load the pipeline
audio_llm = pipeline(
    "text-generation",
    model="cdreetz/audio-llama-hf",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Process an audio file
result = audio_llm("path/to/audio.wav")
print(result[0]["generated_text"])

# Process audio with a custom prompt
result = audio_llm(("path/to/audio.wav", "Describe the music in this audio:"))
print(result[0]["generated_text"])

# Text-only generation
result = audio_llm("Write a poem about sound:")
print(result[0]["generated_text"])
```
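Standard `text-generation` pipelines forward generation keyword arguments such as `max_new_tokens` to the underlying model. Assuming this pipeline does the same (the card does not confirm it), decoding can be controlled like this:

```python
# Assumption: the pipeline forwards standard generation kwargs to the LLM
result = audio_llm(
    ("path/to/audio.wav", "Summarize what is said in this clip:"),
    max_new_tokens=128,  # cap the length of the generated reply
    do_sample=False,     # greedy decoding for reproducible output
)
print(result[0]["generated_text"])
```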
See `example.py` for more advanced usage examples.
## Limitations
- Maximum audio length is limited to 30 seconds; longer clips should be trimmed beforehand (see the sketch after this list)
- Audio quality may affect performance
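Because input is capped at 30 seconds, it is safer to trim long clips yourself than to rely on silent truncation. Whisper-family encoders consume 16 kHz mono audio, so resampling at the same time costs nothing extra. A minimal sketch using `librosa` and `soundfile` (the library choice is an assumption; any resampling tool works):

```python
import librosa
import soundfile as sf

MAX_SECONDS = 30
TARGET_SR = 16_000  # Whisper-family encoders expect 16 kHz mono audio

# librosa loads as mono float32 and resamples to the requested rate
audio, sr = librosa.load("path/to/long_audio.wav", sr=TARGET_SR, mono=True)
audio = audio[: MAX_SECONDS * TARGET_SR]  # keep only the first 30 seconds

sf.write("path/to/audio_trimmed.wav", audio, TARGET_SR)
# audio_trimmed.wav can now be passed to the pipeline as shown above
```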
## Credits
This model combines the architecture of OpenAI's Whisper for audio understanding and Meta's LLaMA for text generation.