# AudioLLM
AudioLLM is a multimodal model that pairs Whisper's audio encoder with LLaMA's text generation, so it can understand audio input and respond in text.
## Model Details
- Model Type: AudioLLM
- Base LLM Model: meta-llama/Llama-3.2-3B-Instruct
- Audio Encoder: openai/whisper-large-v3-turbo
- License: MIT
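The exact bridging mechanism between the two base models is not documented in this card, but a common pattern for encoder-plus-LLM architectures of this kind is to project the Whisper encoder's hidden states into the LLM's embedding space and prepend them to the text token embeddings. The sketch below illustrates that idea; the `AudioProjector` class, the single linear layer, and the dimensions (1280 for whisper-large-v3-turbo, 3072 for Llama-3.2-3B, 1500 encoder frames per 30-second window) are illustrative assumptions, not the model's confirmed implementation:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Hypothetical bridge: maps Whisper encoder states into LLaMA's embedding space."""
    def __init__(self, d_audio: int = 1280, d_text: int = 3072):
        super().__init__()
        self.proj = nn.Linear(d_audio, d_text)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, n_frames, d_audio) from the Whisper encoder
        return self.proj(audio_states)

projector = AudioProjector()
encoder_out = torch.randn(1, 1500, 1280)   # Whisper-large emits ~1500 frames per 30 s window
audio_embeds = projector(encoder_out)      # (1, 1500, 3072), ready to prepend to text embeddings
print(audio_embeds.shape)
```

A single linear layer is the simplest possible bridge; implementations in this family often use a small MLP instead, or downsample the audio frames before projection to shorten the LLM's input sequence.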
## Usage
This model integrates with the standard Hugging Face Pipeline API:
```python
import torch
from transformers import pipeline

# Load the pipeline
audio_llm = pipeline(
    "text-generation",
    model="cdreetz/audio-llama-hf",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Process an audio file
result = audio_llm("path/to/audio.wav")
print(result[0]["generated_text"])

# Process audio with a custom prompt
result = audio_llm(("path/to/audio.wav", "Describe the music in this audio:"))
print(result[0]["generated_text"])

# Text-only generation
result = audio_llm("Write a poem about sound:")
print(result[0]["generated_text"])
```
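Standard `text-generation` pipelines forward generation keyword arguments such as `max_new_tokens` to the underlying model. Assuming this pipeline does the same (the card does not confirm it), decoding can be controlled like this:

```python
# Assumption: the pipeline forwards standard generation kwargs to the LLM
result = audio_llm(
    ("path/to/audio.wav", "Summarize what is said in this clip:"),
    max_new_tokens=128,  # cap the length of the generated reply
    do_sample=False,     # greedy decoding for reproducible output
)
print(result[0]["generated_text"])
```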
See `example.py` for more advanced usage examples.
## Limitations
- Maximum audio length is limited to 30 seconds; longer clips should be trimmed beforehand (see the sketch after this list)
- Audio quality may affect performance
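Because input is capped at 30 seconds, it is safer to trim long clips yourself than to rely on silent truncation. Whisper-family encoders consume 16 kHz mono audio, so resampling at the same time costs nothing extra. A minimal sketch using `librosa` and `soundfile` (the library choice is an assumption; any resampling tool works):

```python
import librosa
import soundfile as sf

MAX_SECONDS = 30
TARGET_SR = 16_000  # Whisper-family encoders expect 16 kHz mono audio

# librosa loads as mono float32 and resamples to the requested rate
audio, sr = librosa.load("path/to/long_audio.wav", sr=TARGET_SR, mono=True)
audio = audio[: MAX_SECONDS * TARGET_SR]  # keep only the first 30 seconds

sf.write("path/to/audio_trimmed.wav", audio, TARGET_SR)
# audio_trimmed.wav can now be passed to the pipeline as shown above
```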
## Credits
This model combines the architecture of OpenAI's Whisper for audio understanding and Meta's LLaMA for text generation.