# SALMONN Inference Server

A ready-to-use inference server for SALMONN (Speech Audio Language Music Open Neural Network), a multimodal LLM that can understand speech, audio events, and music.
## Features
- Audio transcription
- Question answering about audio content
- Audio description and analysis
- FastAPI server with REST API
- Simple Python API
## Requirements
- GPU: NVIDIA GPU with 24GB+ VRAM (L4, A100, RTX 4090, etc.)
- CUDA: 11.8 or higher
- Python: 3.10+
- Storage: ~25GB for model checkpoints
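A quick way to confirm the GPU requirement is met before downloading ~25GB of weights; this is a minimal sketch using PyTorch (which requirements.txt presumably pulls in):

```python
import torch

# Confirm a CUDA GPU is visible and report its VRAM against the 24GB guideline.
assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM, CUDA {torch.version.cuda}")
```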
## Quick Start

### 1. Clone and Install

```bash
git clone https://huggingface.co/marcosremar2/salmonn-inference
cd salmonn-inference

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
### 2. Download Models

```bash
./install.sh
```

This downloads (~20GB):

- Vicuna 7B v1.5 (LLM backbone)
- Whisper Large v2 (speech encoder)
- BEATs (audio encoder)
- SALMONN checkpoint (adapter weights)
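If you need to fetch the two Hugging Face-hosted components by hand, a rough sketch of what install.sh automates is below; the repo IDs are the standard upstream ones, but the target directories are assumptions, so check install.sh for the paths it actually expects (the BEATs and SALMONN checkpoints come from their own release pages and are left to the script):

```python
from huggingface_hub import snapshot_download

# Assumed layout: install.sh may use different directories.
snapshot_download("lmsys/vicuna-7b-v1.5", local_dir="models/vicuna-7b-v1.5")
snapshot_download("openai/whisper-large-v2", local_dir="models/whisper-large-v2")
```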
### 3. Start Server

```bash
python server.py
```

The server runs at http://localhost:8000.
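Before sending audio, you can confirm the model finished loading by hitting the `/health` endpoint listed below; a minimal sketch using `requests` (the response shape is an assumption, adjust to what server.py actually returns):

```python
import requests

# Ask the running server whether the model is loaded and ready.
resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()
print(resp.json())
```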
## API Usage

### Transcribe Audio

```bash
curl -X POST "http://localhost:8000/transcribe" \
  -F "audio=@your_audio.wav"
```

### Ask Questions

```bash
curl -X POST "http://localhost:8000/chat" \
  -F "audio=@your_audio.wav" \
  -F "question=What is being said in this audio?"
```
## Python API

```python
from inference import SALMONNInference

model = SALMONNInference()
model.load()

# Transcribe
text = model.transcribe("audio.wav")

# Ask questions
answer = model.chat("audio.wav", "What language is being spoken?")

# Describe audio
description = model.describe("audio.wav")
```
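Since `load()` is separate from the per-call methods, the ~20s model load is paid once and the instance can be reused across many files; a small batch sketch (the `clips/` directory is illustrative):

```python
from pathlib import Path

from inference import SALMONNInference

model = SALMONNInference()
model.load()  # load weights once, then reuse for every file

# Transcribe every .wav in a directory with the already-loaded model.
for wav in sorted(Path("clips").glob("*.wav")):
    print(wav.name, "->", model.transcribe(str(wav)))
```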
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API info |
| `/health` | GET | Health check |
| `/transcribe` | POST | Transcribe audio to text |
| `/chat` | POST | Ask questions about audio |
| `/describe` | POST | Get audio description |
## Configuration

Edit `config.yaml` to customize:

```yaml
model:
  device: "cuda:0"  # GPU device

server:
  host: "0.0.0.0"
  port: 8000

generation:
  max_new_tokens: 200
  temperature: 1.0
```
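For reference, these values can be read with any standard YAML loader; a minimal sketch using PyYAML (how server.py actually consumes the file is an assumption):

```python
import yaml

# Load config.yaml and pull out the sections shown above.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

device = cfg["model"]["device"]            # e.g. "cuda:0"
host, port = cfg["server"]["host"], cfg["server"]["port"]
gen_kwargs = cfg["generation"]             # max_new_tokens, temperature
print(device, host, port, gen_kwargs)
```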
## Performance

Tested on an NVIDIA L4 (24GB):
| Metric | Value |
|---|---|
| Model Load Time | ~20s |
| Audio Encode | ~250ms |
| Time to First Token | ~150ms |
| Tokens/second | ~18 |
| GPU Memory | ~16GB |
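To get rough end-to-end numbers on your own hardware, you can time a request against the running server; a minimal sketch (this measures whole-request latency, which bundles upload, audio encode, and full generation, so it sits above the per-stage figures in the table):

```python
import time

import requests

# Time one complete /transcribe round trip against the local server.
start = time.perf_counter()
with open("your_audio.wav", "rb") as f:
    r = requests.post("http://localhost:8000/transcribe", files={"audio": f})
r.raise_for_status()
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
```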
## Important Note

This repository uses Vicuna 7B **v1.5** (not v1.1). The original SALMONN checkpoint was trained with v1.5, and using v1.1 will result in broken outputs (`<unk>` tokens).
## License
- SALMONN: Apache 2.0
- Vicuna: Llama 2 Community License
- Whisper: MIT
## Credits

- SALMONN: Tsinghua University & ByteDance (https://github.com/bytedance/SALMONN)
- Vicuna: LMSYS
- Whisper: OpenAI
- BEATs: Microsoft