
SALMONN Inference Server

A ready-to-use inference server for SALMONN (Speech Audio Language Music Open Neural Network), a multimodal LLM that understands speech, audio events, and music.

Features

  • Audio transcription
  • Question answering about audio content
  • Audio description and analysis
  • FastAPI server with REST API
  • Simple Python API

Requirements

  • GPU: NVIDIA GPU with 24GB+ VRAM (L4, A100, RTX 4090, etc.)
  • CUDA: 11.8 or higher
  • Python: 3.10+
  • Storage: ~25GB for model checkpoints
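Before installing, it can be worth confirming the machine actually meets the 24GB VRAM requirement. A minimal sketch, assuming the NVIDIA driver (and therefore `nvidia-smi`) is installed; this helper is not part of the repo:

```python
import shutil
import subprocess

def gpu_vram_gib():
    """Return total VRAM of GPU 0 in GiB, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return None
    # nvidia-smi reports memory in MiB; convert to GiB
    return int(out.stdout.splitlines()[0]) / 1024

vram = gpu_vram_gib()
if vram is None:
    print("No NVIDIA GPU detected")
elif vram < 24:
    print(f"Only {vram:.0f} GiB VRAM; SALMONN needs 24 GiB+")
```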

Quick Start

1. Clone and Install

git clone https://huggingface.co/marcosremar2/salmonn-inference
cd salmonn-inference

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Download Models

./install.sh

This downloads (~20GB):

  • Vicuna 7B v1.5 (LLM backbone)
  • Whisper Large v2 (speech encoder)
  • BEATs (audio encoder)
  • SALMONN checkpoint (adapter weights)

3. Start Server

python server.py

Server runs at http://localhost:8000
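Since the model takes roughly 20 seconds to load, scripts that call the server right after starting it may want to poll the /health endpoint first. A small readiness-wait sketch, assuming /health answers with HTTP 200 once the model is loaded:

```python
import time
import urllib.request
import urllib.error

def probe(url="http://localhost:8000/health"):
    """Return True if the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_server(check=probe, timeout=120.0, interval=2.0):
    """Poll `check` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```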

API Usage

Transcribe Audio

curl -X POST "http://localhost:8000/transcribe" \
  -F "audio=@your_audio.wav"

Ask Questions

curl -X POST "http://localhost:8000/chat" \
  -F "audio=@your_audio.wav" \
  -F "question=What is being said in this audio?"

Python API

from inference import SALMONNInference

model = SALMONNInference()
model.load()

# Transcribe
text = model.transcribe("audio.wav")

# Ask questions
answer = model.chat("audio.wav", "What language is being spoken?")

# Describe audio
description = model.describe("audio.wav")

API Endpoints

Endpoint     Method  Description
-----------  ------  -------------------------
/            GET     API info
/health      GET     Health check
/transcribe  POST    Transcribe audio to text
/chat        POST    Ask questions about audio
/describe    POST    Get audio description

Configuration

Edit config.yaml to customize:

model:
  device: "cuda:0"  # GPU device

server:
  host: "0.0.0.0"
  port: 8000

generation:
  max_new_tokens: 200
  temperature: 1.0
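In practice, overrides from config.yaml would be layered onto built-in defaults (loading the file itself is typically `yaml.safe_load` via PyYAML). A sketch of that merge, with defaults mirroring the sample values above; the helper is illustrative, not the repo's actual loader:

```python
# Defaults mirroring the sample config.yaml above.
DEFAULTS = {
    "model": {"device": "cuda:0"},
    "server": {"host": "0.0.0.0", "port": 8000},
    "generation": {"max_new_tokens": 200, "temperature": 1.0},
}

def merge_config(overrides, defaults=DEFAULTS):
    """Recursively overlay user settings onto the defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(value, merged[key])
        else:
            merged[key] = value
    return merged

# e.g. run on a second GPU with shorter outputs:
cfg = merge_config({"model": {"device": "cuda:1"},
                    "generation": {"max_new_tokens": 64}})
```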

Performance

Tested on NVIDIA L4 (24GB):

Metric               Value
-------------------  -------
Model load time      ~20 s
Audio encode         ~250 ms
Time to first token  ~150 ms
Tokens/second        ~18
GPU memory           ~16 GB
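Taken together, these figures give a rough end-to-end latency estimate: audio encode, plus time to first token, plus the remaining tokens at the steady decode rate. A back-of-the-envelope sketch, assuming the table's numbers hold:

```python
def estimated_latency_s(n_tokens, encode_ms=250, ttft_ms=150, tok_per_s=18):
    """Rough end-to-end latency: audio encode + time to first token
    + remaining tokens at the steady decode rate (figures from the table)."""
    decode_s = max(n_tokens - 1, 0) / tok_per_s
    return (encode_ms + ttft_ms) / 1000 + decode_s

# A 200-token answer (the default max_new_tokens):
print(f"{estimated_latency_s(200):.1f} s")  # ~11.5 s
```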

Important Note

This repository uses Vicuna 7B v1.5 (not v1.1). The original SALMONN checkpoint was trained with v1.5, and using v1.1 will result in broken outputs (<unk> tokens).

License

  • SALMONN: Apache 2.0
  • Vicuna: Llama 2 Community License
  • Whisper: MIT

Credits
