
SALMONN Inference Server

A ready-to-use inference server for SALMONN (Speech Audio Language Music Open Neural Network), a multimodal LLM that understands speech, audio events, and music.

Features

  • Audio transcription
  • Question answering about audio content
  • Audio description and analysis
  • FastAPI server with REST API
  • Simple Python API

Requirements

  • GPU: NVIDIA GPU with 24GB+ VRAM (L4, A100, RTX 4090, etc.)
  • CUDA: 11.8 or higher
  • Python: 3.10+
  • Storage: ~25GB for model checkpoints
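Before installing, it can be worth confirming the machine actually meets the 24GB VRAM requirement. A minimal sketch, assuming the NVIDIA driver (and therefore `nvidia-smi`) is installed; this helper is not part of the repo:

```python
import shutil
import subprocess

def gpu_vram_gib():
    """Return total VRAM of GPU 0 in GiB, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return None
    # nvidia-smi reports memory in MiB; convert to GiB
    return int(out.stdout.splitlines()[0]) / 1024

vram = gpu_vram_gib()
if vram is None:
    print("No NVIDIA GPU detected")
elif vram < 24:
    print(f"Only {vram:.0f} GiB VRAM; SALMONN needs 24 GiB+")
```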

Quick Start

1. Clone and Install

git clone https://huggingface.co/marcosremar2/salmonn-inference
cd salmonn-inference

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Download Models

./install.sh

This downloads (~20GB):

  • Vicuna 7B v1.5 (LLM backbone)
  • Whisper Large v2 (speech encoder)
  • BEATs (audio encoder)
  • SALMONN checkpoint (adapter weights)

3. Start Server

python server.py

Server runs at http://localhost:8000
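Since the model takes roughly 20 seconds to load, scripts that call the server right after starting it may want to poll the /health endpoint first. A small readiness-wait sketch, assuming /health answers with HTTP 200 once the model is loaded:

```python
import time
import urllib.request
import urllib.error

def probe(url="http://localhost:8000/health"):
    """Return True if the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_server(check=probe, timeout=120.0, interval=2.0):
    """Poll `check` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```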

API Usage

Transcribe Audio

curl -X POST "http://localhost:8000/transcribe" \
  -F "audio=@your_audio.wav"

Ask Questions

curl -X POST "http://localhost:8000/chat" \
  -F "audio=@your_audio.wav" \
  -F "question=What is being said in this audio?"

Python API

from inference import SALMONNInference

model = SALMONNInference()
model.load()

# Transcribe
text = model.transcribe("audio.wav")

# Ask questions
answer = model.chat("audio.wav", "What language is being spoken?")

# Describe audio
description = model.describe("audio.wav")

API Endpoints

Endpoint     Method  Description
-----------  ------  -------------------------
/            GET     API info
/health      GET     Health check
/transcribe  POST    Transcribe audio to text
/chat        POST    Ask questions about audio
/describe    POST    Get audio description

Configuration

Edit config.yaml to customize:

model:
  device: "cuda:0"  # GPU device

server:
  host: "0.0.0.0"
  port: 8000

generation:
  max_new_tokens: 200
  temperature: 1.0
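In practice, overrides from config.yaml would be layered onto built-in defaults (loading the file itself is typically `yaml.safe_load` via PyYAML). A sketch of that merge, with defaults mirroring the sample values above; the helper is illustrative, not the repo's actual loader:

```python
# Defaults mirroring the sample config.yaml above.
DEFAULTS = {
    "model": {"device": "cuda:0"},
    "server": {"host": "0.0.0.0", "port": 8000},
    "generation": {"max_new_tokens": 200, "temperature": 1.0},
}

def merge_config(overrides, defaults=DEFAULTS):
    """Recursively overlay user settings onto the defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(value, merged[key])
        else:
            merged[key] = value
    return merged

# e.g. run on a second GPU with shorter outputs:
cfg = merge_config({"model": {"device": "cuda:1"},
                    "generation": {"max_new_tokens": 64}})
```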

Performance

Tested on NVIDIA L4 (24GB):

Metric               Value
-------------------  -------
Model load time      ~20 s
Audio encode         ~250 ms
Time to first token  ~150 ms
Tokens/second        ~18
GPU memory           ~16 GB
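Taken together, these figures give a rough end-to-end latency estimate: audio encode, plus time to first token, plus the remaining tokens at the steady decode rate. A back-of-the-envelope sketch, assuming the table's numbers hold:

```python
def estimated_latency_s(n_tokens, encode_ms=250, ttft_ms=150, tok_per_s=18):
    """Rough end-to-end latency: audio encode + time to first token
    + remaining tokens at the steady decode rate (figures from the table)."""
    decode_s = max(n_tokens - 1, 0) / tok_per_s
    return (encode_ms + ttft_ms) / 1000 + decode_s

# A 200-token answer (the default max_new_tokens):
print(f"{estimated_latency_s(200):.1f} s")  # ~11.5 s
```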

Important Note

This repository uses Vicuna 7B v1.5 (not v1.1). The original SALMONN checkpoint was trained with v1.5, and using v1.1 will result in broken outputs (<unk> tokens).

License

  • SALMONN: Apache 2.0
  • Vicuna: Llama 2 Community License
  • Whisper: MIT

Credits
