Liquid AI

LFM2-Audio-1.5B

LFM2-Audio-1.5B is Liquid AI's first end-to-end audio foundation model. Designed with low latency and real-time conversation in mind, LFM2-Audio enables seamless conversational interaction at only 1.5 billion parameters, achieving capabilities on par with much larger models.

LFM2-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components. Our model consists of a pretrained LFM2 model as its multimodal backbone, a FastConformer-based audio encoder to handle continuous audio inputs, and an RQ-transformer that generates discrete Mimi tokens as audio output.

LFM2-Audio supports two distinct generation routines, each suited to a different set of tasks. Interleaved generation enables real-time speech-to-speech conversational chatbot capabilities, where audio generation latency is key. Sequential generation is suited to non-conversational tasks such as ASR or TTS, and allows the model to switch the generated modality on the fly.

📄 Model details

| Property | Value |
|---|---|
| Parameters (LM only) | 1.2B |
| Audio encoder | FastConformer (115M, canary-180m-flash) |
| Backbone layers | hybrid conv+attention |
| Audio tokenizer | Mimi, using 8 codebooks |
| Context | 32,768 tokens |
| Vocab size | 65,536 (text) / 2,049 × 8 (audio) |
| Precision | bfloat16 |
| License | LFM Open License v1.0 |

Supported languages: English
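
Once the liquid-audio package below is installed, the sizes above can be sanity-checked directly. This is a minimal sketch that assumes LFM2AudioModel behaves like a standard torch.nn.Module, as the generation examples further down suggest:

from liquid_audio import LFM2AudioModel

# Count parameters across the backbone, audio encoder, and audio head.
model = LFM2AudioModel.from_pretrained("LiquidAI/LFM2-Audio-1.5B").eval()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 1.5B in total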

πŸƒ How to run LFM2-Audio

Install the liquid-audio package via pip

pip install liquid-audio
pip install "liquid-audio[demo]"  # optional, to install demo dependencies
pip install flash-attn --no-build-isolation  # optional, to use Flash Attention 2; falls back to torch SDPA if not installed
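
To check which attention implementation will be used, the minimal snippet below only tests whether the optional flash-attn package is importable; per the note above, the model otherwise falls back to torch SDPA:

import importlib.util

if importlib.util.find_spec("flash_attn") is not None:
    print("flash-attn found: Flash Attention 2 can be used")
else:
    print("flash-attn not found: using torch SDPA")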

Gradio demo

The simplest way to get started is by running the Gradio demo interface. After installation, run the command

liquid-audio-demo

This starts a web server on port 7860; the interface can then be accessed at http://localhost:7860/.

Multi-turn, multi-modal chat

The liquid-audio package provides a lower-level interface to the model and generation routines, ideal for custom use cases. We demonstrate this with a simple multi-turn chat, where the first turn is given as audio and the second turn is given as text.

For multi-turn chat with text and audio output, we use interleaved generation. The system prompt should be set to "Respond with interleaved text and audio." Here we use audio as the first user turn, and text as the second.

import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
HF_REPO = "LiquidAI/LFM2-Audio-1.5B"

processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

# Set up inputs for the model
chat = ChatState(processor)

chat.new_turn("system")
chat.add_text("Respond with interleaved text and audio.")
chat.end_turn()

chat.new_turn("user")
wav, sampling_rate = torchaudio.load("assets/question.wav")
chat.add_audio(wav, sampling_rate)
chat.end_turn()

chat.new_turn("assistant")

# Generate text and audio tokens.
text_out: list[torch.Tensor] = []
audio_out: list[torch.Tensor] = []
modality_out: list[LFMModality] = []
for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperature=1.0, audio_top_k=4):
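    # Each yielded tensor is either a single text token id (numel == 1) or one
    # audio step holding 8 Mimi codebook indices.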
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
        text_out.append(t)
        modality_out.append(LFMModality.TEXT)
    else:
        audio_out.append(t)
        modality_out.append(LFMModality.AUDIO_OUT)

# output: Sure! How about "Handcrafted Woodworking, Precision Made for You"? Another option could be "Quality Woodworking, Quality Results." If you want something more personal, you might try "Your Woodworking Needs, Our Expertise."

# Detokenize audio, removing the last "end-of-audio" codes
# Mimi returns audio at 24kHz
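# Stack the per-step codes along dim 1 and add a leading batch dimension before decoding.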
mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
with torch.no_grad():
    waveform = processor.mimi.decode(mimi_codes)[0]
torchaudio.save("answer1.wav", waveform.cpu(), 24_000)

# Append newly generated tokens to chat history
chat.append(
    text = torch.stack(text_out, 1),
    audio_out = torch.stack(audio_out, 1),
    modality_flag = torch.tensor(modality_out),
)
chat.end_turn()

# Start new turn
chat.new_turn("user")
chat.add_text("My business specialized in chairs, can you give me something related to that?")
chat.end_turn()

chat.new_turn("assistant")

# Generate second turn text and audio tokens.
audio_out: list[torch.Tensor] = []
for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperature=1.0, audio_top_k=4):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
    else:
        audio_out.append(t)

# output: Sure thing! How about "Comfortable Chairs, Crafted with Care" or "Elegant Seats, Handcrafted for You"? Let me know if you'd like a few more options.

# Detokenize second turn audio, removing the last "end-of-audio" codes
mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
with torch.no_grad():
    waveform = processor.mimi.decode(mimi_codes)[0]
torchaudio.save("answer2.wav", waveform.cpu(), 24_000)

ASR, TTS, additional information

Please visit the liquid-audio package repository for additional examples and sample audio snippets.
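
The repository also covers sequential generation for ASR and TTS. As a rough sketch only, an ASR-style call could look like the following; the generate_sequential method name, its arguments, and the transcription prompt are assumptions here, so consult the repository for the exact API and recommended prompts:

import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState

HF_REPO = "LiquidAI/LFM2-Audio-1.5B"
processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

chat = ChatState(processor)
chat.new_turn("system")
chat.add_text("Transcribe the user audio into text.")  # assumed prompt wording
chat.end_turn()

chat.new_turn("user")
wav, sampling_rate = torchaudio.load("assets/question.wav")
chat.add_audio(wav, sampling_rate)
chat.end_turn()

chat.new_turn("assistant")

# Assumed API: like generate_interleaved, generate_sequential yields one token
# tensor at a time, but stays in a single output modality.
pieces: list[str] = []
for t in model.generate_sequential(**chat, max_new_tokens=256):
    if t.numel() == 1:  # text token
        pieces.append(processor.text.decode(t))
print("".join(pieces))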

📈 Performance

VoiceBench (audio input)

Higher is better. AlpacaEval, CommonEval and WildVoice are scored out of 5.

Model Components & Size AlpacaEval CommonEval WildVoice SD-QA MMSU OBQA BBH IFEval ADVBench Overall
LFM2-Audio-1.5B 1.5B parameters 3.71 3.49 3.17 30.56 31.95 44.40 30.54 98.85 67.33 56.78
Moshi 7B parameters 2.01 1.60 1.30 15.64 24.04 25.93 47.40 10.12 44.23 29.51
Qwen2.5-Omni-3B 5B parameters 3.72 3.51 3.42 44.94 55.29 76.26 61.30 32.90 88.46 63.57
Mini-Omni2 0.6B parameters 2.32 2.18 1.79 9.31 24.27 26.59 46.40 11.56 57.50 33.49

ASR

Word Error Rate (WER), lower is better.

Model Components & Size Audio output Open AMI GigaSpeech LibriSpeech-clean LibriSpeech-other TED-LIUM Average
LFM2-Audio-1.5B 1.5B parameters Yes Yes 15.58 10.67 2.01 4.39 3.56 7.24
Qwen2.5-Omni-3B 5B parameters Yes Yes 15.95 10.02 2.01 3.91 3.86 7.15
Whisper-large-V3 1.5B parameters No β€” ASR only Yes 16.73 10.76 2.73 5.54 3.91 7.93
elevenlabs/scribe_v1 unknown No β€” ASR only No 14.43 9.66 1.79 3.31 3.17 6.47
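
For context on the metric: WER counts word-level substitutions, deletions, and insertions against a reference transcript and divides by the number of reference words. The snippet below is a minimal, self-contained illustration of the definition, not the evaluation harness used for the table above:

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.17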

📬 Contact

If you are interested in custom solutions with edge deployment, please contact our sales team.

License

The code in the package repository and the associated weights are licensed under the LFM Open License v1.0.

The code for the audio encoder is based on NVIDIA NeMo, licensed under Apache 2.0, and on the canary-180m-flash checkpoint, licensed under CC-BY-4.0. To simplify dependency resolution, we also ship the Python code of Kyutai Mimi, licensed under the MIT License, and we redistribute the weights for Kyutai Mimi, licensed under CC-BY-4.0.
