Xcodec2 (Transformers-compatible version)

The X-Codec2 model was proposed in Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis.

X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.

Its architecture is based on X-Codec with several major differences:

  • Unified Semantic-Acoustic Tokenization: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
  • Single-Stage Vector Quantization (VQ): Unlike the multi-layer residual VQ used in most approaches (e.g., X-Codec, DAC, EnCodec), X-Codec2 uses a single-layer Finite Scalar Quantization (FSQ) codebook for stability and compatibility with causal, autoregressive LLMs (see the sketch after this list).
  • Semantic Supervision During Training: It adds a semantic reconstruction loss, ensuring that the discrete tokens preserve meaningful linguistic and emotional information — crucial for TTS tasks.
  • Transformer-Friendly Design: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility.
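
To make the single-layer FSQ idea concrete, here is a minimal, hypothetical sketch of finite scalar quantization in PyTorch. It is not the actual Xcodec2 implementation; the level counts and tensor shapes are illustrative. Each latent dimension is bounded and rounded onto a small grid, so one code per frame replaces a stack of residual codebooks:

>>> import torch

>>> def fsq_quantize(z, levels):
...     """Round each latent dimension of z (batch, frames, dims) onto a small grid."""
...     half = torch.tensor([(l - 1) / 2 for l in levels], dtype=z.dtype)
...     bounded = torch.tanh(z) * half  # squash each dimension into [-half, half]
...     quantized = torch.round(bounded)  # snap to the integer grid
...     return bounded + (quantized - bounded).detach()  # straight-through estimator

>>> z = torch.randn(1, 10, 4)  # toy latents: 10 frames, 4 dims
>>> codes = fsq_quantize(z, levels=[5, 5, 5, 5])  # illustrative level counts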

Usage example

Here is a quick example of how to encode and decode audio with this model:

>>> import torch
>>> from datasets import Audio, load_dataset
>>> from transformers import AutoFeatureExtractor, Xcodec2Model

>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"

>>> # load model and feature extractor
>>> model_id = "bezzam/xcodec2"
>>> model = Xcodec2Model.from_pretrained(model_id).to(torch_device).eval()
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

>>> # load data
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio = dataset[0]["audio"]["array"]

>>> # prepare data
>>> inputs = feature_extractor(raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(torch_device)

>>> # encode and decode
>>> audio_codes = model.encode(inputs["input_values"]).audio_codes
>>> audio_values = model.decode(audio_codes).audio_values
>>> # or the equivalent with a forward pass
>>> model_output = model(inputs["input_values"])
>>> audio_codes = model_output.audio_codes
>>> audio_values = model_output.audio_values
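
To listen to the result, the decoded waveform can be written to disk, for example with the soundfile library (an assumption; any audio writer works):

>>> # save the reconstructed audio (assumes `soundfile` is installed)
>>> import soundfile as sf
>>> sf.write("reconstruction.wav", audio_values[0].squeeze().detach().cpu().numpy(), feature_extractor.sampling_rate)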

This model was contributed by Steven Zheng and Eric Bezzam. The original code can be found here, and the original checkpoints here.
