Xcodec2

Transformers-supported versions of X-Codec models are gathered in the "Xcodec and Xcodec2" collection; the upstream list of available models is at https://github.com/zhenye234/xcodec?tab=readme-ov-file#available-models.
The X-Codec2 model was proposed in the paper "Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis".
X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.
Its architecture is based on X-Codec, with several major differences; most notably, the residual vector quantization (RVQ) stack is replaced by a single FSQ (finite scalar quantization) codebook, so each audio frame is represented by exactly one discrete token.
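To make the single-codebook idea concrete, here is a toy sketch of finite scalar quantization. The per-dimension levels and the fsq_quantize helper below are invented for illustration; this is not the model's actual quantizer:

>>> import torch
>>> levels = torch.tensor([5, 5, 5, 5])  # hypothetical number of levels per latent dimension
>>> def fsq_quantize(z):
...     # bound each dimension, snap it to the nearest allowed level, then read
...     # the rounded vector as one mixed-radix integer: a single token per frame
...     half = (levels - 1) / 2
...     q = torch.round(torch.tanh(z) * half) + half  # per-dim level index in [0, levels - 1]
...     bases = torch.cumprod(torch.cat([torch.ones(1), levels[:-1].float()]), dim=0)
...     return (q * bases).sum(-1).long()
>>> z = torch.randn(3, 4)  # 3 frames of 4-dimensional latents
>>> fsq_quantize(z).shape
torch.Size([3])

One integer per frame makes the codec output look like an ordinary token sequence to a Llama-style model, in contrast to the parallel codebook streams produced by RVQ codecs.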
Here is a quick example of how to encode and decode audio using this model:
>>> import torch
>>> from datasets import Audio, load_dataset
>>> from transformers import AutoFeatureExtractor, Xcodec2Model
>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"
>>> # load model and feature extractor
>>> model_id = "bezzam/xcodec2"
>>> model = Xcodec2Model.from_pretrained(model_id).to(torch_device).eval()
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
>>> # load data
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio = dataset[0]["audio"]["array"]
>>> # prepare data
>>> inputs = feature_extractor(raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(torch_device)
>>> # encode and decode
>>> audio_codes = model.encode(inputs["input_values"]).audio_codes
>>> audio_values = model.decode(audio_codes).audio_values
>>> # or the equivalent with a forward pass
>>> model_output = model(inputs["input_values"])
>>> audio_codes = model_output.audio_codes
>>> audio_values = model_output.audio_values
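To listen to the reconstruction, you can write it to disk. This sketch assumes the soundfile package is installed and that audio_values holds a single mono waveform; adjust the indexing if your output has a different layout:

>>> import soundfile as sf
>>> # move the reconstructed waveform to CPU and drop the batch/channel dimensions
>>> waveform = audio_values[0].squeeze().detach().cpu().numpy()
>>> sf.write("reconstruction.wav", waveform, feature_extractor.sampling_rate)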
This model was contributed by Steven Zheng and Eric Bezzam. The original code can be found here, and the original checkpoints here.