Model Card for UniSS

Model Details

Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency. It currently supports English and Chinese. The released checkpoint has roughly 2B parameters (BF16) and is fine-tuned from Qwen/Qwen2.5-1.5B.

Model Sources

Repository: https://github.com/cmots/UniSS
Paper: https://arxiv.org/abs/2509.21144

Quick Start

  1. Install the environment and get the code
conda create -n uniss python=3.10.16
conda activate uniss
git clone https://github.com/cmots/UniSS.git
cd UniSS
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
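
Optionally, you can sanity-check the installation before proceeding (a minimal snippet; UniSS also runs on CPU, just more slowly):

import torch
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
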
  2. Download the weights

The UniSS weights are hosted on Hugging Face. Download them with the provided script:

python download_weight.py

or via git clone (skip this step if you have already downloaded the weights with the Python script):

mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/cmots/UniSS pretrained_models/UniSS
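
Alternatively, the huggingface_hub client can fetch the full repository; a minimal sketch using its standard snapshot_download API (assumes pip install huggingface_hub):

from huggingface_hub import snapshot_download

# Fetch every file from the cmots/UniSS repo into pretrained_models/UniSS
snapshot_download(repo_id="cmots/UniSS", local_dir="pretrained_models/UniSS")
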
  3. Run the code
import soundfile
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from uniss import UniSSTokenizer, process_input, process_output

# 1. Set the device, wav path, model path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

wav_path = "prompt_audio.wav"
model_path = "pretrained_models/UniSS"

# 2. Set the mode and target language
mode = 'Quality'    # 'Quality' or 'Performance'
tgt_lang = "<|eng|>"    # for English output
# tgt_lang = "<|cmn|>"  # for Chinese output

# 3. Load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

speech_tokenizer = UniSSTokenizer.from_pretrained(model_path, device=device)

# 4. Extract speech tokens from the input audio
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

# 5. Process the input
input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# 6. Translate the speech
output = model.generate(
    input_token_ids,
    max_new_tokens=1500,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1
)

# 7. Decode the output
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

# 8. Process the output
audio, translation, transcription = process_output(output_text[0], input_text, speech_tokenizer, mode, device)

# 9. Save and show the results
soundfile.write("output_audio.wav", audio, 16000)

if mode == 'Quality':
    print("Transcription:\n", transcription)
print("Translation:\n", translation)

More examples and details are available in our GitHub repo: https://github.com/cmots/UniSS.

Citation

If you find our paper and code useful in your research, please consider liking the model and citing our work.

@misc{cheng2025uniss_s2st,
      title={UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice}, 
      author={Sitong Cheng and Weizhen Bian and Xinsheng Wang and Ruibin Yuan and Jianyi Chen and Shunshun Yin and Yike Guo and Wei Xue},
      year={2025},
      eprint={2509.21144},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.21144}, 
}