PAST: Phonetic-Acoustic Speech Tokenizer

Authors: Nadav Har-Tuv, Or Tal, Yossi Adi
Affiliation: The Hebrew University of Jerusalem

📄 Paper PDF | 🌐 Project Page | 💻 Code

Schematic of the PAST pipeline. The auxiliary heads use the output of the first vector quantization module as input.

Abstract

We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses the evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation.

Samples

Audio samples are available on our project demo page.

Model List

Model Variant | Description
PAST | Full PAST model trained on LibriSpeech + TIMIT
PAST_streamable | Streamable, causal variant with 20 ms look-ahead

Usage

Pre-requisites

Install

conda create -n past_env python=3.10 -y
conda activate past_env
pip install git+https://github.com/slp-rl/PAST.git

Clone

git clone https://github.com/slp-rl/PAST.git
cd PAST
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt

Inference

# ---------------
# load PAST model
# ---------------

from past.models.past_model import PastModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PastModel.from_pretrained("PAST", device=device)  # one of ['PAST', 'PAST_streamable']


# ----------------------------------------------------------------------
# Run on audio: PAST expects a batched input format [Batch, Channels, T]
# ----------------------------------------------------------------------
import torchaudio

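def read_one_wav(path, target_sr):
    """Load an audio file as a mono tensor of shape [1, 1, T] at target_sr."""
    wav, sr = torchaudio.load(path)  # wav: [Channels, T]
    if sr != target_sr:
        wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
    if wav.shape[0] > 1:
        wav = wav[:1]  # keep only the first channel of multi-channel audio
    return wav.unsqueeze(0)  # add the batch dimension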

wav = read_one_wav("path/to/audio.wav", model.sample_rate).to(device)

with torch.no_grad():
    codes, scale = model.encode(wav)
    reconstructed = model.decode(codes, scale)

Evaluation

See Eval README


Results (from the paper)

Phonetic Information

Tokenizer | PNMI ↑ | ABX Within ↓ | ABX Across ↓ | WER Clean ↓ | WER Other ↓
D. HuBERT 500 | 0.67 | 3.91 | 4.73 | 11.3 | 24.7
SpeechTokenizer | 0.72 | 3.43 | 4.50 | 18.5 | 41.3
X-Codec | 0.40 | 9.42 | 12.6 | 17.1 | 37.1
PAST | 0.75 | 2.82 | 3.54 | 15.7 | 36.8
PAST - Streamable | 0.74 | 3.05 | 3.89 | 14.3 | 32.3
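
For readers unfamiliar with the PNMI column, the following is a minimal sketch of phone-normalized mutual information, I(phone; unit) / H(phone), computed from frame-aligned phone labels and token IDs. It is not the authors' evaluation code (see the Eval README for that); the function name `pnmi` and the toy inputs are hypothetical, and the sketch assumes the two sequences are already aligned frame by frame.

```python
import math
from collections import Counter


def pnmi(phones, units):
    """Phone-normalized mutual information: I(phone; unit) / H(phone).

    `phones` and `units` are equal-length sequences holding, for each
    frame, the reference phone label and the discrete token ID.
    (Hypothetical helper for illustration only.)
    """
    if len(phones) != len(units) or not phones:
        raise ValueError("phones and units must be non-empty and equal length")

    n = len(phones)
    joint = Counter(zip(phones, units))  # counts of (phone, unit) pairs
    phone_counts = Counter(phones)
    unit_counts = Counter(units)

    # Entropy of the phone distribution, H(phone).
    h_phone = -sum((c / n) * math.log(c / n) for c in phone_counts.values())
    if h_phone == 0.0:
        raise ValueError("PNMI is undefined when only one phone class occurs")

    # Mutual information: sum of p(y, z) * log(p(y, z) / (p(y) * p(z))).
    mi = sum(
        (c / n) * math.log((c * n) / (phone_counts[y] * unit_counts[z]))
        for (y, z), c in joint.items()
    )
    return mi / h_phone
```

A value of 1 means the token IDs determine the phone label exactly, while 0 means the tokens carry no information about the phone.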

Reconstruction Quality

Tokenizer | SISNR ↑ | VISQOL ↑ | PESQ ↑
EnCodec | 7.49 | 4.48 | 3.88
SpeechTokenizer | 0.44 | 4.38 | 3.15
X-Codec | -7.12 | 4.46 | 3.33
PAST | 4.84 | 4.40 | 3.55
PAST - Streamable | 3.90 | 4.37 | 3.40
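
As a rough illustration of the SISNR column, the sketch below computes the scale-invariant signal-to-noise ratio (in dB) between a reference waveform and its reconstruction using only the standard library. It is an assumption-laden sketch, not the evaluation code behind the reported numbers: the function name `si_snr` is hypothetical, inputs are plain lists of mono samples of equal length, and `eps` is an arbitrary stabilizer.

```python
import math


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length mono signals.

    (Hypothetical helper for illustration only.)
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("signals must be non-empty and equal length")

    n = len(reference)
    # Remove the mean so that a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Project the estimate onto the reference to get the "target" component.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything not explained by the scaled reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the reconstruction leaves the score unchanged, which is why the metric can be negative for a reconstruction that sounds fine but is poorly time-aligned with the input.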

Speech Language Modeling (sWUGGY)

Tokenizer | sWUGGY ↑ Inter | sWUGGY ↑ OOV
EnCodec | 56.3 | 53.7
D. HuBERT 500 | 67.9 | 55.4
SpeechTokenizer | 63.7 | 55.6
X-Codec | 55.1 | 52.9
PAST | 71.8 | 57.5
PAST - Streamable | 70.2 | 56.3
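
sWUGGY is a spot-the-word task: a speech language model scores a spoken real word and a matched non-word, and a pair counts as correct when the real word receives the higher score. The sketch below shows how such pairwise scores could be turned into an accuracy percentage. It is a simplified illustration, not the benchmark's official scorer: the function name `swuggy_accuracy` and the example scores are hypothetical, and treating ties as errors is an assumption.

```python
def swuggy_accuracy(pair_scores):
    """Percentage of pairs where the real word outscores the non-word.

    `pair_scores` is an iterable of (word_score, nonword_score) tuples,
    e.g. log-likelihoods assigned by a speech language model.
    Ties are counted as errors (an assumption of this sketch).
    """
    pairs = list(pair_scores)
    if not pairs:
        raise ValueError("pair_scores must contain at least one pair")
    correct = sum(1 for word, nonword in pairs if word > nonword)
    return 100.0 * correct / len(pairs)


# Made-up scores for four word / non-word pairs.
example_pairs = [(-1.0, -2.0), (-3.0, -1.0), (-0.5, -0.7), (-2.0, -2.5)]
example_accuracy = swuggy_accuracy(example_pairs)
```

Under this scoring, chance level is 50, which is the baseline against which the table values should be read.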

Citation

If you use PAST in your work, please cite:

@article{har2025past,
    title={Past: Phonetic-acoustic speech tokenizer},
    author={Har-Tuv, Nadav and Tal, Or and Adi, Yossi},
    journal={arXiv preprint arXiv:2505.14470},
    year={2025}
}