PAST: Phonetic-Acoustic Speech Tokenizer
Authors: Nadav Har-Tuv, Or Tal, Yossi Adi
Affiliation: The Hebrew University of Jerusalem
Paper PDF | Project Page | Code
Abstract
We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation.
Samples
Audio samples are available on our project demo page.
Model List
Model | Variant | Description |
---|---|---|
PAST | Full | PAST model trained on LibriSpeech + TIMIT |
PAST_streamable | Streamable | Causal variant with 20ms look-ahead |
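Both checkpoints load through the same interface; the name in the Model column is the identifier passed to PastModel.from_pretrained. A minimal sketch (the full pipeline appears under Inference below):

# Minimal sketch: load either variant by the name listed in the table above.
from past.models.past_model import PastModel

model = PastModel.from_pretrained("PAST_streamable", device="cpu")  # or "PAST"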
Usage
Installation
Option 1: Install from GitHub with pip
conda create -n past_env python=3.10 -y
conda activate past_env
pip install git+https://github.com/slp-rl/PAST.git
Option 2: Clone the repository and install locally
git clone https://github.com/slp-rl/PAST.git
cd PAST
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt
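Whichever option you use, a quick import check confirms the environment is set up. This is just a sanity sketch, not part of the official instructions:

# Sanity check: the package should import cleanly inside past_env.
from past.models.past_model import PastModel
print("PAST import OK:", PastModel.__name__)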
Inference
# ---------------
# load PAST model
# ---------------
from past.models.past_model import PastModel
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = PastModel.from_pretrained("PAST", device=device) # one of ['PAST', 'PAST_streamable']
# ----------------------------------------------------------------------
# Run on audio: PAST expects a batched input format [Batch, Channels, T]
# ----------------------------------------------------------------------
import torchaudio
def read_one_wav(path, target_sr):
    wav, sr = torchaudio.load(path)
    # Resample to the model's expected sample rate if needed
    if sr != target_sr:
        wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
    # Keep a single channel (drop the second channel of stereo input)
    if wav.shape[0] == 2:
        wav = wav[:1]
    # Add a batch dimension: [Channels, T] -> [1, Channels, T]
    return wav.unsqueeze(0)
wav = read_one_wav("path/to/audio.wav", model.sample_rate).to(device)
with torch.no_grad():
    codes, scale = model.encode(wav)
    reconstructed = model.decode(codes, scale)
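Continuing the example above, a minimal sketch of what you might do with the outputs: inspect the token tensor and write the reconstruction back to disk with torchaudio. The exact codebook count in codes depends on the checkpoint, so treat the shape comment as indicative rather than definitive.

# codes holds the discrete tokens, typically shaped [Batch, n_codebooks, n_frames]
# (the exact codebook count depends on the checkpoint).
print("codes:", codes.shape, codes.dtype)

# decode returns a batched waveform [Batch, Channels, T]; drop the batch
# dimension and move to CPU before saving.
torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), model.sample_rate)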
Evaluation
See Eval README
Results (from the paper)
Phonetic Information
Tokenizer | PNMI ↑ | ABX ↓ Within | ABX ↓ Across | WER ↓ Clean | WER ↓ Other |
---|---|---|---|---|---|
D. HuBERT 500 | 0.67 | 3.91 | 4.73 | 11.3 | 24.7 |
SpeechTokenizer | 0.72 | 3.43 | 4.50 | 18.5 | 41.3 |
X-Codec | 0.40 | 9.42 | 12.6 | 17.1 | 37.1 |
PAST | 0.75 | 2.82 | 3.54 | 15.7 | 36.8 |
PAST - Streamable | 0.74 | 3.05 | 3.89 | 14.3 | 32.3 |
Reconstruction Quality
Tokenizer | SISNR ↑ | VISQOL ↑ | PESQ ↑ |
---|---|---|---|
EnCodec | 7.49 | 4.48 | 3.88 |
SpeechTokenizer | 0.44 | 4.38 | 3.15 |
X-Codec | -7.12 | 4.46 | 3.33 |
PAST | 4.84 | 4.40 | 3.55 |
PAST - Streamable | 3.90 | 4.37 | 3.40 |
Speech Language Modeling (sWUGGY)
Tokenizer | sWUGGY ↑ Inter | sWUGGY ↑ OOV |
---|---|---|
EnCodec | 56.3 | 53.7 |
D. HuBERT 500 | 67.9 | 55.4 |
SpeechTokenizer | 63.7 | 55.6 |
X-Codec | 55.1 | 52.9 |
PAST | 71.8 | 57.5 |
PAST - Streamable | 70.2 | 56.3 |
Citation
If you use PAST in your work, please cite:
@article{har2025past,
  title={Past: Phonetic-acoustic speech tokenizer},
  author={Har-Tuv, Nadav and Tal, Or and Adi, Yossi},
  journal={arXiv preprint arXiv:2505.14470},
  year={2025}
}