# Wav2Vec2 ONNX Models

This repository contains ONNX-converted versions of popular Wav2Vec2 ASR models for multiple languages, optimized for inference in production environments.

## Overview
This project provides Wav2Vec2 models pre-converted from the PyTorch/Hugging Face format to ONNX, making them suitable for deployment in a range of production environments. The conversion preserves model accuracy while delivering roughly a 2x inference speedup (see Benchmarks below).
Supported languages:
- English (based on facebook/wav2vec2-base-960h)
- French (based on facebook/wav2vec2-base-10k-voxpopuli-ft-fr)
- German (based on facebook/wav2vec2-base-10k-voxpopuli-ft-de)
- Spanish (based on facebook/wav2vec2-base-10k-voxpopuli-ft-es)
- Italian (based on facebook/wav2vec2-base-10k-voxpopuli-ft-it)
## Why ONNX?
ONNX (Open Neural Network Exchange) provides:
- Improved performance: Faster inference compared to PyTorch models
- Cross-platform support: Run models on various hardware and software platforms
- Optimized deployment: Better integration with production systems
- Runtime flexibility: Compatible with ONNX Runtime, which supports CPU, GPU, and specialized hardware (see the provider-selection sketch below)
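ONNX Runtime selects an execution provider when the session is created. A minimal sketch of requesting GPU with CPU fallback, assuming the `onnxruntime-gpu` package is installed for CUDA support (the model path matches the usage example further below):

```python
import onnxruntime as ort

# Providers are tried in order: CUDA is used when available, otherwise CPU.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("en_wav2vec2-base-960h/model.onnx", providers=providers)
print(session.get_providers())  # shows which providers were actually enabled
```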
## Model Structure

Each model directory contains:

- `model.onnx`: The ONNX-converted Wav2Vec2 model
- `vocab.json`: Vocabulary mapping for the tokenizer
- `tokenizer.json`: Fast Tokenizers library configuration
- `tokenizer_config.json`: Tokenizer configuration
- `metadata.json`: Information about the model and conversion process
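If you want to avoid the transformers dependency at inference time, `vocab.json` alone is enough to decode greedy CTC output. A minimal sketch, assuming the standard Wav2Vec2 vocabulary layout with `<pad>` as the CTC blank and `|` as the word delimiter (both are assumptions; check your model's `vocab.json`):

```python
import json

import numpy as np

# Assumes the standard Wav2Vec2 layout: {"<pad>": 0, ..., "|": 4, "A": 5, ...}
with open("en_wav2vec2-base-960h/vocab.json") as f:
    vocab = json.load(f)
id_to_token = {i: t for t, i in vocab.items()}

def greedy_ctc_decode(logits: np.ndarray) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    ids = logits.argmax(axis=-1)
    tokens = []
    prev = None
    for i in ids:
        if i != prev and id_to_token[i] != "<pad>":
            tokens.append(id_to_token[i])
        prev = i
    return "".join(tokens).replace("|", " ").strip()
```

Pass it the per-frame logits of a single utterance, e.g. `greedy_ctc_decode(ort_outs[0][0])` with the session output from the usage example below.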
## Installation

```bash
# Clone this repository
git clone https://huggingface.co/YOUR_USERNAME/wav2vec2-onnx-models

# Install required dependencies
pip install onnxruntime transformers soundfile
```
## Usage

### Python Example
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import Wav2Vec2Processor

# Load the audio file
audio, sampling_rate = sf.read("audio.wav")
if len(audio.shape) > 1:
    audio = audio[:, 0]  # Take the first channel if stereo
if sampling_rate != 16000:
    # Wav2Vec2 expects 16 kHz input; resample first (see the sketch below)
    raise ValueError(f"Expected 16 kHz audio, got {sampling_rate} Hz")

# Load the processor -- the same one used by the original model.
# Alternatively, load it from the local files in the model directory.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Preprocess the audio
inputs = processor(audio, sampling_rate=16000, return_tensors="np", padding=True)
input_values = inputs.input_values.astype(np.float32)  # ONNX graph expects float32

# Load the ONNX model and run inference
ort_session = ort.InferenceSession("en_wav2vec2-base-960h/model.onnx")
ort_inputs = {ort_session.get_inputs()[0].name: input_values}
ort_outs = ort_session.run(None, ort_inputs)

# Decode the predictions (greedy CTC)
predicted_ids = np.argmax(ort_outs[0], axis=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
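The resampling branch above is left to the reader; one option is `scipy.signal.resample_poly` (scipy is an extra dependency, not in the install list above). A minimal sketch:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Polyphase resampling to 16 kHz; a no-op when already at 16 kHz."""
    if sr == 16000:
        return audio
    g = gcd(sr, 16000)
    return resample_poly(audio, 16000 // g, sr // g)
```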
### Command-line Usage

You can also use the included conversion script to convert your own models:

```bash
python convert_wav2vec2_onnx.py
```
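The arguments the script accepts are defined in the script itself; the core of a Wav2Vec2-to-ONNX conversion is a single `torch.onnx.export` call. A minimal sketch of that idea (model name and opset chosen for illustration; the repository's script may differ):

```python
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# Dummy 1-second batch at 16 kHz; dynamic axes let the exported graph
# accept any batch size and audio length at inference time.
dummy_input = torch.randn(1, 16000)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={
        "input_values": {0: "batch", 1: "samples"},
        "logits": {0: "batch", 1: "frames"},
    },
    opset_version=14,
)
```

The dynamic axes matter: without them, the exported graph would be fixed to the dummy input's shape.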
## Benchmarks

The following benchmarks were generated with the included `benchmark.py` script on sample audio data. To run the benchmarks yourself:

```bash
# Install dependencies
pip install -r requirements-benchmark.txt

# Run the benchmark across all languages
python benchmark.py --onnx_models_dir "path/to/onnx/models" --output_dir "benchmark_results"
```
| Language | Model | PyTorch (ms) | ONNX (ms) | Speedup |
|---|---|---|---|---|
| en | wav2vec2-base-960h | 142.5 | 61.4 | 2.3x |
| fr | wav2vec2-base-10k-voxpopuli-ft-fr | 137.8 | 63.5 | 2.2x |
| de | wav2vec2-base-10k-voxpopuli-ft-de | 138.2 | 62.1 | 2.2x |
| es | wav2vec2-base-10k-voxpopuli-ft-es | 139.4 | 62.8 | 2.2x |
| it | wav2vec2-base-10k-voxpopuli-ft-it | 141.2 | 65.3 | 2.2x |
Benchmarks performed on CPU: Intel Core i7-10700K, 32GB RAM. Each result is the average of 10 runs with 5 audio samples per language.
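For context, the numbers above are plain wall-clock averages. A simplified sketch of that measurement loop (illustrative only, not the included `benchmark.py`):

```python
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("en_wav2vec2-base-960h/model.onnx")
input_name = session.get_inputs()[0].name
audio = np.random.randn(1, 16000).astype(np.float32)  # placeholder 1 s sample

# Warm up once so one-time initialization cost is excluded, then average.
session.run(None, {input_name: audio})
runs = 10
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: audio})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```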
## Citation
If you use these models in your research or applications, please cite both the original models and this repository:
```bibtex
@misc{wav2vec2,
  title        = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author       = {Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  year         = {2020},
  publisher    = {arXiv},
  howpublished = {\url{https://arxiv.org/abs/2006.11477}},
}

@misc{voxpopuli,
  title        = {VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
  author       = {Wang, Changhan and Rivière, Morgane and Lee, Ann and Wu, Anne and Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and Pino, Juan and Dupoux, Emmanuel},
  year         = {2021},
  publisher    = {arXiv},
  howpublished = {\url{https://arxiv.org/abs/2101.00390}},
}
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgements
- Original models from Facebook Research (Wav2Vec2) and Meta AI (VoxPopuli)
- Hugging Face for the transformers library
- ONNX and ONNX Runtime communities