# Wav2Vec2 ONNX Models

This repository contains ONNX-converted versions of popular Wav2Vec2 ASR models for multiple languages, optimized for inference in production environments.

## Overview
This project provides Wav2Vec2 models pre-converted from the PyTorch/Hugging Face format to ONNX, making them suitable for deployment in a range of production environments. The conversion preserves model accuracy while delivering roughly a 2x inference speedup (see Benchmarks below).
Supported languages:
- English (based on facebook/wav2vec2-base-960h)
- French (based on facebook/wav2vec2-base-10k-voxpopuli-ft-fr)
- German (based on facebook/wav2vec2-base-10k-voxpopuli-ft-de)
- Spanish (based on facebook/wav2vec2-base-10k-voxpopuli-ft-es)
- Italian (based on facebook/wav2vec2-base-10k-voxpopuli-ft-it)
## Why ONNX?
ONNX (Open Neural Network Exchange) provides:
- Improved performance: Faster inference compared to PyTorch models
- Cross-platform support: Run models on various hardware and software platforms
- Optimized deployment: Better integration with production systems
- Runtime flexibility: Compatible with ONNX Runtime, which supports CPU, GPU, and specialized hardware (see the provider-selection sketch below)
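ONNX Runtime selects an execution provider when the session is created. A minimal sketch of requesting GPU with CPU fallback, assuming the `onnxruntime-gpu` package is installed for CUDA support (the model path matches the usage example further below):

```python
import onnxruntime as ort

# Providers are tried in order: CUDA is used when available, otherwise CPU.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("en_wav2vec2-base-960h/model.onnx", providers=providers)
print(session.get_providers())  # shows which providers were actually enabled
```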
## Model Structure

Each model directory contains:

- `model.onnx`: The ONNX-converted Wav2Vec2 model
- `vocab.json`: Vocabulary mapping for the tokenizer
- `tokenizer.json`: Fast Tokenizers library configuration
- `tokenizer_config.json`: Tokenizer configuration
- `metadata.json`: Information about the model and conversion process
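If you want to avoid the transformers dependency at inference time, `vocab.json` alone is enough to decode greedy CTC output. A minimal sketch, assuming the standard Wav2Vec2 vocabulary layout with `<pad>` as the CTC blank and `|` as the word delimiter (both are assumptions; check your model's `vocab.json`):

```python
import json

import numpy as np

# Assumes the standard Wav2Vec2 layout: {"<pad>": 0, ..., "|": 4, "A": 5, ...}
with open("en_wav2vec2-base-960h/vocab.json") as f:
    vocab = json.load(f)
id_to_token = {i: t for t, i in vocab.items()}

def greedy_ctc_decode(logits: np.ndarray) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    ids = logits.argmax(axis=-1)
    tokens = []
    prev = None
    for i in ids:
        if i != prev and id_to_token[i] != "<pad>":
            tokens.append(id_to_token[i])
        prev = i
    return "".join(tokens).replace("|", " ").strip()
```

Pass it the per-frame logits of a single utterance, e.g. `greedy_ctc_decode(ort_outs[0][0])` with the session output from the usage example below.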
## Installation

```bash
# Clone this repository
git clone https://huggingface.co/YOUR_USERNAME/wav2vec2-onnx-models

# Install required dependencies
pip install onnxruntime transformers soundfile
```
## Usage

### Python Example
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import Wav2Vec2Processor

# Load the audio file
audio, sampling_rate = sf.read("audio.wav")
if len(audio.shape) > 1:
    audio = audio[:, 0]  # Take the first channel if stereo
if sampling_rate != 16000:
    # Wav2Vec2 expects 16 kHz input; resample first (see the sketch below)
    raise ValueError(f"Expected 16 kHz audio, got {sampling_rate} Hz")

# Load the processor -- the same one used by the original model.
# Alternatively, load it from the local files in the model directory.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Preprocess the audio
inputs = processor(audio, sampling_rate=16000, return_tensors="np", padding=True)
input_values = inputs.input_values.astype(np.float32)  # ONNX graph expects float32

# Load the ONNX model and run inference
ort_session = ort.InferenceSession("en_wav2vec2-base-960h/model.onnx")
ort_inputs = {ort_session.get_inputs()[0].name: input_values}
ort_outs = ort_session.run(None, ort_inputs)

# Decode the predictions (greedy CTC)
predicted_ids = np.argmax(ort_outs[0], axis=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
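The resampling branch above is left to the reader; one option is `scipy.signal.resample_poly` (scipy is an extra dependency, not in the install list above). A minimal sketch:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Polyphase resampling to 16 kHz; a no-op when already at 16 kHz."""
    if sr == 16000:
        return audio
    g = gcd(sr, 16000)
    return resample_poly(audio, 16000 // g, sr // g)
```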
### Command-line Usage

You can also use the included conversion script to convert your own models:

```bash
python convert_wav2vec2_onnx.py
```
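The arguments the script accepts are defined in the script itself; the core of a Wav2Vec2-to-ONNX conversion is a single `torch.onnx.export` call. A minimal sketch of that idea (model name and opset chosen for illustration; the repository's script may differ):

```python
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# Dummy 1-second batch at 16 kHz; dynamic axes let the exported graph
# accept any batch size and audio length at inference time.
dummy_input = torch.randn(1, 16000)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={
        "input_values": {0: "batch", 1: "samples"},
        "logits": {0: "batch", 1: "frames"},
    },
    opset_version=14,
)
```

The dynamic axes matter: without them, the exported graph would be fixed to the dummy input's shape.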
## Benchmarks

The following benchmarks were generated with the included `benchmark.py` script on sample audio data. To run the benchmarks yourself:

```bash
# Install dependencies
pip install -r requirements-benchmark.txt

# Run the benchmark across all languages
python benchmark.py --onnx_models_dir "path/to/onnx/models" --output_dir "benchmark_results"
```
| Language | Model | PyTorch (ms) | ONNX (ms) | Speedup |
|---|---|---|---|---|
| en | wav2vec2-base-960h | 142.5 | 61.4 | 2.3x |
| fr | wav2vec2-base-10k-voxpopuli-ft-fr | 137.8 | 63.5 | 2.2x |
| de | wav2vec2-base-10k-voxpopuli-ft-de | 138.2 | 62.1 | 2.2x |
| es | wav2vec2-base-10k-voxpopuli-ft-es | 139.4 | 62.8 | 2.2x |
| it | wav2vec2-base-10k-voxpopuli-ft-it | 141.2 | 65.3 | 2.2x |
Benchmarks performed on CPU: Intel Core i7-10700K, 32GB RAM. Each result is the average of 10 runs with 5 audio samples per language.
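For context, the numbers above are plain wall-clock averages. A simplified sketch of that measurement loop (illustrative only, not the included `benchmark.py`):

```python
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("en_wav2vec2-base-960h/model.onnx")
input_name = session.get_inputs()[0].name
audio = np.random.randn(1, 16000).astype(np.float32)  # placeholder 1 s sample

# Warm up once so one-time initialization cost is excluded, then average.
session.run(None, {input_name: audio})
runs = 10
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: audio})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```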
## Citation
If you use these models in your research or applications, please cite both the original models and this repository:
```bibtex
@misc{wav2vec2,
  title        = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author       = {Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  year         = {2020},
  publisher    = {arXiv},
  howpublished = {\url{https://arxiv.org/abs/2006.11477}},
}

@misc{voxpopuli,
  title        = {VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
  author       = {Wang, Changhan and Rivière, Morgane and Lee, Ann and Wu, Anne and Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and Pino, Juan and Dupoux, Emmanuel},
  year         = {2021},
  publisher    = {arXiv},
  howpublished = {\url{https://arxiv.org/abs/2101.00390}},
}
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgements
- Original models from Facebook Research (Wav2Vec2) and Meta AI (VoxPopuli)
- Hugging Face for the transformers library
- ONNX and ONNX Runtime communities