|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- ru |
|
base_model: |
|
- openai/whisper-large-v3-turbo |
|
pipeline_tag: automatic-speech-recognition |
|
datasets: |
|
- mozilla-foundation/common_voice_17_0 |
|
- bond005/rulibrispeech |
|
- bond005/podlodka_speech |
|
- bond005/sberdevices_golos_10h_crowd |
|
- bond005/sberdevices_golos_100h_farfield |
|
- bond005/taiga_speech_v2 |
|
- bond005/audioset-nonspeech |
|
metrics: |
|
- wer |
|
model-index: |
|
- name: Whisper-Podlodka-Turbo by Ivan Bondarenko |
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Podlodka Speech |
|
type: bond005/podlodka_speech |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 7.81 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Common Voice ru |
|
type: mozilla-foundation/common_voice_11_0 |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 5.22 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sova RuDevices |
|
type: bond005/sova_rudevices |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 15.26 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Russian Librispeech |
|
type: bond005/rulibrispeech |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 9.61 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sberdevices Golos (farfield) |
|
type: bond005/sberdevices_golos_100h_farfield |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 11.26 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sberdevices Golos (crowd) |
|
type: bond005/sberdevices_golos_10h_crowd |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 11.82 |
|
--- |
|
|
|
# Whisper-Podlodka-Turbo |
|
|
|
Whisper-Podlodka-Turbo is a fine-tuned version of [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo). The main goals of the fine-tuning are to improve the quality of speech recognition and speech translation for Russian and English, and to reduce hallucinations when processing non-speech audio signals.
|
|
|
## Model Description |
|
|
|
**Whisper-Podlodka-Turbo** is a fine-tuned version of [Whisper-Large-V3-Turbo](https://huggingface.co/openai/whisper-large-v3-turbo), optimized for high-quality Russian speech recognition with proper punctuation and capitalization, and made more resistant to background noise.
|
|
|
### Key Benefits |
|
|
|
- 🎯 Improved Russian speech recognition quality compared to the base Whisper-Large-V3-Turbo model |
|
- ✍️ Correct Russian punctuation and capitalization |
|
- 🎧 Enhanced background noise resistance |
|
- 🚫 Reduced number of hallucinations, especially in non-speech segments |
|
|
|
### Supported Tasks |
|
|
|
- Automatic Speech Recognition (ASR): |
|
- 🇷🇺 Russian (primary focus) |
|
- 🇬🇧 English |
|
- Speech Translation: |
|
- Russian ↔️ English |
|
- Speech Language Detection (including non-speech detection) |
|
|
|
## Uses |
|
|
|
### Installation |
|
|
|
**Whisper-Podlodka-Turbo** is supported in Hugging Face 🤗 [Transformers](https://huggingface.co/docs/transformers/index). To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
|
|
|
```bash |
|
pip install --upgrade pip |
|
pip install --upgrade transformers datasets[audio] accelerate |
|
``` |
|
|
|
I also recommend using [`whisper-lid`](https://github.com/bond005/whisper-lid) for initial spoken-language detection, so this library is worth installing as well:
|
|
|
```bash |
|
pip install --upgrade whisper-lid |
|
``` |
|
|
|
### Use Cases
|
|
|
#### Speech recognition |
|
|
|
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio in an arbitrary language:
|
|
|
```python |
|
import librosa # for loading sound from local file |
|
from transformers import pipeline # for working with Whisper-Podlodka-Turbo |
|
import wget # for downloading demo sound from its URL |
|
from whisper_lid.whisper_lid import detect_language_in_speech # for spoken language detection |
|
|
|
model_id = "bond005/whisper-podlodka-turbo" # the best Whisper model :-) |
|
target_sampling_rate = 16_000 # Hz |
|
|
|
asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto') |
|
|
|
# An example of speech recognition in Russian, spoken by a native speaker of this language |
|
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav' |
|
sound_ru_name = wget.download(sound_ru_url) |
|
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0] |
|
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format( |
|
sound_ru.shape[0] / target_sampling_rate |
|
)) |
|
detected_languages = detect_language_in_speech( |
|
sound_ru, |
|
asr.feature_extractor, |
|
asr.tokenizer, |
|
asr.model |
|
) |
|
print('Top-3 languages:') |
|
lang_text_width = max([len(it[0]) for it in detected_languages]) |
|
for it in detected_languages[0:3]: |
|
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1])) |
|
recognition_result = asr( |
|
sound_ru, |
|
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
|
|
# An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent |
|
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav' |
|
sound_en_name = wget.download(sound_en_url) |
|
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0] |
|
print('Duration of sound with English speech = {0:.3f} seconds.'.format( |
|
sound_en.shape[0] / target_sampling_rate |
|
)) |
|
detected_languages = detect_language_in_speech( |
|
sound_en, |
|
asr.feature_extractor, |
|
asr.tokenizer, |
|
asr.model |
|
) |
|
print('Top-3 languages:') |
|
lang_text_width = max([len(it[0]) for it in detected_languages]) |
|
for it in detected_languages[0:3]: |
|
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1])) |
|
recognition_result = asr( |
|
sound_en, |
|
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Duration of sound with Russian speech = 29.947 seconds. |
|
Top-3 languages: |
|
russian 0.9568 |
|
english 0.0372 |
|
ukrainian 0.0013 |
|
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном. |
|
|
|
Duration of sound with English speech = 20.247 seconds. |
|
Top-3 languages: |
|
english 0.9526 |
|
russian 0.0311 |
|
polish 0.0006 |
|
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms. |
|
|
|
``` |
|
|
|
#### Speech recognition with timestamps |
|
|
|
In addition to the usual recognition, the model can also provide timestamps for recognized speech fragments: |
|
|
|
```python |
|
recognition_result = asr( |
|
sound_ru, |
|
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},

    return_timestamps=True
|
) |
|
print('Recognized chunks of Russian speech:') |
|
for it in recognition_result['chunks']: |
|
print(f' {it}') |
|
|
|
recognition_result = asr( |
|
sound_en, |
|
    generate_kwargs={'task': 'transcribe', 'language': 'english'},

    return_timestamps=True
|
) |
|
print('\nRecognized chunks of English speech:') |
|
for it in recognition_result['chunks']: |
|
print(f' {it}') |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Recognized chunks of Russian speech: |
|
{'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'} |
|
{'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'} |
|
{'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'} |
|
{'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'} |
|
{'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'} |
|
{'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'} |
|
{'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'} |
|
{'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'} |
|
|
|
Recognized chunks of English speech: |
|
{'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'} |
|
{'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'} |
|
``` |
|
|
|
#### Voice activity detection (speech/non-speech) |
|
|
|
Along with the special language tokens, the model can also return the special token `<|nospeech|>` when the input audio signal does not contain any speech (for details, see section 2.3 of the [Whisper paper](https://arxiv.org/pdf/2212.04356)). This ability forms the basis of a simple speech/non-speech classification algorithm, as demonstrated in the following example:
|
|
|
```python |
|
nonspeech_sound_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_nonspeech.wav' |
|
nonspeech_sound_name = wget.download(nonspeech_sound_url) |
|
nonspeech_sound = librosa.load(nonspeech_sound_name, sr=target_sampling_rate, mono=True)[0] |
|
print('Duration of sound without speech = {0:.3f} seconds.'.format( |
|
nonspeech_sound.shape[0] / target_sampling_rate |
|
)) |
|
detected_languages = detect_language_in_speech( |
|
nonspeech_sound, |
|
asr.feature_extractor, |
|
asr.tokenizer, |
|
asr.model |
|
) |
|
print('Top-3 languages:') |
|
lang_text_width = max([len(it[0]) for it in detected_languages]) |
|
for it in detected_languages[0:3]: |
|
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1])) |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Duration of sound without speech = 10.000 seconds. |
|
Top-3 languages: |
|
NO SPEECH 0.9957 |
|
lingala 0.0002 |
|
cantonese 0.0002 |
|
``` |
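Turning this output into a binary speech/non-speech decision is straightforward. The sketch below is a minimal example, not the evaluation code behind the F1 scores reported later; it reuses the `asr` pipeline, `sound_ru`, and `nonspeech_sound` objects from the previous examples, the `'NO SPEECH'` label spelling follows the sample output above, and the 0.5 probability threshold is an assumption that may need tuning:

```python
def is_speech(sound, asr_pipeline, no_speech_threshold=0.5):
    """Classify a mono 16 kHz signal as speech (True) or non-speech (False)."""
    detected_languages = detect_language_in_speech(
        sound,
        asr_pipeline.feature_extractor,
        asr_pipeline.tokenizer,
        asr_pipeline.model
    )
    top_label, top_probability = detected_languages[0]  # hypotheses are sorted by probability
    return not (top_label == 'NO SPEECH' and top_probability >= no_speech_threshold)


print(is_speech(sound_ru, asr))         # expected: True
print(is_speech(nonspeech_sound, asr))  # expected: False
```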
|
|
|
#### Speech translation |
|
|
|
In addition to the transcription task, the model also performs speech translation (although it translates better from Russian into English than from English into Russian): |
|
|
|
```python |
|
print('Speech translation from Russian to English:')
|
recognition_result = asr( |
|
sound_ru, |
|
generate_kwargs={'task': 'translate', 'language': 'english'}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
|
|
print('Speech translation from English to Russian:')
|
recognition_result = asr( |
|
sound_en, |
|
generate_kwargs={'task': 'translate', 'language': 'russian'}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Speech translation from Russian to English: |
|
Well, Visper, what is Visper? Visper is already a complete end-to-end neural network with an autoregressive decoder. That is, it's not a pure encoder like Wave2Vec, it's not just a text-to-seq encoder-decoder like T5, it's a complete algorithm for the transformation of speech into text, where the encoder considers, first of all, acoustic features of speech, well, and the semantics are also gradually moving, and the decoder is already a language model that generates token by token. |
|
|
|
Speech translation from English to Russian: |
|
Энсемблинг может помочь нам осуществлять хорошо известный торговый байз-вариант. Мы можем ограничить варианты на основе крупного энсембла, крупного энсембла разных алгоритмов. |
|
``` |
|
|
|
As you can see, the speech translation contains errors in both directions; however, the errors are more significant when translating from English into Russian.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- While improvements are observed for English and translation tasks, statistically significant advantages are confirmed only for Russian ASR |
|
- The model's performance on [code-switching speech](https://en.wikipedia.org/wiki/Code-switching) (where speakers alternate between Russian and English within the same utterance) has not been specifically evaluated |
|
- Inherits basic limitations of the Whisper architecture |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on a composite dataset including: |
|
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) (Ru, En) |
|
- [Podlodka Speech](https://huggingface.co/datasets/bond005/podlodka_speech) (Ru) |
|
- [Taiga Speech](https://huggingface.co/datasets/bond005/taiga_speech_v2) (Ru, synthetic) |
|
- [Golos Farfield](https://huggingface.co/datasets/bond005/sberdevices_golos_100h_farfield) and [Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd) (Ru) |
|
- [Sova Rudevices](https://huggingface.co/datasets/bond005/sova_rudevices) (Ru) |
|
- [Audioset](https://huggingface.co/datasets/bond005/audioset-nonspeech) (non-speech audio) |
|
|
|
### Training Features |
|
|
|
**1. Data Augmentation:** |
|
- Dynamic mixing of speech with background noise and music |
|
- Gradual reduction of the signal-to-noise ratio during training (a minimal mixing sketch is shown below)
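The augmentation code itself is not published in this repository; the following is only a minimal sketch of the core idea, mixing a speech signal with a noise signal at a target signal-to-noise ratio. The function name `mix_at_snr` is illustrative, not part of the actual training code:

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so that the mixture has the requested SNR (in dB)."""
    speech_power = float(np.mean(speech ** 2))
    noise_power = float(np.mean(noise ** 2)) + 1e-12  # guard against all-zero noise
    # Scale the noise so that speech_power / scaled_noise_power equals 10^(snr_db / 10).
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    if scaled_noise.shape[0] < speech.shape[0]:  # tile short noise clips
        repeats = int(np.ceil(speech.shape[0] / scaled_noise.shape[0]))
        scaled_noise = np.tile(scaled_noise, repeats)
    return speech + scaled_noise[:speech.shape[0]]
```

Gradually lowering `snr_db` over the course of training yields progressively harder examples.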
|
|
|
**2. Text Data Processing:** |
|
- Russian text punctuation and capitalization restoration using [bond005/ruT5-ASR-large](https://huggingface.co/bond005/ruT5-ASR-large) (for speech sub-corpora without punctuated annotations) |
|
- Parallel Russian-English text generation using [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
|
- Multi-stage validation of generated texts to minimize hallucinations using [bond005/xlm-roberta-xl-hallucination-detector](https://huggingface.co/bond005/xlm-roberta-xl-hallucination-detector) |
|
|
|
**3. Training Strategy:** |
|
- Progressive increase in training example complexity |
|
- Balanced sampling between speech and non-speech data |
|
- Special handling of language tokens and no-speech detection (`<|nospeech|>`) |
|
|
|
## Evaluation |
|
|
|
The experimental evaluation focused on two main tasks: |
|
|
|
1. Russian speech recognition |
|
2. Speech activity detection (binary classification "speech/non-speech") |
|
|
|
Testing was performed on publicly available Russian speech corpora. Speech recognition was conducted using [the standard pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) from the Hugging Face 🤗 [Transformers library](https://huggingface.co/docs/transformers/index). Because this pipeline is limited in language identification and non-speech detection (due to a known bug), the [whisper-lid](https://github.com/bond005/whisper-lid) library was used to detect the presence or absence of speech in the signal.
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
|
|
The quality of the Russian speech recognition task was tested on test sub-sets of six different datasets: |
|
|
|
- [Common Voice 11 Ru](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
|
- [Podlodka Speech](https://huggingface.co/datasets/bond005/podlodka_speech) |
|
- [Golos Farfield](https://huggingface.co/datasets/bond005/sberdevices_golos_100h_farfield) |
|
- [Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd) |
|
- [Sova Rudevices](https://huggingface.co/datasets/bond005/sova_rudevices) |
|
- [Russian Librispeech](https://huggingface.co/datasets/bond005/rulibrispeech) |
|
|
|
The quality of the voice activity detection task was tested on test sub-sets of two different datasets: |
|
|
|
- [noised version of Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd_noised_2db) as a source of speech samples |
|
- [filtered sub-set of Audioset corpus](https://huggingface.co/datasets/bond005/audioset-nonspeech) as a source of non-speech samples |
|
|
|
Noise was added using [a special augmenter](https://github.com/dangrebenkin/audio_augmentator) capable of simulating the superposition of five different types of acoustic noise (reverberation, speech-like sounds, music, household sounds, and pet sounds) at a given signal-to-noise ratio (in this case, a signal-to-noise ratio of 2 dB was used). |
|
|
|
The quality of the *robust* Russian speech recognition task was tested on the test sub-set of the above-mentioned [noised Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd_noised_2db).
|
|
|
#### Metrics |
|
|
|
**1. Modified [WER (Word Error Rate)](https://en.wikipedia.org/wiki/Word_error_rate)** for Russian speech recognition quality: |
|
- Text normalization before WER calculation: |
|
- Unification of numeral representations (digits/words) |
|
- Standardization of foreign words (Cyrillic/Latin scripts) |
|
- Accounting for valid transliteration variants |
|
- Enables a more accurate assessment of recognition quality at the semantic level

- The lower the WER, the better the speech recognition quality (a toy normalization-plus-WER sketch follows below)
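The exact normalizer is not published here. As a toy approximation, the sketch below computes WER with the [`jiwer`](https://github.com/jitsi/jiwer) package after a deliberately simplified normalization that only lowercases and strips punctuation; the real normalization additionally unifies numerals and transliteration variants, as the example strings illustrate:

```python
import re

import jiwer  # pip install jiwer


def normalize(text: str) -> str:
    """Toy normalizer: lowercase and strip punctuation (the real one does more)."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


reference = 'В 2025 году Виспер распознаёт речь.'
hypothesis = 'в две тысячи двадцать пятом году виспер распознаёт речь'
print('{0:.4f}'.format(jiwer.wer(normalize(reference), normalize(hypothesis))))
```

Without unifying the numeral representations, this toy WER penalizes "2025" versus "две тысячи двадцать пятом", which is exactly the kind of mismatch the modified metric is designed to forgive.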
|
|
|
**2. [F1-score](https://en.wikipedia.org/wiki/F-score)** for speech activity detection: |
|
- Binary classification "speech/non-speech" |
|
- Evaluation of non-speech segment detection accuracy using the `<|nospeech|>` token

- The higher the F1-score, the better the voice activity detection quality (a minimal computation sketch follows below)
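For completeness, here is a minimal sketch of the F1 computation over a batch of binary speech/non-speech decisions (for example, those produced by the `is_speech` sketch above), using scikit-learn; both label lists are made-up placeholders:

```python
from sklearn.metrics import f1_score

# 1 = speech, 0 = non-speech; both lists are made-up placeholders
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print('F1 = {0:.4f}'.format(f1_score(y_true, y_pred)))
```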
|
|
|
### Results |
|
|
|
#### Automatic Speech Recognition (ASR) |
|
|
|
*Result (WER, %)*: |
|
|
|
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo | |
|
|----------------------------|--------------------------------|-------------------------------| |
|
| bond005/podlodka_speech | 7.81 | 8.33 | |
|
| rulibrispeech | 9.61 | 10.25 | |
|
| sberdevices_golos_farfield | 11.26 | 20.12 | |
|
| sberdevices_golos_crowd | 11.82 | 14.55 | |
|
| sova_rudevices | 15.26 | 17.70 | |
|
| common_voice_11_0 | 5.22 | 6.63 | |
|
|
|
#### Voice Activity Detection (VAD) |
|
|
|
*Result (F1)*: |
|
|
|
| bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo | |
|
|--------------------------------|-------------------------------| |
|
| 0.9214 | 0.8484 | |
|
|
|
#### Robust ASR (SNR = 2 dB, speech-like noise, music, etc.) |
|
|
|
*Result (WER, %)*: |
|
|
|
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo | |
|
|----------------------------------|--------------------------------|-------------------------------| |
|
| sberdevices_golos_crowd (noised) | 46.14 | 75.20 | |
|
|
|
## Citation |
|
|
|
If you use this model in your work, please cite it as: |
|
|
|
```bibtex |
|
@misc{whisper-podlodka-turbo, |
|
author = {Ivan Bondarenko}, |
|
title = {Whisper-Podlodka-Turbo: Enhanced Whisper Model for Russian ASR}, |
|
year = {2025}, |
|
publisher = {Hugging Face}, |
|
journal = {Hugging Face Model Hub}, |
|
howpublished = {\url{https://huggingface.co/bond005/whisper-podlodka-turbo}} |
|
} |
|
``` |