library_name: transformers
license: apache-2.0
language:
- en
- ru
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
datasets:
- mozilla-foundation/common_voice_17_0
- bond005/rulibrispeech
- bond005/podlodka_speech
- bond005/sberdevices_golos_10h_crowd
- bond005/sberdevices_golos_100h_farfield
- bond005/taiga_speech_v2
- bond005/audioset-nonspeech
metrics:
- wer
model-index:
- name: Whisper-Podlodka-Turbo by Ivan Bondarenko
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Podlodka Speech
type: bond005/podlodka_speech
args: ru
metrics:
- name: Test WER
type: wer
value: 7.81
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice ru
type: mozilla-foundation/common_voice_11_0
args: ru
metrics:
- name: Test WER
type: wer
value: 5.22
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Sova RuDevices
type: bond005/sova_rudevices
args: ru
metrics:
- name: Test WER
type: wer
value: 15.26
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Russian Librispeech
type: bond005/rulibrispeech
args: ru
metrics:
- name: Test WER
type: wer
value: 9.61
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Sberdevices Golos (farfield)
type: bond005/sberdevices_golos_100h_farfield
args: ru
metrics:
- name: Test WER
type: wer
value: 11.26
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Sberdevices Golos (crowd)
type: bond005/sberdevices_golos_10h_crowd
args: ru
metrics:
- name: Test WER
type: wer
value: 11.82
Whisper-Podlodka-Turbo
Whisper-Podlodka-Turbo is a fine-tuned version of Whisper large-v3-turbo. The main goal of the fine-tuning is to improve the quality of speech recognition and speech translation for Russian and English, as well as to reduce hallucinations when processing non-speech audio signals.
Model Description
Whisper-Podlodka-Turbo is a fine-tuned version of Whisper-Large-V3-Turbo, optimized for high-quality Russian speech recognition with proper punctuation and capitalization, and enhanced with resistance to background noise.
Key Benefits
- 🎯 Improved Russian speech recognition quality compared to the base Whisper-Large-V3-Turbo model
- ✍️ Correct Russian punctuation and capitalization
- 🎧 Enhanced background noise resistance
- 🚫 Reduced number of hallucinations, especially in non-speech segments
Supported Tasks
- Automatic Speech Recognition (ASR):
- 🇷🇺 Russian (primary focus)
- 🇬🇧 English
- Speech Translation:
- Russian ↔️ English
- Speech Language Detection (including non-speech detection)
Uses
Installation
Whisper-Podlodka-Turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
I also recommend using whisper-lid for initial spoken language detection, so this library is worth installing as well:
pip install --upgrade whisper-lid
Use Cases
Speech recognition
The model can be used with the pipeline class to transcribe audio in an arbitrary language:
import librosa # for loading sound from local file
from transformers import pipeline # for working with Whisper-Podlodka-Turbo
import wget # for downloading demo sound from its URL
from whisper_lid.whisper_lid import detect_language_in_speech # for spoken language detection
model_id = "bond005/whisper-podlodka-turbo" # the best Whisper model :-)
target_sampling_rate = 16_000 # Hz
asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto')
# An example of speech recognition in Russian, spoken by a native speaker of this language
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
sound_ru_name = wget.download(sound_ru_url)
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format(
sound_ru.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
sound_ru,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
# An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav'
sound_en_name = wget.download(sound_en_url)
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with English speech = {0:.3f} seconds.'.format(
sound_en.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
sound_en,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
As a result, you can see a text output like this:
Duration of sound with Russian speech = 29.947 seconds.
Top-3 languages:
russian 0.9568
english 0.0372
ukrainian 0.0013
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном.
Duration of sound with English speech = 20.247 seconds.
Top-3 languages:
english 0.9526
russian 0.0311
polish 0.0006
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms.
Speech recognition with timestamps
In addition to the usual recognition, the model can also provide timestamps for recognized speech fragments:
recognition_result = asr(
    sound_ru,
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=True
)
print('Recognized chunks of Russian speech:')
for it in recognition_result['chunks']:
print(f' {it}')
recognition_result = asr(
    sound_en,
    generate_kwargs={'task': 'transcribe', 'language': 'english'},
    return_timestamps=True
)
print('\nRecognized chunks of English speech:')
for it in recognition_result['chunks']:
print(f' {it}')
As a result, you can see a text output like this:
Recognized chunks of Russian speech:
{'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'}
{'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'}
{'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'}
{'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'}
{'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'}
{'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'}
{'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'}
{'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'}
Recognized chunks of English speech:
{'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'}
{'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'}
Voice activity detection (speech/non-speech)
Along with the special language tokens, the model can also return the special token <|nospeech|> if the input audio signal does not contain any speech (for details, see Section 2.3 of the corresponding Whisper paper). This capability forms the basis of the speech/non-speech classification algorithm, as demonstrated in the following example:
nonspeech_sound_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_nonspeech.wav'
nonspeech_sound_name = wget.download(nonspeech_sound_url)
nonspeech_sound = librosa.load(nonspeech_sound_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound without speech = {0:.3f} seconds.'.format(
nonspeech_sound.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
nonspeech_sound,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
As a result, you can see a text output like this:
Duration of sound without speech = 10.000 seconds.
Top-3 languages:
NO SPEECH 0.9957
lingala 0.0002
cantonese 0.0002
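Building on this output, here is a minimal sketch of turning the detection result into a binary speech/non-speech decision. The 'NO SPEECH' label string and the probability threshold below are assumptions inferred from the printed output above, not a published API:
def contains_speech(sound, asr, no_speech_threshold=0.5):
    # Hypothetical helper: the signal is treated as non-speech when the
    # top-ranked label from detect_language_in_speech is 'NO SPEECH'
    # with a sufficiently high probability.
    detected = detect_language_in_speech(
        sound,
        asr.feature_extractor,
        asr.tokenizer,
        asr.model
    )
    top_label, top_probability = detected[0]
    return not (top_label == 'NO SPEECH' and top_probability >= no_speech_threshold)

print(contains_speech(sound_ru, asr))         # expected: True
print(contains_speech(nonspeech_sound, asr))  # expected: False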
Speech translation
In addition to the transcription task, the model also performs speech translation (although it translates better from Russian into English than from English into Russian):
print('Speech translation from Russian to English:')
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'translate', 'language': 'english'},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
print('Speech translation from English to Russian:')
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'translate', 'language': 'russian'},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
As a result, you can see a text output like this:
Speech translation from Russian to English:
Well, Visper, what is Visper? Visper is already a complete end-to-end neural network with an autoregressive decoder. That is, it's not a pure encoder like Wave2Vec, it's not just a text-to-seq encoder-decoder like T5, it's a complete algorithm for the transformation of speech into text, where the encoder considers, first of all, acoustic features of speech, well, and the semantics are also gradually moving, and the decoder is already a language model that generates token by token.
Speech translation from English to Russian:
Энсемблинг может помочь нам осуществлять хорошо известный торговый байз-вариант. Мы можем ограничить варианты на основе крупного энсембла, крупного энсембла разных алгоритмов.
As you can see, both translations contain errors; however, the errors in the English-to-Russian translation are more significant.
Bias, Risks, and Limitations
- While improvements are observed for English and translation tasks, statistically significant advantages are confirmed only for Russian ASR
- The model's performance on code-switching speech (where speakers alternate between Russian and English within the same utterance) has not been specifically evaluated
- Inherits the basic limitations of the Whisper architecture
Training Details
Training Data
The model was fine-tuned on a composite dataset including:
- Common Voice (Ru, En)
- Podlodka Speech (Ru)
- Taiga Speech (Ru, synthetic)
- Golos Farfield and Golos Crowd (Ru)
- Sova Rudevices (Ru)
- Audioset (non-speech audio)
Training Features
1. Data Augmentation:
- Dynamic mixing of speech with background noise and music
- Gradual reduction of signal-to-noise ratio during training
2. Text Data Processing:
- Russian text punctuation and capitalization restoration using bond005/ruT5-ASR-large (for speech sub-corpora without punctuated annotations; a sketch of this step follows the list)
- Parallel Russian-English text generation using Qwen/Qwen2.5-14B-Instruct
- Multi-stage validation of generated texts to minimize hallucinations using bond005/xlm-roberta-xl-hallucination-detector
3. Training Strategy:
- Progressive increase in training example complexity
- Balanced sampling between speech and non-speech data
- Special handling of language tokens and the no-speech token <|nospeech|>
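To illustrate the punctuation and capitalization restoration step from item 2, here is a hedged sketch using a text2text-generation pipeline. The exact prompting and post-processing applied during dataset preparation are not published here, so treat the call below as an assumption and consult the bond005/ruT5-ASR-large model card for the expected input format:
from transformers import pipeline

# Assumed usage: restore punctuation and capitalization in raw Russian ASR output.
punctuator = pipeline(
    'text2text-generation',
    model='bond005/ruT5-ASR-large',
    device_map='auto'
)
raw_asr_output = 'виспер это уже полноценное нейросетевое решение'
restored = punctuator(raw_asr_output, max_new_tokens=128)
print(restored[0]['generated_text'])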
Evaluation
The experimental evaluation focused on two main tasks:
- Russian speech recognition
- Speech activity detection (binary classification "speech/non-speech")
Testing was performed on publicly available Russian speech corpora. Speech recognition was conducted using the standard pipeline from the Hugging Face 🤗 Transformers library. Because this pipeline has limitations in language identification and non-speech detection (caused by a known bug), the whisper-lid library was used to detect the presence or absence of speech in the signal.
Testing Data & Metrics
Testing Data
The quality of the Russian speech recognition task was tested on the test subsets of six different datasets: Podlodka Speech, Russian Librispeech, Sberdevices Golos (farfield), Sberdevices Golos (crowd), Sova RuDevices, and Common Voice (see the Results tables below).
The quality of the voice activity detection task was tested on the test subsets of two different datasets:
- noised version of Golos Crowd as a source of speech samples
- filtered sub-set of Audioset corpus as a source of non-speech samples
Noise was added using a special augmenter capable of simulating the superposition of five different types of acoustic noise (reverberation, speech-like sounds, music, household sounds, and pet sounds) at a given signal-to-noise ratio (2 dB in this case).
The quality of the robust Russian speech recognition task was tested on the test subset of the above-mentioned noised Golos Crowd.
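For illustration, here is a minimal sketch of mixing noise into speech at a target SNR; the actual augmenter (with reverberation and several noise types) is more elaborate and is not published here:
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Illustrative sketch: scale the noise so that the speech-to-noise
    # power ratio equals the requested SNR (in dB), then mix.
    if noise.shape[0] < speech.shape[0]:
        repeats = int(np.ceil(speech.shape[0] / noise.shape[0]))
        noise = np.tile(noise, repeats)
    noise = noise[:speech.shape[0]]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    mixed = speech + noise * np.sqrt(target_noise_power / noise_power)
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # prevent clipping

noisy_speech = mix_at_snr(sound_ru, nonspeech_sound, snr_db=2.0)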
Metrics
1. Modified WER (Word Error Rate) for Russian speech recognition quality (a simplified sketch follows this list):
   - Text normalization before WER calculation:
     - Unification of numeral representations (digits/words)
     - Standardization of foreign words (Cyrillic/Latin scripts)
     - Accounting for valid transliteration variants
   - Enables more accurate assessment of semantic recognition accuracy
   - The lower the WER, the better the speech recognition quality
2. F1-score for speech activity detection:
   - Binary classification "speech/non-speech"
   - Evaluation of non-speech segment detection accuracy using the <|nospeech|> token
   - The higher the F1 score, the better the voice activity detection quality
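As a simplified sketch of how such a modified WER can be computed (assuming the jiwer package; the normalization below only lowercases and strips punctuation, whereas the actual procedure additionally unifies numerals, foreign-word scripts, and transliteration variants):
import re
import jiwer  # pip install jiwer

def normalize_text(text):
    # Simplified normalization: lowercase and strip punctuation only.
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return ' '.join(text.split())

reference = 'Виспер — это полноценное end-to-end решение.'
hypothesis = 'виспер это полноценное энд ту энд решение'
print(jiwer.wer(normalize_text(reference), normalize_text(hypothesis)))
# The VAD F1-score can be computed analogously on binary speech/non-speech
# labels, e.g. with sklearn.metrics.f1_score(y_true, y_pred).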
Results
Automatic Speech Recognition (ASR)
Result (WER, %):
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|---|---|---|
| bond005/podlodka_speech | 7.81 | 8.33 |
| rulibrispeech | 9.61 | 10.25 |
| sberdevices_golos_farfield | 11.26 | 20.12 |
| sberdevices_golos_crowd | 11.82 | 14.55 |
| sova_rudevices | 15.26 | 17.70 |
| common_voice_11_0 | 5.22 | 6.63 |
Voice Activity Detection (VAD)
Result (F1):
| bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|---|---|
| 0.9214 | 0.8484 |
Robust ASR (SNR = 2 dB, speech-like noise, music, etc.)
Result (WER, %):
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|---|---|---|
| sberdevices_golos_crowd (noised) | 46.14 | 75.20 |
Citation
If you use this model in your work, please cite it as:
@misc{whisper-podlodka-turbo,
  author       = {Ivan Bondarenko},
  title        = {Whisper-Podlodka-Turbo: Enhanced Whisper Model for Russian ASR},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bond005/whisper-podlodka-turbo}}
}