library_name: transformers
license: apache-2.0
language:
- en
- ru
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
datasets:
- mozilla-foundation/common_voice_17_0
- bond005/rulibrispeech
- bond005/podlodka_speech
- bond005/sberdevices_golos_10h_crowd
- bond005/sberdevices_golos_100h_farfield
- bond005/taiga_speech_v2
- bond005/audioset-nonspeech
metrics:
- wer
model-index:
- name: Whisper-Podlodka-Turbo by Ivan Bondarenko
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Podlodka Speech
type: bond005/podlodka_speech
args: ru
metrics:
- name: Test WER
type: wer
value: 7.81
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice ru
type: mozilla-foundation/common_voice_11_0
args: ru
metrics:
- name: Test WER
type: wer
value: 5.22
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Sova RuDevices
type: bond005/sova_rudevices
args: ru
metrics:
- name: Test WER
type: wer
value: 15.26
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Russian Librispeech
type: bond005/rulibrispeech
args: ru
metrics:
- name: Test WER
type: wer
value: 9.61
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Sberdevices Golos (farfield)
type: bond005/sberdevices_golos_100h_farfield
args: ru
metrics:
- name: Test WER
type: wer
value: 11.26
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Sberdevices Golos (crowd)
type: bond005/sberdevices_golos_10h_crowd
args: ru
metrics:
- name: Test WER
type: wer
value: 11.82
Whisper-Podlodka-Turbo
Whisper-Podlodka-Turbo is a fine-tuned version of Whisper large-v3-turbo. The main goal of the fine-tuning is to improve the quality of speech recognition and speech translation for Russian and English, as well as to reduce hallucinations when processing non-speech audio signals.
Model Description
Whisper-Podlodka-Turbo is a fine-tuned version of Whisper-Large-V3-Turbo, optimized for high-quality Russian speech recognition with proper punctuation and capitalization, and enhanced with resistance to background noise.
Key Benefits
- 🎯 Improved Russian speech recognition quality compared to the base Whisper-Large-V3-Turbo model
- ✍️ Correct Russian punctuation and capitalization
- 🎧 Enhanced background noise resistance
- 🚫 Reduced number of hallucinations, especially in non-speech segments
Supported Tasks
- Automatic Speech Recognition (ASR):
- 🇷🇺 Russian (primary focus)
- 🇬🇧 English
- Speech Translation:
- Russian ↔️ English
- Speech Language Detection (including non-speech detection)
Uses
Installation
Whisper-Podlodka-Turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
I also recommend using whisper-lid for initial spoken language detection, so this library is worth installing as well:
pip install --upgrade whisper-lid
Use Cases
Speech recognition
The model can be used with the pipeline class to transcribe audio in an arbitrary language:
import librosa # for loading sound from local file
from transformers import pipeline # for working with Whisper-Podlodka-Turbo
import wget # for downloading demo sound from its URL
from whisper_lid.whisper_lid import detect_language_in_speech # for spoken language detection
model_id = "bond005/whisper-podlodka-turbo" # the best Whisper model :-)
target_sampling_rate = 16_000 # Hz
asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto')
# An example of speech recognition in Russian, spoken by a native speaker of this language
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
sound_ru_name = wget.download(sound_ru_url)
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format(
sound_ru.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
sound_ru,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
# An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav'
sound_en_name = wget.download(sound_en_url)
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with English speech = {0:.3f} seconds.'.format(
sound_en.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
sound_en,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
As a result, you can see a text output like this:
Duration of sound with Russian speech = 29.947 seconds.
Top-3 languages:
russian 0.9568
english 0.0372
ukrainian 0.0013
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном.
Duration of sound with English speech = 20.247 seconds.
Top-3 languages:
english 0.9526
russian 0.0311
polish 0.0006
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms.
Speech recognition with timestamps
In addition to the usual recognition, the model can also provide timestamps for recognized speech fragments:
recognition_result = asr(
    sound_ru,
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=True
)
print('Recognized chunks of Russian speech:')
for it in recognition_result['chunks']:
print(f' {it}')
recognition_result = asr(
    sound_en,
    generate_kwargs={'task': 'transcribe', 'language': 'english'},
    return_timestamps=True
)
print('\nRecognized chunks of English speech:')
for it in recognition_result['chunks']:
print(f' {it}')
As a result, you can see a text output like this:
Recognized chunks of Russian speech:
{'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'}
{'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'}
{'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'}
{'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'}
{'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'}
{'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'}
{'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'}
{'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'}
Recognized chunks of English speech:
{'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'}
{'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'}
Voice activity detection (speech/non-speech)
Along with the special language tokens, the model can also return the special token <|nospeech|> if the input audio signal does not contain any speech (for details, see Section 2.3 of the corresponding Whisper paper). This capability forms the basis of the speech/non-speech classification algorithm, as demonstrated in the following example:
nonspeech_sound_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_nonspeech.wav'
nonspeech_sound_name = wget.download(nonspeech_sound_url)
nonspeech_sound = librosa.load(nonspeech_sound_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound without speech = {0:.3f} seconds.'.format(
nonspeech_sound.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
nonspeech_sound,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
As a result, you can see a text output like this:
Duration of sound without speech = 10.000 seconds.
Top-3 languages:
NO SPEECH 0.9957
lingala 0.0002
cantonese 0.0002
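Building on this output, here is a minimal sketch of turning the detection result into a binary speech/non-speech decision. The 'NO SPEECH' label string and the probability threshold below are assumptions inferred from the printed output above, not a published API:
def contains_speech(sound, asr, no_speech_threshold=0.5):
    # Hypothetical helper: the signal is treated as non-speech when the
    # top-ranked label from detect_language_in_speech is 'NO SPEECH'
    # with a sufficiently high probability.
    detected = detect_language_in_speech(
        sound,
        asr.feature_extractor,
        asr.tokenizer,
        asr.model
    )
    top_label, top_probability = detected[0]
    return not (top_label == 'NO SPEECH' and top_probability >= no_speech_threshold)

print(contains_speech(sound_ru, asr))         # expected: True
print(contains_speech(nonspeech_sound, asr))  # expected: False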
Speech translation
In addition to the transcription task, the model also performs speech translation (although it translates better from Russian into English than from English into Russian):
print('Speech translation from Russian to English:')
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'translate', 'language': 'english'},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
print('Speech translation from English to Russian:')
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'translate', 'language': 'russian'},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
As a result, you can see a text output like this:
Speech translation from Russian to English:
Well, Visper, what is Visper? Visper is already a complete end-to-end neural network with an autoregressive decoder. That is, it's not a pure encoder like Wave2Vec, it's not just a text-to-seq encoder-decoder like T5, it's a complete algorithm for the transformation of speech into text, where the encoder considers, first of all, acoustic features of speech, well, and the semantics are also gradually moving, and the decoder is already a language model that generates token by token.
Speech translation from English to Russian:
Энсемблинг может помочь нам осуществлять хорошо известный торговый байз-вариант. Мы можем ограничить варианты на основе крупного энсембла, крупного энсембла разных алгоритмов.
As you can see, both translations contain errors; however, the errors in the English-to-Russian translation are more significant.
Bias, Risks, and Limitations
- While improvements are observed for English and translation tasks, statistically significant advantages are confirmed only for Russian ASR
- The model's performance on code-switching speech (where speakers alternate between Russian and English within the same utterance) has not been specifically evaluated
- Inherits the basic limitations of the Whisper architecture
Training Details
Training Data
The model was fine-tuned on a composite dataset including:
- Common Voice (Ru, En)
- Podlodka Speech (Ru)
- Taiga Speech (Ru, synthetic)
- Golos Farfield and Golos Crowd (Ru)
- Sova Rudevices (Ru)
- Audioset (non-speech audio)
Training Features
1. Data Augmentation:
- Dynamic mixing of speech with background noise and music
- Gradual reduction of signal-to-noise ratio during training
2. Text Data Processing:
- Russian text punctuation and capitalization restoration using bond005/ruT5-ASR-large (for speech sub-corpora without punctuated annotations; a sketch of this step follows the list)
- Parallel Russian-English text generation using Qwen/Qwen2.5-14B-Instruct
- Multi-stage validation of generated texts to minimize hallucinations using bond005/xlm-roberta-xl-hallucination-detector
3. Training Strategy:
- Progressive increase in training example complexity
- Balanced sampling between speech and non-speech data
- Special handling of language tokens and the no-speech token <|nospeech|>
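To illustrate the punctuation and capitalization restoration step from item 2, here is a hedged sketch using a text2text-generation pipeline. The exact prompting and post-processing applied during dataset preparation are not published here, so treat the call below as an assumption and consult the bond005/ruT5-ASR-large model card for the expected input format:
from transformers import pipeline

# Assumed usage: restore punctuation and capitalization in raw Russian ASR output.
punctuator = pipeline(
    'text2text-generation',
    model='bond005/ruT5-ASR-large',
    device_map='auto'
)
raw_asr_output = 'виспер это уже полноценное нейросетевое решение'
restored = punctuator(raw_asr_output, max_new_tokens=128)
print(restored[0]['generated_text'])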
Evaluation
The experimental evaluation focused on two main tasks:
- Russian speech recognition
- Speech activity detection (binary classification "speech/non-speech")
Testing was performed on publicly available Russian speech corpora. Speech recognition was conducted using the standard pipeline from the Hugging Face 🤗 Transformers library. Because this pipeline has limitations in language identification and non-speech detection (caused by a known bug), the whisper-lid library was used to detect the presence or absence of speech in the signal.
Testing Data & Metrics
Testing Data
The quality of the Russian speech recognition task was tested on the test subsets of six different datasets: Podlodka Speech, Russian Librispeech, Sberdevices Golos (farfield), Sberdevices Golos (crowd), Sova RuDevices, and Common Voice (see the Results tables below).
The quality of the voice activity detection task was tested on the test subsets of two different datasets:
- noised version of Golos Crowd as a source of speech samples
- filtered sub-set of Audioset corpus as a source of non-speech samples
Noise was added using a special augmenter capable of simulating the superposition of five different types of acoustic noise (reverberation, speech-like sounds, music, household sounds, and pet sounds) at a given signal-to-noise ratio (2 dB in this case).
The quality of the robust Russian speech recognition task was tested on the test subset of the above-mentioned noised Golos Crowd.
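For illustration, here is a minimal sketch of mixing noise into speech at a target SNR; the actual augmenter (with reverberation and several noise types) is more elaborate and is not published here:
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Illustrative sketch: scale the noise so that the speech-to-noise
    # power ratio equals the requested SNR (in dB), then mix.
    if noise.shape[0] < speech.shape[0]:
        repeats = int(np.ceil(speech.shape[0] / noise.shape[0]))
        noise = np.tile(noise, repeats)
    noise = noise[:speech.shape[0]]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    mixed = speech + noise * np.sqrt(target_noise_power / noise_power)
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # prevent clipping

noisy_speech = mix_at_snr(sound_ru, nonspeech_sound, snr_db=2.0)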
Metrics
1. Modified WER (Word Error Rate) for Russian speech recognition quality (a simplified sketch follows this list):
   - Text normalization before WER calculation:
     - Unification of numeral representations (digits/words)
     - Standardization of foreign words (Cyrillic/Latin scripts)
     - Accounting for valid transliteration variants
   - Enables more accurate assessment of semantic recognition accuracy
   - The lower the WER, the better the speech recognition quality
2. F1-score for speech activity detection:
   - Binary classification "speech/non-speech"
   - Evaluation of non-speech segment detection accuracy using the <|nospeech|> token
   - The higher the F1 score, the better the voice activity detection quality
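As a simplified sketch of how such a modified WER can be computed (assuming the jiwer package; the normalization below only lowercases and strips punctuation, whereas the actual procedure additionally unifies numerals, foreign-word scripts, and transliteration variants):
import re
import jiwer  # pip install jiwer

def normalize_text(text):
    # Simplified normalization: lowercase and strip punctuation only.
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return ' '.join(text.split())

reference = 'Виспер — это полноценное end-to-end решение.'
hypothesis = 'виспер это полноценное энд ту энд решение'
print(jiwer.wer(normalize_text(reference), normalize_text(hypothesis)))
# The VAD F1-score can be computed analogously on binary speech/non-speech
# labels, e.g. with sklearn.metrics.f1_score(y_true, y_pred).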
Results
Automatic Speech Recognition (ASR)
Result (WER, %):
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|---|---|---|
| bond005/podlodka_speech | 7.81 | 8.33 |
| rulibrispeech | 9.61 | 10.25 |
| sberdevices_golos_farfield | 11.26 | 20.12 |
| sberdevices_golos_crowd | 11.82 | 14.55 |
| sova_rudevices | 15.26 | 17.70 |
| common_voice_11_0 | 5.22 | 6.63 |
Voice Activity Detection (VAD)
Result (F1):
| bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|---|---|
| 0.9214 | 0.8484 |
Robust ASR (SNR = 2 dB, speech-like noise, music, etc.)
Result (WER, %):
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|---|---|---|
| sberdevices_golos_crowd (noised) | 46.14 | 75.20 |
Citation
If you use this model in your work, please cite it as:
@misc{whisper-podlodka-turbo,
  author       = {Ivan Bondarenko},
  title        = {Whisper-Podlodka-Turbo: Enhanced Whisper Model for Russian ASR},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bond005/whisper-podlodka-turbo}}
}