|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- ru |
|
base_model: |
|
- openai/whisper-large-v3-turbo |
|
pipeline_tag: automatic-speech-recognition |
|
datasets: |
|
- mozilla-foundation/common_voice_17_0 |
|
- bond005/rulibrispeech |
|
- bond005/podlodka_speech |
|
- bond005/sberdevices_golos_10h_crowd |
|
- bond005/sberdevices_golos_100h_farfield |
|
- bond005/taiga_speech_v2 |
|
- bond005/audioset-nonspeech |
|
metrics: |
|
- wer |
|
model-index: |
|
- name: Whisper-Podlodka-Turbo by Ivan Bondarenko |
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Podlodka Speech |
|
type: bond005/podlodka_speech |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 7.81 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Common Voice ru |
|
type: mozilla-foundation/common_voice_11_0 |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 5.22 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sova RuDevices |
|
type: bond005/sova_rudevices |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 15.26 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Russian Librispeech |
|
type: bond005/rulibrispeech |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 9.61 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sberdevices Golos (farfield) |
|
type: bond005/sberdevices_golos_100h_farfield |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 11.26 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sberdevices Golos (crowd) |
|
type: bond005/sberdevices_golos_10h_crowd |
|
args: ru |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 11.82 |
|
--- |
|
|
|
# Whisper-Podlodka-Turbo |
|
|
|
Whisper-Podlodka-Turbo is a fine-tuned version of [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo). The main goals of the fine-tuning are to improve the quality of speech recognition and speech translation for Russian and English, and to reduce hallucinations when processing non-speech audio signals.
|
|
|
## Model Description |
|
|
|
**Whisper-Podlodka-Turbo** is a fine-tuned version of [Whisper-Large-V3-Turbo](https://huggingface.co/openai/whisper-large-v3-turbo), optimized for high-quality Russian speech recognition with proper punctuation and capitalization, and made more resistant to background noise.
|
|
|
### Key Benefits |
|
|
|
- 🎯 Improved Russian speech recognition quality compared to the base Whisper-Large-V3-Turbo model |
|
- ✍️ Correct Russian punctuation and capitalization |
|
- 🎧 Enhanced background noise resistance |
|
- 🚫 Reduced number of hallucinations, especially in non-speech segments |
|
|
|
### Supported Tasks |
|
|
|
- Automatic Speech Recognition (ASR): |
|
- 🇷🇺 Russian (primary focus) |
|
- 🇬🇧 English |
|
- Speech Translation: |
|
- Russian ↔️ English |
|
- Speech Language Detection (including non-speech detection) |
|
|
|
## Uses |
|
|
|
### Installation |
|
|
|
**Whisper-Podlodka-Turbo** is supported in Hugging Face 🤗 [Transformers](https://huggingface.co/docs/transformers/index). To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
|
|
|
```bash |
|
pip install --upgrade pip |
|
pip install --upgrade transformers datasets[audio] accelerate |
|
``` |
|
|
|
I also recommend using [`whisper-lid`](https://github.com/bond005/whisper-lid) for initial spoken-language detection, so this library is worth installing as well:
|
|
|
```bash |
|
pip install --upgrade whisper-lid |
|
``` |
|
|
|
### Use Cases
|
|
|
#### Speech recognition |
|
|
|
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio in an arbitrary language:
|
|
|
```python |
|
import librosa # for loading sound from local file |
|
from transformers import pipeline # for working with Whisper-Podlodka-Turbo |
|
import wget # for downloading demo sound from its URL |
|
from whisper_lid.whisper_lid import detect_language_in_speech # for spoken language detection |
|
|
|
model_id = "bond005/whisper-podlodka-turbo" # the best Whisper model :-) |
|
target_sampling_rate = 16_000 # Hz |
|
|
|
asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto') |
|
|
|
# An example of speech recognition in Russian, spoken by a native speaker of this language |
|
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav' |
|
sound_ru_name = wget.download(sound_ru_url) |
|
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0] |
|
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format( |
|
sound_ru.shape[0] / target_sampling_rate |
|
)) |
|
detected_languages = detect_language_in_speech( |
|
sound_ru, |
|
asr.feature_extractor, |
|
asr.tokenizer, |
|
asr.model |
|
) |
|
print('Top-3 languages:') |
|
lang_text_width = max([len(it[0]) for it in detected_languages]) |
|
for it in detected_languages[0:3]: |
|
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1])) |
|
recognition_result = asr( |
|
sound_ru, |
|
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
|
|
# An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent |
|
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav' |
|
sound_en_name = wget.download(sound_en_url) |
|
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0] |
|
print('Duration of sound with English speech = {0:.3f} seconds.'.format( |
|
sound_en.shape[0] / target_sampling_rate |
|
)) |
|
detected_languages = detect_language_in_speech( |
|
sound_en, |
|
asr.feature_extractor, |
|
asr.tokenizer, |
|
asr.model |
|
) |
|
print('Top-3 languages:') |
|
lang_text_width = max([len(it[0]) for it in detected_languages]) |
|
for it in detected_languages[0:3]: |
|
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1])) |
|
recognition_result = asr( |
|
sound_en, |
|
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Duration of sound with Russian speech = 29.947 seconds. |
|
Top-3 languages: |
|
russian 0.9568 |
|
english 0.0372 |
|
ukrainian 0.0013 |
|
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном. |
|
|
|
Duration of sound with English speech = 20.247 seconds. |
|
Top-3 languages: |
|
english 0.9526 |
|
russian 0.0311 |
|
polish 0.0006 |
|
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms. |
|
|
|
``` |
|
|
|
#### Speech recognition with timestamps |
|
|
|
In addition to the usual recognition, the model can also provide timestamps for recognized speech fragments: |
|
|
|
```python |
|
recognition_result = asr( |
|
sound_ru, |
|
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},

    return_timestamps=True
|
) |
|
print('Recognized chunks of Russian speech:') |
|
for it in recognition_result['chunks']: |
|
print(f' {it}') |
|
|
|
recognition_result = asr( |
|
sound_en, |
|
    generate_kwargs={'task': 'transcribe', 'language': 'english'},

    return_timestamps=True
|
) |
|
print('\nRecognized chunks of English speech:') |
|
for it in recognition_result['chunks']: |
|
print(f' {it}') |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Recognized chunks of Russian speech: |
|
{'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'} |
|
{'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'} |
|
{'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'} |
|
{'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'} |
|
{'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'} |
|
{'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'} |
|
{'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'} |
|
{'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'} |
|
|
|
Recognized chunks of English speech: |
|
{'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'} |
|
{'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'} |
|
``` |
|
|
|
#### Voice activity detection (speech/non-speech) |
|
|
|
Along with the special language tokens, the model can also return the special token `<|nospeech|>` when the input audio signal does not contain any speech (for details, see section 2.3 of the [Whisper paper](https://arxiv.org/pdf/2212.04356)). This ability forms the basis of a simple speech/non-speech classification algorithm, as demonstrated in the following example:
|
|
|
```python |
|
nonspeech_sound_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_nonspeech.wav' |
|
nonspeech_sound_name = wget.download(nonspeech_sound_url) |
|
nonspeech_sound = librosa.load(nonspeech_sound_name, sr=target_sampling_rate, mono=True)[0] |
|
print('Duration of sound without speech = {0:.3f} seconds.'.format( |
|
nonspeech_sound.shape[0] / target_sampling_rate |
|
)) |
|
detected_languages = detect_language_in_speech( |
|
nonspeech_sound, |
|
asr.feature_extractor, |
|
asr.tokenizer, |
|
asr.model |
|
) |
|
print('Top-3 languages:') |
|
lang_text_width = max([len(it[0]) for it in detected_languages]) |
|
for it in detected_languages[0:3]: |
|
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1])) |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Duration of sound without speech = 10.000 seconds. |
|
Top-3 languages: |
|
NO SPEECH 0.9957 |
|
lingala 0.0002 |
|
cantonese 0.0002 |
|
``` |
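Turning this output into a binary speech/non-speech decision is straightforward. The sketch below is a minimal example, not the evaluation code behind the F1 scores reported later; it reuses the `asr` pipeline, `sound_ru`, and `nonspeech_sound` objects from the previous examples, the `'NO SPEECH'` label spelling follows the sample output above, and the 0.5 probability threshold is an assumption that may need tuning:

```python
def is_speech(sound, asr_pipeline, no_speech_threshold=0.5):
    """Classify a mono 16 kHz signal as speech (True) or non-speech (False)."""
    detected_languages = detect_language_in_speech(
        sound,
        asr_pipeline.feature_extractor,
        asr_pipeline.tokenizer,
        asr_pipeline.model
    )
    top_label, top_probability = detected_languages[0]  # hypotheses are sorted by probability
    return not (top_label == 'NO SPEECH' and top_probability >= no_speech_threshold)


print(is_speech(sound_ru, asr))         # expected: True
print(is_speech(nonspeech_sound, asr))  # expected: False
```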
|
|
|
#### Speech translation |
|
|
|
In addition to the transcription task, the model also performs speech translation (although it translates better from Russian into English than from English into Russian): |
|
|
|
```python |
|
print('Speech translation from Russian to English:')
|
recognition_result = asr( |
|
sound_ru, |
|
generate_kwargs={'task': 'translate', 'language': 'english'}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
|
|
print('Speech translation from English to Russian:')
|
recognition_result = asr( |
|
sound_en, |
|
generate_kwargs={'task': 'translate', 'language': 'russian'}, |
|
return_timestamps=False |
|
) |
|
print(recognition_result['text'] + '\n') |
|
``` |
|
|
|
As a result, you can see a text output like this: |
|
|
|
```text |
|
Speech translation from Russian to English: |
|
Well, Visper, what is Visper? Visper is already a complete end-to-end neural network with an autoregressive decoder. That is, it's not a pure encoder like Wave2Vec, it's not just a text-to-seq encoder-decoder like T5, it's a complete algorithm for the transformation of speech into text, where the encoder considers, first of all, acoustic features of speech, well, and the semantics are also gradually moving, and the decoder is already a language model that generates token by token. |
|
|
|
Speech translation from English to Russian: |
|
Энсемблинг может помочь нам осуществлять хорошо известный торговый байз-вариант. Мы можем ограничить варианты на основе крупного энсембла, крупного энсембла разных алгоритмов. |
|
``` |
|
|
|
As you can see, the speech translation contains errors in both directions; however, the errors are more significant when translating from English into Russian.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- While improvements are observed for English and translation tasks, statistically significant advantages are confirmed only for Russian ASR |
|
- The model's performance on [code-switching speech](https://en.wikipedia.org/wiki/Code-switching) (where speakers alternate between Russian and English within the same utterance) has not been specifically evaluated |
|
- Inherits basic limitations of the Whisper architecture |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on a composite dataset including: |
|
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) (Ru, En) |
|
- [Podlodka Speech](https://huggingface.co/datasets/bond005/podlodka_speech) (Ru) |
|
- [Taiga Speech](https://huggingface.co/datasets/bond005/taiga_speech_v2) (Ru, synthetic) |
|
- [Golos Farfield](https://huggingface.co/datasets/bond005/sberdevices_golos_100h_farfield) and [Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd) (Ru) |
|
- [Sova Rudevices](https://huggingface.co/datasets/bond005/sova_rudevices) (Ru) |
|
- [Audioset](https://huggingface.co/datasets/bond005/audioset-nonspeech) (non-speech audio) |
|
|
|
### Training Features |
|
|
|
**1. Data Augmentation:** |
|
- Dynamic mixing of speech with background noise and music |
|
- Gradual reduction of the signal-to-noise ratio during training (a minimal mixing sketch is shown below)
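The augmentation code itself is not published in this repository; the following is only a minimal sketch of the core idea, mixing a speech signal with a noise signal at a target signal-to-noise ratio. The function name `mix_at_snr` is illustrative, not part of the actual training code:

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so that the mixture has the requested SNR (in dB)."""
    speech_power = float(np.mean(speech ** 2))
    noise_power = float(np.mean(noise ** 2)) + 1e-12  # guard against all-zero noise
    # Scale the noise so that speech_power / scaled_noise_power equals 10^(snr_db / 10).
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    if scaled_noise.shape[0] < speech.shape[0]:  # tile short noise clips
        repeats = int(np.ceil(speech.shape[0] / scaled_noise.shape[0]))
        scaled_noise = np.tile(scaled_noise, repeats)
    return speech + scaled_noise[:speech.shape[0]]
```

Gradually lowering `snr_db` over the course of training yields progressively harder examples.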
|
|
|
**2. Text Data Processing:** |
|
- Russian text punctuation and capitalization restoration using [bond005/ruT5-ASR-large](https://huggingface.co/bond005/ruT5-ASR-large) (for speech sub-corpora without punctuated annotations) |
|
- Parallel Russian-English text generation using [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
|
- Multi-stage validation of generated texts to minimize hallucinations using [bond005/xlm-roberta-xl-hallucination-detector](https://huggingface.co/bond005/xlm-roberta-xl-hallucination-detector) |
|
|
|
**3. Training Strategy:** |
|
- Progressive increase in training example complexity |
|
- Balanced sampling between speech and non-speech data |
|
- Special handling of language tokens and no-speech detection (`<|nospeech|>`) |
|
|
|
## Evaluation |
|
|
|
The experimental evaluation focused on two main tasks: |
|
|
|
1. Russian speech recognition |
|
2. Speech activity detection (binary classification "speech/non-speech") |
|
|
|
Testing was performed on publicly available Russian speech corpora. Speech recognition was conducted using [the standard pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) from the Hugging Face 🤗 [Transformers library](https://huggingface.co/docs/transformers/index). Because this pipeline is limited in language identification and non-speech detection (due to a known bug), the [whisper-lid](https://github.com/bond005/whisper-lid) library was used to detect the presence or absence of speech in the signal.
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
|
|
The quality of the Russian speech recognition task was tested on test sub-sets of six different datasets: |
|
|
|
- [Common Voice 11 Ru](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
|
- [Podlodka Speech](https://huggingface.co/datasets/bond005/podlodka_speech) |
|
- [Golos Farfield](https://huggingface.co/datasets/bond005/sberdevices_golos_100h_farfield) |
|
- [Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd) |
|
- [Sova Rudevices](https://huggingface.co/datasets/bond005/sova_rudevices) |
|
- [Russian Librispeech](https://huggingface.co/datasets/bond005/rulibrispeech) |
|
|
|
The quality of the voice activity detection task was tested on test sub-sets of two different datasets: |
|
|
|
- [noised version of Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd_noised_2db) as a source of speech samples |
|
- [filtered sub-set of Audioset corpus](https://huggingface.co/datasets/bond005/audioset-nonspeech) as a source of non-speech samples |
|
|
|
Noise was added using [a special augmenter](https://github.com/dangrebenkin/audio_augmentator) capable of simulating the superposition of five different types of acoustic noise (reverberation, speech-like sounds, music, household sounds, and pet sounds) at a given signal-to-noise ratio (in this case, a signal-to-noise ratio of 2 dB was used). |
|
|
|
The quality of the *robust* Russian speech recognition task was tested on the test sub-set of the above-mentioned [noised Golos Crowd](https://huggingface.co/datasets/bond005/sberdevices_golos_10h_crowd_noised_2db).
|
|
|
#### Metrics |
|
|
|
**1. Modified [WER (Word Error Rate)](https://en.wikipedia.org/wiki/Word_error_rate)** for Russian speech recognition quality: |
|
- Text normalization before WER calculation: |
|
- Unification of numeral representations (digits/words) |
|
- Standardization of foreign words (Cyrillic/Latin scripts) |
|
- Accounting for valid transliteration variants |
|
- Enables a more accurate assessment of recognition quality at the semantic level

- The lower the WER, the better the speech recognition quality (a toy normalization-plus-WER sketch follows below)
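The exact normalizer is not published here. As a toy approximation, the sketch below computes WER with the [`jiwer`](https://github.com/jitsi/jiwer) package after a deliberately simplified normalization that only lowercases and strips punctuation; the real normalization additionally unifies numerals and transliteration variants, as the example strings illustrate:

```python
import re

import jiwer  # pip install jiwer


def normalize(text: str) -> str:
    """Toy normalizer: lowercase and strip punctuation (the real one does more)."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


reference = 'В 2025 году Виспер распознаёт речь.'
hypothesis = 'в две тысячи двадцать пятом году виспер распознаёт речь'
print('{0:.4f}'.format(jiwer.wer(normalize(reference), normalize(hypothesis))))
```

Without unifying the numeral representations, this toy WER penalizes "2025" versus "две тысячи двадцать пятом", which is exactly the kind of mismatch the modified metric is designed to forgive.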
|
|
|
**2. [F1-score](https://en.wikipedia.org/wiki/F-score)** for speech activity detection: |
|
- Binary classification "speech/non-speech" |
|
- Evaluation of non-speech segment detection accuracy using the `<|nospeech|>` token

- The higher the F1-score, the better the voice activity detection quality (a minimal computation sketch follows below)
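For completeness, here is a minimal sketch of the F1 computation over a batch of binary speech/non-speech decisions (for example, those produced by the `is_speech` sketch above), using scikit-learn; both label lists are made-up placeholders:

```python
from sklearn.metrics import f1_score

# 1 = speech, 0 = non-speech; both lists are made-up placeholders
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print('F1 = {0:.4f}'.format(f1_score(y_true, y_pred)))
```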
|
|
|
### Results |
|
|
|
#### Automatic Speech Recognition (ASR) |
|
|
|
*Result (WER, %)*: |
|
|
|
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo | |
|
|----------------------------|--------------------------------|-------------------------------| |
|
| bond005/podlodka_speech | 7.81 | 8.33 | |
|
| rulibrispeech | 9.61 | 10.25 | |
|
| sberdevices_golos_farfield | 11.26 | 20.12 | |
|
| sberdevices_golos_crowd | 11.82 | 14.55 | |
|
| sova_rudevices | 15.26 | 17.70 | |
|
| common_voice_11_0 | 5.22 | 6.63 | |
|
|
|
#### Voice Activity Detection (VAD) |
|
|
|
*Result (F1)*: |
|
|
|
| bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo | |
|
|--------------------------------|-------------------------------| |
|
| 0.9214 | 0.8484 | |
|
|
|
#### Robust ASR (SNR = 2 dB, speech-like noise, music, etc.) |
|
|
|
*Result (WER, %)*: |
|
|
|
| Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo | |
|
|----------------------------------|--------------------------------|-------------------------------| |
|
| sberdevices_golos_crowd (noised) | 46.14 | 75.20 | |
|
|
|
## Citation |
|
|
|
If you use this model in your work, please cite it as: |
|
|
|
```bibtex |
|
@misc{whisper-podlodka-turbo, |
|
author = {Ivan Bondarenko}, |
|
title = {Whisper-Podlodka-Turbo: Enhanced Whisper Model for Russian ASR}, |
|
year = {2025}, |
|
publisher = {Hugging Face}, |
|
journal = {Hugging Face Model Hub}, |
|
howpublished = {\url{https://huggingface.co/bond005/whisper-podlodka-turbo}} |
|
} |
|
``` |