Update README.md
README.md
pip install --upgrade transformers datasets[audio] accelerate
```

I also recommend using [`whisper-lid`](https://github.com/bond005/whisper-lid) for initial spoken-language detection, so this library is worth installing as well:

```bash
pip install --upgrade whisper-lid
```

### Direct Use

#### Speech recognition

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio in an arbitrary language:

```python
import librosa  # for loading sound from a local file
from transformers import pipeline  # for working with Whisper-Podlodka-Turbo
import wget  # for downloading the demo sound from its URL
from whisper_lid.whisper_lid import detect_language_in_speech  # for spoken language detection

model_id = "bond005/whisper-podlodka-turbo"  # the best Whisper model :-)
target_sampling_rate = 16_000  # Hz

asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto')

# An example of speech recognition in Russian, spoken by a native speaker of this language
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
sound_ru_name = wget.download(sound_ru_url)
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format(
    sound_ru.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
    sound_ru,
    asr.feature_extractor,
    asr.tokenizer,
    asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
    print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
    sound_ru,
    generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
    return_timestamps=False
)
print(recognition_result['text'] + '\n')

# An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav'
sound_en_name = wget.download(sound_en_url)
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with English speech = {0:.3f} seconds.'.format(
    sound_en.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
    sound_en,
    asr.feature_extractor,
    asr.tokenizer,
    asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
    print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
    sound_en,
    generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
    return_timestamps=False
)
print(recognition_result['text'] + '\n')
```

As a result, you should see a text output like this:

```text
Duration of sound with Russian speech = 29.947 seconds.
Top-3 languages:
    russian 0.9568
    english 0.0372
  ukrainian 0.0013
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном.

Duration of sound with English speech = 20.247 seconds.
Top-3 languages:
    english 0.9526
    russian 0.0311
     polish 0.0006
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms.
```

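If you want to transcribe your own recording instead of the demo files, the same `asr` pipeline also accepts a path to a local audio file and can split long audio into overlapping windows. The following is only a minimal sketch under stated assumptions: `my_recording.wav` is a placeholder file name, the 30-second chunk length and the batch size are illustrative values, and decoding a file by path relies on `ffmpeg` being available on the system.

```python
# A minimal sketch (not from the model card): transcribing a long local recording.
# 'my_recording.wav' is a placeholder; chunk_length_s and batch_size are illustrative.
long_result = asr(
    'my_recording.wav',  # the pipeline decodes the file itself (requires ffmpeg)
    chunk_length_s=30,  # split long audio into 30-second windows
    batch_size=4,  # decode several windows in parallel
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=False
)
print(long_result['text'])
```
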
#### Speech recognition with timestamps

In addition to plain transcription, the model can also return timestamps for the recognized speech fragments:

```python
recognition_result = asr(
    sound_ru,
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=True
)
print('Recognized chunks of Russian speech:')
for it in recognition_result['chunks']:
    print(f' {it}')

recognition_result = asr(
    sound_en,
    generate_kwargs={'task': 'transcribe', 'language': 'english'},
    return_timestamps=True
)
print('\nRecognized chunks of English speech:')
for it in recognition_result['chunks']:
    print(f' {it}')
```

As a result, you should see a text output like this:

```text
Recognized chunks of Russian speech:
 {'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'}
 {'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'}
 {'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'}
 {'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'}
 {'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'}
 {'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'}
 {'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'}
 {'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'}

Recognized chunks of English speech:
 {'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'}
 {'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'}
```

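Each chunk is a plain Python dict with a `timestamp` pair (start and end, in seconds) and the recognized `text`, so the result is easy to post-process. As one possible illustration (not an API of this model, `transformers`, or `whisper-lid`), the hypothetical helper below converts the chunks into SubRip (SRT) subtitles; the helper name and the output file name are placeholders.

```python
# A minimal sketch: converting pipeline chunks into SRT subtitles.
# chunks_to_srt is a hypothetical helper, not part of any library used above.
def chunks_to_srt(chunks):
    def fmt(t):
        # format seconds as an SRT timestamp: HH:MM:SS,mmm
        ms = int(round(t * 1000))
        hours, rest = divmod(ms, 3_600_000)
        minutes, rest = divmod(rest, 60_000)
        secs, ms = divmod(rest, 1000)
        return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'

    srt_lines = []
    for idx, chunk in enumerate(chunks, start=1):
        start, end = chunk['timestamp']
        if end is None:  # the last chunk may have an open end
            end = start
        srt_lines.append(f'{idx}\n{fmt(start)} --> {fmt(end)}\n{chunk["text"].strip()}\n')
    return '\n'.join(srt_lines)

with open('subtitles.srt', 'w', encoding='utf-8') as fp:  # placeholder file name
    fp.write(chunks_to_srt(recognition_result['chunks']))
```
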
### Out-of-Scope Use