bond005 committed on
Commit dad47de · verified · 1 Parent(s): a7c8a83

Update README.md

Files changed (1):
  1. README.md +129 -2

README.md CHANGED
@@ -129,11 +129,138 @@ pip install --upgrade pip
  pip install --upgrade transformers datasets[audio] accelerate
  ```

+ Also, I recommend using [`whisper-lid`](https://github.com/bond005/whisper-lid) for initial spoken language detection, so this library is also worth installing:
+
+ ```bash
+ pip install --upgrade whisper-lid
+ ```
+
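+ If you want to sanity-check the detector before wiring it into the full example below, a minimal sketch is enough. It reuses the call signature shown in the Direct Use section and the demo sound from this repository; the returned structure, a list of (language, probability) pairs sorted by decreasing probability, is inferred from that example:
+
+ ```python
+ import librosa  # for loading sound from a local file
+ import wget  # for downloading the demo sound from its URL
+ from transformers import pipeline
+ from whisper_lid.whisper_lid import detect_language_in_speech
+
+ asr = pipeline(model='bond005/whisper-podlodka-turbo', device_map='auto', torch_dtype='auto')
+ sound_name = wget.download(
+     'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
+ )
+ sound = librosa.load(sound_name, sr=16_000, mono=True)[0]
+ # The most probable spoken language comes first in the returned list
+ print(detect_language_in_speech(sound, asr.feature_extractor, asr.tokenizer, asr.model)[0])
+ ```
+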
  ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- [More Information Needed]
+ #### Speech recognition
+
+ The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio in an arbitrary language:
+
+ ```python
+ import librosa  # for loading sound from a local file
+ from transformers import pipeline  # for working with Whisper-Podlodka-Turbo
+ import wget  # for downloading the demo sound from its URL
+ from whisper_lid.whisper_lid import detect_language_in_speech  # for spoken language detection
+
+ model_id = "bond005/whisper-podlodka-turbo"  # the best Whisper model :-)
+ target_sampling_rate = 16_000  # Hz
+
+ asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto')
+
+ # An example of speech recognition in Russian, spoken by a native speaker of this language
+ sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
+ sound_ru_name = wget.download(sound_ru_url)
+ sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0]
+ print('Duration of sound with Russian speech = {0:.3f} seconds.'.format(
+     sound_ru.shape[0] / target_sampling_rate
+ ))
+ detected_languages = detect_language_in_speech(
+     sound_ru,
+     asr.feature_extractor,
+     asr.tokenizer,
+     asr.model
+ )
+ print('Top-3 languages:')
+ lang_text_width = max([len(it[0]) for it in detected_languages])
+ for it in detected_languages[0:3]:
+     print('  {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
+ recognition_result = asr(
+     sound_ru,
+     generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
+     return_timestamps=False
+ )
+ print(recognition_result['text'] + '\n')
+
+ # An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent
+ sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav'
+ sound_en_name = wget.download(sound_en_url)
+ sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0]
+ print('Duration of sound with English speech = {0:.3f} seconds.'.format(
+     sound_en.shape[0] / target_sampling_rate
+ ))
+ detected_languages = detect_language_in_speech(
+     sound_en,
+     asr.feature_extractor,
+     asr.tokenizer,
+     asr.model
+ )
+ print('Top-3 languages:')
+ lang_text_width = max([len(it[0]) for it in detected_languages])
+ for it in detected_languages[0:3]:
+     print('  {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
+ recognition_result = asr(
+     sound_en,
+     generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
+     return_timestamps=False
+ )
+ print(recognition_result['text'] + '\n')
+ ```
+
+ As a result, you should see a text output like this:
+
+ ```text
+ Duration of sound with Russian speech = 29.947 seconds.
+ Top-3 languages:
+   russian 0.9568
+   english 0.0372
+   ukrainian 0.0013
+ Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном.
+
+ Duration of sound with English speech = 20.247 seconds.
+ Top-3 languages:
+   english 0.9526
+   russian 0.0311
+   polish 0.0006
+ Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms.
+
+ ```
+
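+ Both demo sounds fit into Whisper's 30-second window. For longer recordings, the `transformers` pipeline can chunk the audio itself via its `chunk_length_s` argument; a minimal sketch, reusing `asr` and `sound_ru` from the example above (the chunking setup is illustrative, not tuned):
+
+ ```python
+ # Long-form transcription: the pipeline splits the waveform into windows,
+ # decodes them independently, and merges them on their overlapping strides.
+ long_form_result = asr(
+     sound_ru,
+     chunk_length_s=30,  # Whisper's native window length, in seconds
+     generate_kwargs={'task': 'transcribe', 'language': 'russian'},
+     return_timestamps=False
+ )
+ print(long_form_result['text'])
+ ```
+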
+ #### Speech recognition with timestamps
+
+ In addition to plain transcription, the model can also return timestamps for the recognized speech fragments:
+
+ ```python
+ recognition_result = asr(
+     sound_ru,
+     generate_kwargs={'task': 'transcribe', 'language': 'russian'},
+     return_timestamps=True
+ )
+ print('Recognized chunks of Russian speech:')
+ for it in recognition_result['chunks']:
+     print(f'  {it}')
+
+ recognition_result = asr(
+     sound_en,
+     generate_kwargs={'task': 'transcribe', 'language': 'english'},
+     return_timestamps=True
+ )
+ print('\nRecognized chunks of English speech:')
+ for it in recognition_result['chunks']:
+     print(f'  {it}')
+ ```
+
+ As a result, you should see a text output like this:
+
+ ```text
+ Recognized chunks of Russian speech:
+   {'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'}
+   {'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'}
+   {'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'}
+   {'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'}
+   {'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'}
+   {'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'}
+   {'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'}
+   {'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'}
+
+ Recognized chunks of English speech:
+   {'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'}
+   {'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'}
+ ```
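+
+ Timestamped chunks map naturally onto subtitle formats. A minimal sketch of such a conversion (the `chunks_to_srt` helper is added here for illustration and is not part of the model; it assumes `recognition_result` from the English example above):
+
+ ```python
+ def chunks_to_srt(chunks):
+     """Convert pipeline chunks to the SubRip (SRT) subtitle format."""
+     def fmt(t):  # seconds -> HH:MM:SS,mmm
+         hours, rest = divmod(int(t * 1000), 3_600_000)
+         minutes, rest = divmod(rest, 60_000)
+         seconds, millis = divmod(rest, 1_000)
+         return f'{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}'
+     entries = []
+     for idx, it in enumerate(chunks, start=1):
+         start, end = it['timestamp']
+         entries.append(f'{idx}\n{fmt(start)} --> {fmt(end)}\n{it["text"].strip()}\n')
+     return '\n'.join(entries)
+
+ print(chunks_to_srt(recognition_result['chunks']))
+ ```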
 
  ### Out-of-Scope Use