Update README.md
README.md
pip install --upgrade transformers datasets[audio] accelerate
```

I also recommend using [`whisper-lid`](https://github.com/bond005/whisper-lid) for initial spoken-language detection, so this library is worth installing as well:

```bash
pip install --upgrade whisper-lid
```

### Direct Use

#### Speech recognition

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio in an arbitrary language:

```python
import librosa  # for loading sound from a local file
from transformers import pipeline  # for working with Whisper-Podlodka-Turbo
import wget  # for downloading the demo sound from its URL
from whisper_lid.whisper_lid import detect_language_in_speech  # for spoken language detection

model_id = "bond005/whisper-podlodka-turbo"  # the best Whisper model :-)
target_sampling_rate = 16_000  # Hz

asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto')

# An example of speech recognition in Russian, spoken by a native speaker of this language
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
sound_ru_name = wget.download(sound_ru_url)
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format(
    sound_ru.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
    sound_ru,
    asr.feature_extractor,
    asr.tokenizer,
    asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
    print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
    sound_ru,
    generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
    return_timestamps=False
)
print(recognition_result['text'] + '\n')

# An example of speech recognition in English, pronounced by a non-native speaker of that language with an accent
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav'
sound_en_name = wget.download(sound_en_url)
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with English speech = {0:.3f} seconds.'.format(
    sound_en.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
    sound_en,
    asr.feature_extractor,
    asr.tokenizer,
    asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
    print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
    sound_en,
    generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
    return_timestamps=False
)
print(recognition_result['text'] + '\n')
```

As a result, you should see a text output like this:

```text
Duration of sound with Russian speech = 29.947 seconds.
Top-3 languages:
    russian 0.9568
    english 0.0372
  ukrainian 0.0013
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном.

Duration of sound with English speech = 20.247 seconds.
Top-3 languages:
    english 0.9526
    russian 0.0311
     polish 0.0006
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms.
```

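If you want to transcribe your own recording instead of the demo files, the same `asr` pipeline also accepts a path to a local audio file and can split long audio into overlapping windows. The following is only a minimal sketch under stated assumptions: `my_recording.wav` is a placeholder file name, the 30-second chunk length and the batch size are illustrative values, and decoding a file by path relies on `ffmpeg` being available on the system.

```python
# A minimal sketch (not from the model card): transcribing a long local recording.
# 'my_recording.wav' is a placeholder; chunk_length_s and batch_size are illustrative.
long_result = asr(
    'my_recording.wav',  # the pipeline decodes the file itself (requires ffmpeg)
    chunk_length_s=30,  # split long audio into 30-second windows
    batch_size=4,  # decode several windows in parallel
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=False
)
print(long_result['text'])
```
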
#### Speech recognition with timestamps

In addition to plain transcription, the model can also return timestamps for the recognized speech fragments:

```python
recognition_result = asr(
    sound_ru,
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=True
)
print('Recognized chunks of Russian speech:')
for it in recognition_result['chunks']:
    print(f' {it}')

recognition_result = asr(
    sound_en,
    generate_kwargs={'task': 'transcribe', 'language': 'english'},
    return_timestamps=True
)
print('\nRecognized chunks of English speech:')
for it in recognition_result['chunks']:
    print(f' {it}')
```

As a result, you should see a text output like this:

```text
Recognized chunks of Russian speech:
 {'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'}
 {'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'}
 {'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'}
 {'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'}
 {'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'}
 {'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'}
 {'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'}
 {'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'}

Recognized chunks of English speech:
 {'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'}
 {'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'}
```

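Each chunk is a plain Python dict with a `timestamp` pair (start and end, in seconds) and the recognized `text`, so the result is easy to post-process. As one possible illustration (not an API of this model, `transformers`, or `whisper-lid`), the hypothetical helper below converts the chunks into SubRip (SRT) subtitles; the helper name and the output file name are placeholders.

```python
# A minimal sketch: converting pipeline chunks into SRT subtitles.
# chunks_to_srt is a hypothetical helper, not part of any library used above.
def chunks_to_srt(chunks):
    def fmt(t):
        # format seconds as an SRT timestamp: HH:MM:SS,mmm
        ms = int(round(t * 1000))
        hours, rest = divmod(ms, 3_600_000)
        minutes, rest = divmod(rest, 60_000)
        secs, ms = divmod(rest, 1000)
        return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'

    srt_lines = []
    for idx, chunk in enumerate(chunks, start=1):
        start, end = chunk['timestamp']
        if end is None:  # the last chunk may have an open end
            end = start
        srt_lines.append(f'{idx}\n{fmt(start)} --> {fmt(end)}\n{chunk["text"].strip()}\n')
    return '\n'.join(srt_lines)

with open('subtitles.srt', 'w', encoding='utf-8') as fp:  # placeholder file name
    fp.write(chunks_to_srt(recognition_result['chunks']))
```
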
### Out-of-Scope Use