---
license: apache-2.0
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: automatic-speech-recognition
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
- yue
tags:
- audio
- automatic-speech-recognition
- speech-recognition
- whisper
- annthem
- qlip
- thestage
---

# Elastic model: Whisper Large v3. Fastest and most flexible models for self-hosting.

Elastic models are produced by TheStage AI's ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% accuracy degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation of less than 1.5%.

* __S__: The fastest model, with accuracy degradation of less than 2%.

__Goals of elastic models:__

* Provide flexibility in the cost-vs-quality trade-off for inference
* Provide clear quality and latency benchmarks for speech recognition
* Provide the familiar interface of the HF libraries `transformers` and `elastic_models`, so optimized versions can be used with a single-line code change (see the sketch after this list)
* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting
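
As a minimal sketch of that single-line change (the class and `mode` argument mirror the full example in the Inference section below):

```python
# Stock transformers usage:
# from transformers import WhisperForConditionalGeneration
# model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Elastic models usage -- same API, only the import changes,
# plus a `mode` argument selecting the optimized version:
from elastic_models.transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    mode="S",
)
```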

> It's important to note that we have consolidated all elastic model versions into a single optimized S model that provides the best balance of speed and quality for Whisper Large v3.


## Audio Examples

Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original. 

**Example Audio Transcriptions:**

| Audio Sample | Original Whisper Large v3 | Elastic S Model |
|---|---|---|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/io62uN1l-tpqigMlzQMlm.mpga"></audio> | joel keaton disapproved of films and buster also had reservations about the medium | joel keaton disapproved of films and buster also had reservations about the medium |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/CVabXfIP_Q5qxIjzoy5N6.mpga"></audio> | she ll be alright | she ll be alright |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/-fidVnQcCa32c7-2rNz-w.mpga"></audio> | all is well that ends well | all is well that ends well |

## Inference

To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.

**Example using `elastic_models` with the optimized model:**

```python
import torch
import librosa  # make sure this package is installed (pip install librosa)
from transformers import AutoProcessor
from transformers.pipelines import pipeline
from elastic_models.transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
mode = "S"

audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)

model = WhisperForConditionalGeneration.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=mode,
    device_map=device,
)
model.eval()

# Create pipeline
generator = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

# Load audio
audio, sr = librosa.load(audio_path, sr=16000)

print(f"Transcribing audio from: {audio_path}")

# Generate transcription using pipeline
generate_kwargs = {
    "max_new_tokens": 100,
    "num_beams": 1,
}

result = generator(
    audio,
    generate_kwargs=generate_kwargs,
)

transcription = result["text"]

print(f"Transcription: {transcription}")
```
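
For audio longer than Whisper's 30-second window, the standard `transformers` pipeline chunking should apply unchanged; `chunk_length_s` and `batch_size` below are stock pipeline arguments rather than anything elastic-specific, so treat this as a sketch:

```python
# Sketch: long-form transcription via the pipeline's built-in chunking.
# Reuses the `generator` created above. `chunk_length_s` splits the audio
# into 30 s windows; `batch_size` controls how many chunks run at once.
long_audio, _ = librosa.load("path_to_long_audio.wav", sr=16000)

result = generator(
    long_audio,
    chunk_length_s=30,
    batch_size=8,
    generate_kwargs={"max_new_tokens": 256, "num_beams": 1},
)
print(result["text"])
```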

__System requirements:__
* GPUs: NVIDIA GeForce RTX 4090, NVIDIA GeForce RTX 5090, NVIDIA H100, NVIDIA L40S
* CPU: AMD, Intel
* Python: 3.8-3.12 (check dependencies for specific versions)
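
Before installing, a quick sanity check that a supported NVIDIA GPU is visible (plain PyTorch, nothing elastic-specific):

```python
# Verify that CUDA and a supported GPU are available.
import torch

assert torch.cuda.is_available(), "No CUDA device found"
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
print(torch.version.cuda)             # CUDA version PyTorch was built against
```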

To work with our elastic models and compilation tools, install the `elastic_models` and `qlip` libraries from TheStage:

```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33 # for 4090
pip uninstall apex

# or for blackwell support
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# download the flash-attention wheel matching your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models and tools!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models accelerated with our algorithms.

### Quality benchmarks

Performance evaluation on standard speech recognition benchmarks:

| Metric/Model | S | Original |
|--------------|---|----------|
| WER (Common Voice) | 0.18 | 0.22 |

* **WER (Word Error Rate)**: The primary metric for evaluating speech recognition accuracy. Lower is better.
* **Common Voice**: Multilingual speech recognition benchmark covering diverse languages and accents.
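
For reference, WER can be computed with the `evaluate` library (a common choice; the exact evaluation setup and text normalization behind the table above are not specified here, so this is only a sketch):

```python
# Sketch: computing WER with the `evaluate` library (pip install evaluate jiwer).
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["joel keaton disapproved of films"]  # model outputs
references = ["joel keaton disapproved of film"]    # ground-truth transcripts

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")  # one wrong word out of five -> 0.20; lower is better
```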

### Latency benchmarks

Throughput for transcribing audio, measured in tokens per second (tps):

**Batch Size 1:**

| GPU Type | S | Original |
|----------|---|----------|
| H100 | 223.47 | 82.84 |
| L40S | 210.67 | 72.36 |
| GeForce RTX 4090 | 240 | 86.63 |
| GeForce RTX 5090 | 265.93 | 195.76 |
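
As a rough sketch of how a tokens-per-second figure like those above can be measured (our exact benchmarking harness is not shown here; warm-up, audio content, and token-counting conventions all affect the number):

```python
# Sketch: crude tokens-per-second measurement, reusing `model`, `processor`,
# `audio`, and `device` from the inference example above. Assumes a CUDA GPU.
import time

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.float16)

model.generate(input_features, max_new_tokens=100)  # warm-up run

torch.cuda.synchronize()
start = time.perf_counter()
generated = model.generate(input_features, max_new_tokens=100)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"~{generated.shape[-1] / elapsed:.1f} tokens/s")
```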

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: [email protected]