---
license: apache-2.0
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: automatic-speech-recognition
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
- yue
tags:
- audio
- automatic-speech-recognition
- speech-recognition
- whisper
- annthem
- qlip
- thestage
---

# Elastic model: Whisper Large v3. Fastest and most flexible models for self-hosting.

Elastic models are produced by TheStage AI's ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% accuracy degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation of less than 1.5%.

* __S__: The fastest model, with accuracy degradation of less than 2%.

__Goals of elastic models:__

* Provide flexibility in the cost-vs-quality trade-off for inference
* Provide clear quality and latency benchmarks for speech recognition
* Provide the familiar interface of the HF libraries `transformers` and `elastic_models`, so optimized versions can be used with a single-line code change (see the sketch after this list)
* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting
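
As a minimal sketch of that single-line change (the class and `mode` argument mirror the full example in the Inference section below):

```python
# Stock transformers usage:
# from transformers import WhisperForConditionalGeneration
# model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Elastic models usage -- same API, only the import changes,
# plus a `mode` argument selecting the optimized version:
from elastic_models.transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    mode="S",
)
```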

> It's important to note that we have consolidated all elastic model versions into a single optimized S model that provides the best balance of speed and quality for Whisper Large v3.


## Audio Examples

Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original. 

**Example Audio Transcriptions:**

| Audio Sample | Original Whisper Large v3 | Elastic S Model |
|---|---|---|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/io62uN1l-tpqigMlzQMlm.mpga"></audio> | joel keaton disapproved of films and buster also had reservations about the medium | joel keaton disapproved of films and buster also had reservations about the medium |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/CVabXfIP_Q5qxIjzoy5N6.mpga"></audio> | she ll be alright | she ll be alright |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/-fidVnQcCa32c7-2rNz-w.mpga"></audio> | all is well that ends well | all is well that ends well |

## Inference

To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.

**Example using `elastic_models` with the optimized model:**

```python
import torch
import librosa  # make sure this package is installed (pip install librosa)
from transformers import AutoProcessor
from transformers.pipelines import pipeline
from elastic_models.transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
mode = "S"

audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)

model = WhisperForConditionalGeneration.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=mode,
    device_map=device,
)
model.eval()

# Create pipeline
generator = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

# Load audio
audio, sr = librosa.load(audio_path, sr=16000)

print(f"Transcribing audio from: {audio_path}")

# Generate transcription using pipeline
generate_kwargs = {
    "max_new_tokens": 100,
    "num_beams": 1,
}

result = generator(
    audio,
    generate_kwargs=generate_kwargs,
)

transcription = result["text"]

print(f"Transcription: {transcription}")
```
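
For audio longer than Whisper's 30-second window, the standard `transformers` pipeline chunking should apply unchanged; `chunk_length_s` and `batch_size` below are stock pipeline arguments rather than anything elastic-specific, so treat this as a sketch:

```python
# Sketch: long-form transcription via the pipeline's built-in chunking.
# Reuses the `generator` created above. `chunk_length_s` splits the audio
# into 30 s windows; `batch_size` controls how many chunks run at once.
long_audio, _ = librosa.load("path_to_long_audio.wav", sr=16000)

result = generator(
    long_audio,
    chunk_length_s=30,
    batch_size=8,
    generate_kwargs={"max_new_tokens": 256, "num_beams": 1},
)
print(result["text"])
```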

__System requirements:__
* GPUs: NVIDIA GeForce RTX 4090, NVIDIA GeForce RTX 5090, NVIDIA H100, NVIDIA L40S
* CPU: AMD, Intel
* Python: 3.8-3.12 (check dependencies for specific versions)
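
Before installing, a quick sanity check that a supported NVIDIA GPU is visible (plain PyTorch, nothing elastic-specific):

```python
# Verify that CUDA and a supported GPU are available.
import torch

assert torch.cuda.is_available(), "No CUDA device found"
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
print(torch.version.cuda)             # CUDA version PyTorch was built against
```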

To work with our elastic models and compilation tools, install the `elastic_models` and `qlip` libraries from TheStage:

```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33 # for 4090
pip uninstall apex

# or for blackwell support
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# download the flash-attention wheel matching your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models and tools!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models accelerated with our algorithms.

### Quality benchmarks

Performance evaluation on standard speech recognition benchmarks:

| Metric/Model | S | Original |
|--------------|---|----------|
| WER (Common Voice) | 0.18 | 0.22 |

* **WER (Word Error Rate)**: The primary metric for evaluating speech recognition accuracy. Lower is better.
* **Common Voice**: Multilingual speech recognition benchmark covering diverse languages and accents.
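
For reference, WER can be computed with the `evaluate` library (a common choice; the exact evaluation setup and text normalization behind the table above are not specified here, so this is only a sketch):

```python
# Sketch: computing WER with the `evaluate` library (pip install evaluate jiwer).
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["joel keaton disapproved of films"]  # model outputs
references = ["joel keaton disapproved of film"]    # ground-truth transcripts

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")  # one wrong word out of five -> 0.20; lower is better
```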

### Latency benchmarks

Throughput for transcribing audio, measured in tokens per second (tps):

**Batch Size 1:**

| GPU Type | S | Original |
|----------|---|----------|
| H100 | 223.47 | 82.84 |
| L40S | 210.67 | 72.36 |
| GeForce RTX 4090 | 240 | 86.63 |
| GeForce RTX 5090 | 265.93 | 195.76 |
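
As a rough sketch of how a tokens-per-second figure like those above can be measured (our exact benchmarking harness is not shown here; warm-up, audio content, and token-counting conventions all affect the number):

```python
# Sketch: crude tokens-per-second measurement, reusing `model`, `processor`,
# `audio`, and `device` from the inference example above. Assumes a CUDA GPU.
import time

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.float16)

model.generate(input_features, max_new_tokens=100)  # warm-up run

torch.cuda.synchronize()
start = time.perf_counter()
generated = model.generate(input_features, max_new_tokens=100)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"~{generated.shape[-1] / elapsed:.1f} tokens/s")
```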

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: [email protected]