|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- openai/whisper-large-v3 |
|
base_model_relation: quantized |
|
pipeline_tag: automatic-speech-recognition |
|
language: |
|
- en |
|
- zh |
|
- de |
|
- es |
|
- ru |
|
- ko |
|
- fr |
|
- ja |
|
- pt |
|
- tr |
|
- pl |
|
- ca |
|
- nl |
|
- ar |
|
- sv |
|
- it |
|
- id |
|
- hi |
|
- fi |
|
- vi |
|
- he |
|
- uk |
|
- el |
|
- ms |
|
- cs |
|
- ro |
|
- da |
|
- hu |
|
- ta |
|
- no |
|
- th |
|
- ur |
|
- hr |
|
- bg |
|
- lt |
|
- la |
|
- mi |
|
- ml |
|
- cy |
|
- sk |
|
- te |
|
- fa |
|
- lv |
|
- bn |
|
- sr |
|
- az |
|
- sl |
|
- kn |
|
- et |
|
- mk |
|
- br |
|
- eu |
|
- is |
|
- hy |
|
- ne |
|
- mn |
|
- bs |
|
- kk |
|
- sq |
|
- sw |
|
- gl |
|
- mr |
|
- pa |
|
- si |
|
- km |
|
- sn |
|
- yo |
|
- so |
|
- af |
|
- oc |
|
- ka |
|
- be |
|
- tg |
|
- sd |
|
- gu |
|
- am |
|
- yi |
|
- lo |
|
- uz |
|
- fo |
|
- ht |
|
- ps |
|
- tk |
|
- nn |
|
- mt |
|
- sa |
|
- lb |
|
- my |
|
- bo |
|
- tl |
|
- mg |
|
- as |
|
- tt |
|
- haw |
|
- ln |
|
- ha |
|
- ba |
|
- jw |
|
- su |
|
- yue |
|
tags: |
|
- audio |
|
- automatic-speech-recognition |
|
- speech-recognition |
|
- whisper |
|
- annthem |
|
- qlip |
|
- thestage |
|
--- |
|
|
|
# Elastic model: Whisper Large v3. Fastest and most flexible models for self-hosting.
|
|
|
Elastic models are produced by TheStage AI ANNA, the Automated Neural Networks Accelerator. ANNA lets you control model size, latency and quality with a simple slider movement. For each base model, ANNA produces a series of optimized variants:
|
|
|
* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% accuracy degradation on the corresponding benchmarks.

* __M__: Faster model, with less than 1.5% accuracy degradation.

* __S__: The fastest model, with less than 2% accuracy degradation.
|
|
|
__Goals of elastic models:__ |
|
|
|
* Provide flexibility in choosing the cost vs. quality trade-off for inference

* Provide clear quality and latency benchmarks for speech recognition

* Provide the interface of the HF libraries `transformers` and `elastic_models`, so switching to an optimized version is a single-line code change (see the sketch after the note below)

* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT compilation

* Provide the best models and service for self-hosting
|
|
|
> Note: for Whisper Large v3 we have consolidated all elastic model versions into a single optimized S model, which provides the best balance of speed and quality.
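As a minimal sketch of that single-line change (using the same model name and mode as the full example in the Inference section below):

```python
# Original transformers import:
# from transformers import WhisperForConditionalGeneration

# Drop-in replacement from elastic_models -- the only line that changes:
from elastic_models.transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    mode="S",  # the consolidated optimized version
)
```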
|
|
|
|
|
## Audio Examples |
|
|
|
Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original. |
|
|
|
**Example Audio Transcriptions:** |
|
|
|
| Audio Sample | Original Whisper Large v3 | Elastic S Model |
|---|---|---|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/io62uN1l-tpqigMlzQMlm.mpga"></audio> | joel keaton disapproved of films and buster also had reservations about the medium | joel keaton disapproved of films and buster also had reservations about the medium |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/CVabXfIP_Q5qxIjzoy5N6.mpga"></audio> | she ll be alright | she ll be alright |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/-fidVnQcCa32c7-2rNz-w.mpga"></audio> | all is well that ends well | all is well that ends well |
|
## Inference |
|
|
|
To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.
|
|
|
**Example using `elastic_models` with the optimized model:** |
|
|
|
```python
import torch
import librosa  # make sure this package is installed (pip install librosa)
from transformers import AutoProcessor, pipeline
from elastic_models.transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
mode = "S"

audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)

model = WhisperForConditionalGeneration.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=mode,
    device_map=device,
)
model.eval()

# Create an ASR pipeline around the optimized model
generator = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

# Load audio; Whisper expects 16 kHz mono input
audio, sr = librosa.load(audio_path, sr=16000)

print(f"Transcribing audio from: {audio_path}")

# Generate transcription using the pipeline
generate_kwargs = {
    "max_new_tokens": 100,
    "num_beams": 1,  # greedy decoding
}

result = generator(
    audio,
    generate_kwargs=generate_kwargs,
)

transcription = result["text"]
print(f"Transcription: {transcription}")
```
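Whisper processes audio in 30-second windows. For longer recordings, the `transformers` ASR pipeline can chunk the input for you; a minimal sketch reusing the `generator` above (the `chunk_length_s` and `batch_size` values are illustrative, not tuned):

```python
# Chunk long audio into 30 s windows and batch them for throughput.
long_audio, _ = librosa.load("path_to_long_audio.wav", sr=16000)

result = generator(
    long_audio,
    chunk_length_s=30,  # split with striding so chunk boundaries overlap
    batch_size=8,       # transcribe several chunks per forward pass
    generate_kwargs={"num_beams": 1},
)
print(result["text"])
```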
|
|
|
__System requirements:__ |
|
* GPUs: NVIDIA GeForce RTX 4090, NVIDIA GeForce RTX 5090, H100, L40S

* CPUs: AMD, Intel

* Python: 3.8-3.12 (check dependencies for specific versions)
|
|
|
To work with our elastic models and compilation tools, you'll need to install the `elastic_models` and `qlip` libraries from TheStage:
|
|
|
```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33  # for 4090
pip uninstall -y apex

# Or, for Blackwell (RTX 50xx) support:
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Download the appropriate flash-attn wheel for your system from
# https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
# then rename it so pip accepts the local version tag:
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall -y apex
```
|
|
|
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token on your profile page. Set the API token as follows:
|
|
|
```shell |
|
thestage config set --api-token <YOUR_API_TOKEN> |
|
``` |
|
|
|
Congrats, now you can use accelerated models and tools! |
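As a quick sanity check of the installation (a minimal sketch; it only verifies that the packages import and a GPU is visible):

```python
import torch
import elastic_models  # installed via thestage-elastic-models

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```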
|
|
|
---- |
|
|
|
## Benchmarks |
|
|
|
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models accelerated with our algorithms.
|
|
|
### Quality benchmarks |
|
|
|
Performance evaluation on standard speech recognition benchmarks: |
|
|
|
| Metric/Model | S | Original |
|--------------|---|----------|
| WER (Common Voice) | 0.18 | 0.22 |
|
|
|
* **WER (Word Error Rate)**: The primary metric for evaluating speech recognition accuracy. Lower is better; see the computation sketch below.

* **Common Voice**: A multilingual speech recognition benchmark covering diverse languages and accents.
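For reference, WER can be computed with the `jiwer` package (a common choice for this metric, not a TheStage dependency):

```python
# pip install jiwer
from jiwer import wer

reference = "all is well that ends well"
hypothesis = "all is well that ends well"

# WER = (substitutions + deletions + insertions) / words in the reference
print(f"WER: {wer(reference, hypothesis):.2f}")  # 0.00 for an exact match
```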
|
|
|
### Latency benchmarks (tps)

Throughput for transcribing audio, measured in tokens per second (tps); higher is better:
|
|
|
**Batch Size 1:** |
|
|
|
| GPU Type | S | Original |
|----------|---|----------|
| H100 | 223.47 | 82.84 |
| L40S | 210.67 | 72.36 |
| GeForce RTX 4090 | 240.00 | 86.63 |
| GeForce RTX 5090 | 265.93 | 195.76 |
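As an illustration of how such numbers can be measured (a rough sketch, not our benchmarking harness; it reuses `model`, `processor`, `audio`, and `device` from the inference example above):

```python
import time

import torch

# Prepare Whisper input features for a single utterance
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device=device, dtype=torch.float16)

torch.cuda.synchronize()
start = time.perf_counter()
generated = model.generate(input_features, max_new_tokens=100, num_beams=1)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Tokens per second = generated sequence length / wall-clock time
print(f"{generated.shape[1] / elapsed:.2f} tokens/s")
```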
|
|
|
## Links |
|
|
|
* __Platform__: [app.thestage.ai](https://app.thestage.ai) |
|
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI) |
|
* __Contact email__: [email protected] |