---
license: apache-2.0
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: automatic-speech-recognition
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
- yue
tags:
- audio
- automatic-speech-recognition
- speech-recognition
- whisper
- annthem
- qlip
- thestage
---
# Elastic model: Whisper Large v3. Fastest and most flexible models for self-serving.
Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:
* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
* __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
* __M__: Faster model, with accuracy degradation less than 1.5%.
* __S__: The fastest model, with accuracy degradation less than 2%.
__Goals of elastic models:__
* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks for speech recognition
* Provide the familiar interface of HF libraries (`transformers`) via `elastic_models`, so that switching to an optimized version is a single-line code change (see the short sketch after the note below)
* Provide models supported on a wide range of hardware (NVIDIA GPUs), which are pre-compiled and require no JIT
* Provide the best models and service for self-hosting
> It's important to note that we have consolidated all elastic model versions into a single optimized S model that provides the best balance of speed and quality for Whisper Large v3.
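As a minimal sketch of that single-line change (the full, runnable example is in the Inference section below; `mode="S"` refers to the consolidated optimized variant described in the note above):
```python
import torch

# Baseline with stock Hugging Face transformers would be:
#   from transformers import WhisperForConditionalGeneration
# The elastic version keeps the same class name and adds a `mode` argument:
from elastic_models.transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    mode="S",  # the consolidated optimized variant
    # token=... may be required, as in the full example below
)
```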
## Audio Examples
Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original.
**Example Audio Transcriptions:**
| Audio Sample | Original Whisper Large v3 | Elastic S Model |
|---|---|---|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/io62uN1l-tpqigMlzQMlm.mpga"></audio> | joel keaton disapproved of films and buster also had reservations about the medium | joel keaton disapproved of films and buster also had reservations about the medium |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/CVabXfIP_Q5qxIjzoy5N6.mpga"></audio> | she ll be alright | she ll be alright |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/-fidVnQcCa32c7-2rNz-w.mpga"></audio> | all is well that ends well | all is well that ends well |
## Inference
To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.
**Example using `elastic_models` with the optimized model:**
```python
import torch
import librosa  # required for audio loading (`pip install librosa`)
from transformers import AutoProcessor
from transformers.pipelines import pipeline
from elastic_models.transformers import WhisperForConditionalGeneration
model_name = "openai/whisper-large-v3"
mode = "S"
audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)
model = WhisperForConditionalGeneration.from_pretrained(
model_name,
token=hf_token,
torch_dtype=torch.float16,
mode=mode,
device_map=device,
)
model.eval()
# Create pipeline
generator = pipeline(
task="automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=device,
)
# Load audio
audio, sr = librosa.load(audio_path, sr=16000)
print(f"Transcribing audio from: {audio_path}")
# Generate transcription using pipeline
generate_kwargs = {
"max_new_tokens": 100,
"num_beams": 1,
}
result = generator(
audio,
generate_kwargs=generate_kwargs,
)
transcription = result["text"]
print(f"Transcription: {transcription}")
```
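For audio longer than Whisper's 30-second context, the same pipeline can transcribe in chunks. The `chunk_length_s` and `return_timestamps` options below are standard `transformers` ASR pipeline arguments rather than anything specific to `elastic_models`; treat this as an untested sketch that reuses `generator` and `librosa` from the example above:
```python
# Long-form transcription sketch (reuses `generator` from the example above).
long_audio, _ = librosa.load("path_to_long_audio.wav", sr=16000)

result = generator(
    long_audio,
    chunk_length_s=30,       # process the audio in 30-second windows
    return_timestamps=True,  # also return per-segment timestamps
    generate_kwargs={"num_beams": 1},
)

print(result["text"])
for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```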
__System requirements:__
* GPUs: NVIDIA GeForce RTX 4090, GeForce RTX 5090, H100, L40S
* CPU: AMD, Intel
* Python: 3.8-3.12 (check dependencies for specific versions)
To work with our elastic models and compilation tools, install the `elastic_models` and `qlip` libraries from TheStage:
```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33 # for 4090
pip uninstall apex
# Or, for Blackwell (RTX 50-series) support:
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Download the flash-attention wheel appropriate for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall apex
```
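After installation, a quick sanity check can confirm that PyTorch sees a GPU and that the `elastic_models` package imports cleanly (a minimal sketch; exact version strings will differ by setup):
```python
# Post-install sanity check: CUDA visibility and package import.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The class used in the inference example above; an ImportError here usually
# means the extra-index-url install step did not complete.
from elastic_models.transformers import WhisperForConditionalGeneration  # noqa: F401
print("elastic_models import OK")
```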
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set the token as follows:
```shell
thestage config set --api-token <YOUR_API_TOKEN>
```
Congrats, now you can use accelerated models and tools!
----
## Benchmarks
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models optimized with our algorithms.
### Quality benchmarks
Performance evaluation on standard speech recognition benchmarks:
| Metric/Model | S | Original |
|--------------|---|----------|
| WER (Common Voice) | 0.18 | 0.22 |
* **WER (Word Error Rate)**: the primary metric for evaluating speech recognition accuracy; lower is better (see the short computation sketch after this list).
* **Common Voice**: Multilingual speech recognition benchmark covering diverse languages and accents.
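For reference, WER is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. A minimal, dependency-free sketch (the table above was likely produced with a standard scorer and text normalization, so this is only illustrative):
```python
# Minimal WER sketch: Levenshtein distance over words, normalized by
# reference length. Illustrative only; not the exact scoring script used
# for the benchmark table above.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion against six reference words -> ~0.33
print(wer("all is well that ends well", "all is good that ends"))
```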
### Latency benchmarks (tokens per second)
Throughput for transcribing audio, measured in tokens per second (tps); higher is better. A rough measurement sketch follows the table.
**Batch Size 1:**
| GPU Type | S | Original |
|----------|---|----------|
| H100 | 223.47 | 82.84 |
| L40S | 210.67 | 72.36 |
| GeForce RTX 4090 | 240 | 86.63 |
| GeForce RTX 5090 | 265.93 | 195.76 |
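As a rough sketch of how such a tokens-per-second number can be reproduced, assuming `model`, `processor`, `audio`, and `device` from the Inference section above (a real benchmark would warm up the GPU and average over many runs):
```python
# Rough tokens-per-second measurement (batch size 1, CUDA device assumed,
# per the system requirements above).
import time

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.float16)

torch.cuda.synchronize()
start = time.perf_counter()
generated_ids = model.generate(input_features, max_new_tokens=100, num_beams=1)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Counts all generated ids, including the decoder prompt tokens, so the
# figure is approximate rather than directly comparable to the table.
tokens = generated_ids.shape[-1]
print(f"{tokens} tokens in {elapsed:.3f} s -> {tokens / elapsed:.1f} tokens/s")
```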
## Links
* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: [email protected]