Elastic model: Whisper Large v3. Fastest and most flexible models for self-serving.

Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

XL: Mathematically equivalent neural network, optimized with our DNN compiler.
L: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
M: Faster model, with accuracy degradation less than 1.5%.
S: The fastest model, with accuracy degradation less than 2%.

Goals of elastic models:

Provide flexibility in cost vs quality selection for inference
Provide clear quality and latency benchmarks for speech recognition
Provide interface of HF libraries: transformers and elastic_models with a single line of code change for using optimized versions
Provide models supported on a wide range of hardware (NVIDIA GPUs), which are pre-compiled and require no JIT
Provide the best models and service for self-hosting

It's important to note that we have consolidated all elastic model versions into a single optimized S model that provides the best balance of speed and quality for Whisper Large v3.

Audio Examples

Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original.

Example Audio Transcriptions:

Audio Sample	Original Whisper Large v3	Elastic S Model
	joel keaton disapproved of films and buster also had reservations about the medium	joel keaton disapproved of films and buster also had reservations about the medium
	she ll be alright	she ll be alright
	all is well that ends well	all is well that ends well

Inference

To infer our Whisper models, you primarily use the elastic_models.transformers.WhisperForConditionalGeneration class.

Example using elastic_models with the optimized model:

import torch
import librosa  # check that you have this package installed
from transformers import AutoProcessor
from transformers.pipelines import pipeline
from elastic_models.transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
mode = "S"

audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)

model = WhisperForConditionalGeneration.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=mode,
    device_map=device,
)
model.eval()

# Create pipeline
generator = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

# Load audio
audio, sr = librosa.load(audio_path, sr=16000)

print(f"Transcribing audio from: {audio_path}")

# Generate transcription using pipeline
generate_kwargs = {
    "max_new_tokens": 100,
    "num_beams": 1,
}

result = generator(
    audio,
    generate_kwargs=generate_kwargs,
)

transcription = result["text"]

print(f"Transcription: {transcription}")

System requirements:

GPUs: NVIDIA GeForce 4090, NVIDIA GeForce 5090, H100, L40S
CPU: AMD, Intel
Python: 3.8-3.12 (check dependencies for specific versions)

To work with our elastic models and compilation tools, you'll need to install elastic_models and qlip libraries from TheStage:

pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33 # for 4090
pip uninstall apex

# or for blackwell support
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# please download the appropriate version of Wheels for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall apex

Then go to app.thestage.ai, login and generate API token from your profile page. Set up API token as follows:

thestage config set --api-token <YOUR_API_TOKEN>

Congrats, now you can use accelerated models and tools!

Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models using our algorithms.

Quality benchmarks

Performance evaluation on standard speech recognition benchmarks:

Metric/Model	S	Original
WER (Common Voice)	0.18	0.22

WER (Word Error Rate): The primary metric for evaluating speech recognition accuracy. Lower is better.
Common Voice: Multilingual speech recognition benchmark covering diverse languages and accents.

Latency benchmarks (tps)

Performance for transcribing audio (tps):

Batch Size 1:

GPU Type	S	Original
H100	223.47	82.84
L40S	210.67	72.36
GeForce RTX 4090	240	86.63
GeForce RTX 5090	265.93	195.76

Model tree for TheStageAI/Elastic-whisper-large-v3

Base model

openai/whisper-large-v3

Quantized

(16)

this model

Collection including TheStageAI/Elastic-whisper-large-v3

Elastic Transformers

Collection

Hugging Face Transformers models accelerated by TheStage AI ANNA: Automated NNs Accelerator. • 16 items • Updated 25 days ago • 1

TheStageAI
/

Elastic-whisper-large-v3