---
license: apache-2.0
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: automatic-speech-recognition
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
- yue
tags:
- audio
- automatic-speech-recognition
- speech-recognition
- whisper
- annthem
- qlip
- thestage
---
# Elastic model: Whisper Large v3. Fastest and most flexible models for self-serving.
Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:
* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
* __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
* __M__: Faster model, with accuracy degradation less than 1.5%.
* __S__: The fastest model, with accuracy degradation less than 2%.
__Goals of elastic models:__
* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks for speech recognition
* Provide the familiar interface of HF libraries (`transformers`) via `elastic_models`, so that switching to an optimized version is a single-line code change (see the short sketch after the note below)
* Provide models supported on a wide range of hardware (NVIDIA GPUs), which are pre-compiled and require no JIT
* Provide the best models and service for self-hosting
> It's important to note that we have consolidated all elastic model versions into a single optimized S model that provides the best balance of speed and quality for Whisper Large v3.
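As a minimal sketch of that single-line change (the full, runnable example is in the Inference section below; `mode="S"` refers to the consolidated optimized variant described in the note above):
```python
import torch

# Baseline with stock Hugging Face transformers would be:
#   from transformers import WhisperForConditionalGeneration
# The elastic version keeps the same class name and adds a `mode` argument:
from elastic_models.transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    mode="S",  # the consolidated optimized variant
    # token=... may be required, as in the full example below
)
```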
## Audio Examples
Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original.
**Example Audio Transcriptions:**
| Audio Sample | Original Whisper Large v3 | Elastic S Model |
|---|---|---|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/io62uN1l-tpqigMlzQMlm.mpga"></audio> | joel keaton disapproved of films and buster also had reservations about the medium | joel keaton disapproved of films and buster also had reservations about the medium |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/CVabXfIP_Q5qxIjzoy5N6.mpga"></audio> | she ll be alright | she ll be alright |
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/-fidVnQcCa32c7-2rNz-w.mpga"></audio> | all is well that ends well | all is well that ends well |
## Inference
To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.
**Example using `elastic_models` with the optimized model:**
```python
import torch
import librosa  # required for audio loading (`pip install librosa`)
from transformers import AutoProcessor
from transformers.pipelines import pipeline
from elastic_models.transformers import WhisperForConditionalGeneration
model_name = "openai/whisper-large-v3"
mode = "S"
audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)
model = WhisperForConditionalGeneration.from_pretrained(
model_name,
token=hf_token,
torch_dtype=torch.float16,
mode=mode,
device_map=device,
)
model.eval()
# Create pipeline
generator = pipeline(
task="automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=device,
)
# Load audio
audio, sr = librosa.load(audio_path, sr=16000)
print(f"Transcribing audio from: {audio_path}")
# Generate transcription using pipeline
generate_kwargs = {
"max_new_tokens": 100,
"num_beams": 1,
}
result = generator(
audio,
generate_kwargs=generate_kwargs,
)
transcription = result["text"]
print(f"Transcription: {transcription}")
```
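For audio longer than Whisper's 30-second context, the same pipeline can transcribe in chunks. The `chunk_length_s` and `return_timestamps` options below are standard `transformers` ASR pipeline arguments rather than anything specific to `elastic_models`; treat this as an untested sketch that reuses `generator` and `librosa` from the example above:
```python
# Long-form transcription sketch (reuses `generator` from the example above).
long_audio, _ = librosa.load("path_to_long_audio.wav", sr=16000)

result = generator(
    long_audio,
    chunk_length_s=30,       # process the audio in 30-second windows
    return_timestamps=True,  # also return per-segment timestamps
    generate_kwargs={"num_beams": 1},
)

print(result["text"])
for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```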
__System requirements:__
* GPUs: NVIDIA GeForce RTX 4090, GeForce RTX 5090, H100, L40S
* CPU: AMD, Intel
* Python: 3.8-3.12 (check dependencies for specific versions)
To work with our elastic models and compilation tools, install the `elastic_models` and `qlip` libraries from TheStage:
```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33 # for 4090
pip uninstall apex
# Or, for Blackwell (RTX 50-series) support:
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Download the flash-attention wheel appropriate for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall apex
```
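After installation, a quick sanity check can confirm that PyTorch sees a GPU and that the `elastic_models` package imports cleanly (a minimal sketch; exact version strings will differ by setup):
```python
# Post-install sanity check: CUDA visibility and package import.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The class used in the inference example above; an ImportError here usually
# means the extra-index-url install step did not complete.
from elastic_models.transformers import WhisperForConditionalGeneration  # noqa: F401
print("elastic_models import OK")
```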
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set the token as follows:
```shell
thestage config set --api-token <YOUR_API_TOKEN>
```
Congrats, now you can use accelerated models and tools!
----
## Benchmarks
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models optimized with our algorithms.
### Quality benchmarks
Performance evaluation on standard speech recognition benchmarks:
| Metric/Model | S | Original |
|--------------|---|----------|
| WER (Common Voice) | 0.18 | 0.22 |
* **WER (Word Error Rate)**: the primary metric for evaluating speech recognition accuracy; lower is better (see the short computation sketch after this list).
* **Common Voice**: Multilingual speech recognition benchmark covering diverse languages and accents.
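For reference, WER is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. A minimal, dependency-free sketch (the table above was likely produced with a standard scorer and text normalization, so this is only illustrative):
```python
# Minimal WER sketch: Levenshtein distance over words, normalized by
# reference length. Illustrative only; not the exact scoring script used
# for the benchmark table above.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion against six reference words -> ~0.33
print(wer("all is well that ends well", "all is good that ends"))
```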
### Latency benchmarks (tokens per second)
Throughput for transcribing audio, measured in tokens per second (tps); higher is better. A rough measurement sketch follows the table.
**Batch Size 1:**
| GPU Type | S | Original |
|----------|---|----------|
| H100 | 223.47 | 82.84 |
| L40S | 210.67 | 72.36 |
| GeForce RTX 4090 | 240 | 86.63 |
| GeForce RTX 5090 | 265.93 | 195.76 |
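As a rough sketch of how such a tokens-per-second number can be reproduced, assuming `model`, `processor`, `audio`, and `device` from the Inference section above (a real benchmark would warm up the GPU and average over many runs):
```python
# Rough tokens-per-second measurement (batch size 1, CUDA device assumed,
# per the system requirements above).
import time

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.float16)

torch.cuda.synchronize()
start = time.perf_counter()
generated_ids = model.generate(input_features, max_new_tokens=100, num_beams=1)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Counts all generated ids, including the decoder prompt tokens, so the
# figure is approximate rather than directly comparable to the table.
tokens = generated_ids.shape[-1]
print(f"{tokens} tokens in {elapsed:.3f} s -> {tokens / elapsed:.1f} tokens/s")
```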
## Links
* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: [email protected]