---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
base_model:
- openai/whisper-large-v3
tags:
- inference_endpoints
- audio
- transcription
---
# Inference Endpoint - Multilingual Audio Transcription with Whisper models
Deploy OpenAI's Whisper as an Inference Endpoint to transcribe audio files to text in many languages. The resulting deployment exposes an HTTP endpoint compatible with the OpenAI Platform Transcription API, which you can query using the OpenAI libraries or directly through cURL, for instance.
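For example, here is a minimal sketch using the official `openai` Python library, assuming a local deployment at `http://localhost:8000` and an endpoint that does not validate the API key:

```python
from openai import OpenAI

# Point the client at the deployed endpoint instead of the OpenAI Platform.
# base_url and api_key values are assumptions for a local deployment.
client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="unused")

with open("/path/to/audio/file", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # assumption: model id matches the deployed checkpoint
        file=audio,
        response_format="text",
    )

print(transcription)
```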
## Available Routes
| path | description |
|---|---|
| /api/v1/audio/transcriptions | Transcription endpoint to interact with the model |
| /docs | Visual documentation |
## Getting started
- Getting text output from an audio file

  ```bash
  curl http://localhost:8000/api/v1/audio/transcriptions \
    --request POST \
    --header 'Content-Type: multipart/form-data' \
    -F file=@</path/to/audio/file> \
    -F response_format=text
  ```
- Getting JSON output from an audio file

  ```bash
  curl http://localhost:8000/api/v1/audio/transcriptions \
    --request POST \
    --header 'Content-Type: multipart/form-data' \
    -F file=@</path/to/audio/file> \
    -F response_format=json
  ```
- Getting segmented JSON output from an audio file (see the parsing sketch after this list)

  ```bash
  curl http://localhost:8000/api/v1/audio/transcriptions \
    --request POST \
    --header 'Content-Type: multipart/form-data' \
    -F file=@</path/to/audio/file> \
    -F response_format=verbose_json
  ```
## Specifications
| spec | value | description |
|---|---|---|
| Engine | vLLM (v0.8.3) | Underlying inference engine leverages vLLM |
| Hardware | GPU (Ada Lovelace) | Requires the target endpoint to run on NVIDIA GPUs with compute capability 8.9 (Ada Lovelace) or newer |
| Compute data type | bfloat16 | Computations (matmuls, norms, etc.) are done using bfloat16 precision |
| KV cache data type | float8 (e4m3) | Key-value cache is stored on the GPU using float8 (`float8_e4m3`) precision to save space |
| PyTorch Compile | ✅ | Enables torch.compile to apply further optimizations to the model's execution |
| CUDA Graphs | ✅ | Enables so-called "CUDA Graphs" to reduce the overhead of launching GPU computations |