---
license: apache-2.0
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
model-index:
- name: Jivi-AudioX-South
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Vistaar Benchmark Tamil
type: vistaar
config: tamil
split: test
metrics:
- name: WER
type: wer
value: 21.79
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Vistaar Benchmark Telugu
type: vistaar
config: telugu
split: test
metrics:
- name: WER
type: wer
value: 24.63
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Vistaar Benchmark Kannada
type: vistaar
config: kannada
split: test
metrics:
- name: WER
type: wer
value: 17.61
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Vistaar Benchmark Malayalam
type: vistaar
config: malayalam
split: test
metrics:
- name: WER
type: wer
value: 26.92
pipeline_tag: automatic-speech-recognition
language:
- ta
- te
- kn
- ml
---
# AudioX: Multilingual Speech-to-Text Model
AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants—AudioX-North and AudioX-South—each optimized for a distinct set of Indian languages to ensure better accuracy. AudioX-North supports Hindi, Gujarati, and Marathi, while AudioX-South covers Tamil, Telugu, Kannada, and Malayalam. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models offer robust transcription capabilities across accents and acoustic conditions, delivering industry-leading performance across supported languages.
## Purpose-Built for Indian Languages
AudioX is designed to handle diverse Indian language inputs, supporting real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.
## Training Process
AudioX is fine-tuned using supervised learning on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.
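The exact augmentation recipe is not described here, but noise augmentation in ASR training typically mixes background or synthetic noise into the waveform at a controlled signal-to-noise ratio before feature extraction. The snippet below is a generic, illustrative sketch of that idea only; it is not Jivi AI's actual pipeline, and `add_noise` is a hypothetical helper:

```python
import numpy as np

def add_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Illustrative augmentation: mix white noise into a waveform at a target SNR (dB).

    Generic sketch only, not the actual AudioX training augmentation.
    """
    noise = np.random.randn(len(audio)).astype(audio.dtype)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))        # power implied by the target SNR
    noise *= np.sqrt(noise_power / (np.mean(noise ** 2) + 1e-12))
    return audio + noise
```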
## Data Preparation
The model is trained on:
- Open-source multilingual ASR corpora
- Proprietary Indian language medical datasets
This hybrid approach boosts the model’s generalization across dialects and acoustic conditions.
## Benchmarks
AudioX achieves top performance across multiple Indian languages, outperforming both open-source and commercial ASR models. We evaluated AudioX on the Vistaar Benchmark using the official evaluation script from AI4Bharat's Vistaar suite, ensuring a rigorous, standardized comparison across languages. All numbers below are word error rates (WER, %); lower is better.
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|---|---|---|---|---|---|---|---|---|---|
| Jivi AI | AudioX | 12.14 | 18.66 | 18.68 | 21.79 | 24.63 | 17.61 | 26.92 | 20.1 |
| ElevenLabs | Scribe-v1 | 13.64 | 17.96 | 16.51 | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
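Word error rate is the minimum number of word-level substitutions, insertions, and deletions needed to turn a hypothesis into the reference transcript, divided by the number of reference words. The benchmark numbers above come from the official Vistaar evaluation script; the function below is only a minimal illustration of the metric itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance (illustrative only;
    use the official Vistaar evaluation script for benchmark numbers)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```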
## 🔧 Try This Model
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
device = "cuda"  # use "cpu" if no GPU is available
processor = WhisperProcessor.from_pretrained("jiviai/audioX-south-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-south-v1").to(device)
model.config.forced_decoder_ids = None

# Load and preprocess audio (Whisper expects 16 kHz input)
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)
input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "mr", "gu" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="ta")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
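If you prefer a higher-level API, the same checkpoint should also work with the `transformers` `pipeline` helper, which handles feature extraction and resampling internally. This is a minimal sketch; it assumes ffmpeg is available for decoding the audio file:

```python
from transformers import pipeline

# High-level alternative to the manual processor/model calls above
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-south-v1",
    device=0,  # set to -1 (or remove) to run on CPU
)
result = asr("sample.wav", generate_kwargs={"task": "transcribe", "language": "ta"})
print(result["text"])
```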