AudioX: Multilingual Speech-to-Text Model
AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for higher accuracy. AudioX-North supports Hindi, Gujarati, and Marathi, while AudioX-South covers Tamil, Telugu, Kannada, and Malayalam. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance on the supported languages.
Purpose-Built for Indian Languages:
AudioX is designed to handle diverse Indian language inputs, supporting real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.
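Because each variant targets a different language set, callers typically route audio to the matching checkpoint by language. Below is a minimal routing sketch; note that the north checkpoint id `jiviai/audioX-north-v1` is an assumption mirroring the documented south variant's naming, so verify the exact repo id on the Hub:

```python
# Hypothetical routing helper: map ISO 639-1 language codes to the AudioX
# variant trained on that language.
# NOTE: "jiviai/audioX-north-v1" is an ASSUMED repo id inferred from the
# documented "jiviai/audioX-south-v1"; confirm the exact name on the Hub.
AUDIOX_VARIANTS = {
    "hi": "jiviai/audioX-north-v1",  # Hindi
    "gu": "jiviai/audioX-north-v1",  # Gujarati
    "mr": "jiviai/audioX-north-v1",  # Marathi
    "ta": "jiviai/audioX-south-v1",  # Tamil
    "te": "jiviai/audioX-south-v1",  # Telugu
    "kn": "jiviai/audioX-south-v1",  # Kannada
    "ml": "jiviai/audioX-south-v1",  # Malayalam
}

def pick_checkpoint(lang: str) -> str:
    """Return the AudioX checkpoint covering `lang`, or raise if unsupported."""
    try:
        return AUDIOX_VARIANTS[lang]
    except KeyError:
        raise ValueError(f"Unsupported language code: {lang!r}") from None
```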
Training Process:
AudioX is fine-tuned using supervised learning on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.
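Jivi AI's training code is not public, but noise augmentation of this kind is commonly implemented by mixing background noise into the clean waveform at a random signal-to-noise ratio. An illustrative sketch, not the actual pipeline:

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    # Scale the noise so the speech/noise power ratio equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)  # stand-in for 1 s of 16 kHz audio
noise = rng.standard_normal(8000).astype(np.float32)
augmented = add_noise(speech, noise, snr_db=rng.uniform(5, 20))
```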
Data Preparation:
The model is trained on:
- Open-source multilingual ASR corpora
- Proprietary Indian-language medical speech datasets
This hybrid approach boosts the model’s generalization across dialects and acoustic conditions.
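As a rough illustration of how such corpus mixing and language balancing can be set up with 🤗 `datasets`, the sketch below interleaves an open corpus with a hypothetical in-domain set using explicit sampling probabilities; Common Voice stands in for the open-source corpora, and the proprietary data is not public:

```python
from datasets import interleave_datasets, load_dataset

# Open-source multilingual ASR corpus; Common Voice Tamil as an illustrative stand-in.
open_source = load_dataset("mozilla-foundation/common_voice_17_0", "ta", split="train")

# Hypothetical local folder standing in for the proprietary medical corpus (not public).
proprietary = load_dataset("audiofolder", data_dir="proprietary_medical_ta", split="train")

# Balance domains by drawing with fixed probabilities rather than by raw corpus size,
# which oversamples the smaller in-domain set.
train_mix = interleave_datasets(
    [open_source, proprietary],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="all_exhausted",
)
```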
Benchmarks:
AudioX achieves the lowest average word error rate (WER, reported in percent; lower is better) across the supported Indian languages, outperforming both open-source and commercial ASR models. We evaluated AudioX on the Vistaar benchmark using the official evaluation script provided by AI4Bharat's Vistaar suite, ensuring a rigorous, standardized comparison across diverse language scenarios.
Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
---|---|---|---|---|---|---|---|---|---|
Jivi AI | AudioX | 12.14 | 18.66 | 18.68 | 21.79 | 24.63 | 17.61 | 26.92 | 20.1 |
ElevenLabs | Scribe-v1 | 13.64 | 17.96 | 16.51 | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
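For a quick sanity check outside the official Vistaar script, WER can also be computed directly with the `jiwer` package. A minimal sketch:

```python
import jiwer

references = ["this is a reference transcript"]
hypotheses = ["this is the reference transcript"]

# jiwer returns a fraction; multiply by 100 to match the table's percentages.
wer_percent = 100 * jiwer.wer(references, hypotheses)
print(f"WER: {wer_percent:.2f}%")  # 20.00% (1 substitution over 5 reference words)
```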
🔧 Try This Model
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-south-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-south-v1").to(device)
model.config.forced_decoder_ids = None

# Load and preprocess audio (the model expects 16 kHz mono input)
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)
input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").to(device).input_features

# Generate predictions
# Use ISO 639-1 language codes: "hi", "mr", "gu" for North; "ta", "te", "kn", "ml" for South.
# Omit the language argument for automatic language detection.
predicted_ids = model.generate(input_features, task="transcribe", language="ta")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
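The example above transcribes a single short clip. Whisper-style models process audio in 30-second windows, so for longer recordings the 🤗 `pipeline` API is a convenient alternative that chunks and stitches automatically. A sketch using the same checkpoint:

```python
from transformers import pipeline

# chunk_length_s splits long audio into overlapping 30 s windows and stitches
# the partial transcripts back together.
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-south-v1",
    chunk_length_s=30,
    device=0,  # set to -1 for CPU
)
result = asr("long_recording.wav", generate_kwargs={"task": "transcribe", "language": "ta"})
print(result["text"])
```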