---
license: apache-2.0
tags:
  - automatic-speech-recognition
  - audio
  - speech
  - whisper
  - multilingual
model-index:
  - name: Jivi-AudioX-North
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vistaar Benchmark Hindi
          type: vistaar
          config: hindi
          split: test
        metrics:
          - name: WER
            type: wer
            value: 12.14
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vistaar Benchmark Gujarati
          type: vistaar
          config: gujarati
          split: test
        metrics:
          - name: WER
            type: wer
            value: 18.66
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vistaar Benchmark Marathi
          type: vistaar
          config: marathi
          split: test
        metrics:
          - name: WER
            type: wer
            value: 18.68
language:
  - hi
  - gu
  - mr
pipeline_tag: automatic-speech-recognition
---

# AudioX: Multilingual Speech-to-Text Model

AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for better accuracy. AudioX-North supports Hindi, Gujarati, and Marathi, while AudioX-South covers Tamil, Telugu, Kannada, and Malayalam. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance in the supported languages.

## Purpose-Built for Indian Languages

AudioX is designed to handle diverse Indian language inputs, supporting real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.

## Training Process

AudioX is fine-tuned using supervised learning on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.
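
The exact recipe is not published, but the noise-augmentation idea is easy to illustrate. The sketch below mixes background noise into a clip at a chosen signal-to-noise ratio before feature extraction; it is an illustration only, not the actual AudioX training pipeline:

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the given signal-to-noise ratio (in dB).

    Illustrative augmentation only; the actual AudioX pipeline is not published.
    """
    # Tile or trim the noise so it matches the speech length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the mixture reaches the requested SNR
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    return speech + noise

# Example: augment a 16 kHz training clip at a random SNR between 5 and 20 dB
# speech, noise = ...  # mono numpy arrays at the same sampling rate
# augmented = add_noise(speech, noise, snr_db=np.random.uniform(5, 20))
```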

## Data Preparation

The model is trained on:

- Open-source multilingual ASR corpora
- Proprietary Indian-language medical datasets

This hybrid approach boosts the model’s generalization across dialects and acoustic conditions.
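
As an illustration of how a hybrid, language-balanced mixture can be assembled with the 🤗 `datasets` library (all dataset names and sampling probabilities below are placeholders, not the corpora actually used):

```python
from datasets import load_dataset, interleave_datasets, Audio

# Placeholder dataset names -- the actual AudioX training corpora are not published.
hindi = load_dataset("your-org/open-asr-hindi", split="train")
gujarati = load_dataset("your-org/open-asr-gujarati", split="train")
marathi = load_dataset("your-org/open-asr-marathi", split="train")
medical = load_dataset("your-org/proprietary-medical-audio", split="train")

# Interleave sources so no single language or domain dominates training batches
mixed = interleave_datasets(
    [hindi, gujarati, marathi, medical],
    probabilities=[0.3, 0.25, 0.25, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# Resample everything to the 16 kHz rate Whisper-style models expect
mixed = mixed.cast_column("audio", Audio(sampling_rate=16000))
```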

## Benchmarks

AudioX achieves top performance across multiple Indian languages, outperforming both open and commercial ASR models. We evaluated AudioX on the Vistaar Benchmark using the official evaluation script provided by AI4Bharat’s Vistaar suite, ensuring rigorous, standardized comparison across diverse language scenarios.

Word error rate (WER, %) on the Vistaar Benchmark test sets; lower is better.

| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|---|---|---|---|---|---|---|---|---|---|
| Jivi AI | AudioX | 12.14 | 18.66 | 18.68 | 21.79 | 24.63 | 17.61 | 26.92 | 20.1 |
| ElevenLabs | Scribe-v1 | 13.64 | 17.96 | 16.51 | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
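
For a rough local cross-check of the metric, WER can also be computed with the 🤗 `evaluate` library. Note that the official Vistaar script applies its own text normalization, so a naive computation like the one below will not reproduce the table exactly; the sentences are placeholder examples:

```python
import evaluate

# Placeholder reference/hypothesis pairs for illustration only
wer_metric = evaluate.load("wer")

references = ["नमस्ते आप कैसे हैं", "आज मौसम अच्छा है"]
predictions = ["नमस्ते आप कैसे है", "आज मौसम अच्छा है"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")
```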

## 🔧 Try This Model

You can easily run inference using the 🤗 transformers and librosa libraries. Here's a minimal example to get started:

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load and resample audio to the 16 kHz rate the model expects
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "mr", "gu" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
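
For longer recordings, the `transformers` pipeline API with chunking is often more convenient. A minimal sketch, assuming the checkpoint works with the standard ASR pipeline (the chunk length, batch size, and file name are illustrative):

```python
import torch
from transformers import pipeline

# Chunked long-form transcription; chunk_length_s and batch_size are illustrative
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    chunk_length_s=30,
    device=0 if torch.cuda.is_available() else -1,
)

result = asr(
    "long_recording.wav",  # placeholder file path
    batch_size=8,
    generate_kwargs={"task": "transcribe", "language": "hi"},
)
print(result["text"])
```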