---
language:
  - hi
  - bn
  - te
  - ta
  - gu
  - kn
  - ml
  - or
  - pa
  - as
library_name: transformers
base_model: google/mt5-small
tags:
  - text-generation
  - asr-error-correction
  - indic-languages
  - multilingual
  - mt5
datasets:
  - indicvoice
pipeline_tag: text-generation
license: apache-2.0
model-index:
  - name: MT5-SMALL-indicvoice-ic-63500-mt5-small
    results: []
---

# MT5-SMALL-indicvoice-ic-63500-mt5-small

This is an mT5-small model fine-tuned on Indic ASR data with Hugging Face Transformers. Given a raw ASR hypothesis as input, it generates a corrected transcription, fixing post-ASR errors across ten Indic languages.

Fine-tuning multilingual models such as mT5 on domain-specific data like ASR outputs improves performance, especially in noisy or low-resource settings, and these models generalize well across varied languages and dialects when backed by a high-quality fine-tuning dataset.
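
As a quick illustration of the task, the sketch below runs a single hypothesis through the model and decodes the corrected transcription. The Hindi sentence is an invented placeholder, not an example from the training data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Invented noisy ASR hypothesis (Hindi); replace with real ASR output.
hypothesis = "मुझे यह गाना बहुत पसंद हे"

inputs = tokenizer(hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```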

## 📚 Datasets Used

- Indicvoice

## 🎧 Transcription Models

- Indic Conformer (IC)

βš™οΈ Training Info

  • Trained over a period of 6 months
  • Used A100 GPUs
  • Developed as part of a research collaboration with IIT Bombay
  • Focused on improving transcription accuracy of ASR systems in Indic languages
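
The exact hyperparameters behind this checkpoint are not published in this card. For readers who want to reproduce a comparable setup, here is a minimal `Seq2SeqTrainer` sketch; every hyperparameter value and the placeholder data are illustrative assumptions, not the settings used for this model:

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer,
)

model_id = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Placeholder pairs; the real data is (ASR hypothesis, reference) pairs.
train_dataset = Dataset.from_dict({
    "input":  ["<asr hypothesis 1>", "<asr hypothesis 2>"],
    "target": ["<reference 1>", "<reference 2>"],
})

def preprocess(batch):
    # Tokenize hypotheses as inputs and references as labels.
    model_inputs = tokenizer(batch["input"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-asr-correction",
    per_device_train_batch_size=16,  # illustrative value
    learning_rate=3e-4,              # illustrative value
    num_train_epochs=3,              # illustrative value
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.map(preprocess, batched=True),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```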

## 🚀 Usage

```python
import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Select GPU if available; the functions below rely on this module-level device.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def load_model_and_tokenizer(model_path, tokenizer_path):
    print("Loading model and tokenizer...")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    return model, tokenizer

def run_inference(input_csv_path, output_csv_path, model, tokenizer):
    print(f"Loading data from {input_csv_path}...")
    # Expects a headerless two-column CSV (see the format note below).
    data_df = pd.read_csv(input_csv_path, header=None)
    data_df.columns = ['Hypothesis', 'Corrected Hypothesis']

    dataset = Dataset.from_pandas(
        data_df.rename(columns={'Hypothesis': 'input', 'Corrected Hypothesis': 'target'})
    )

    predictions = []

    print("Running inference...")
    for item in dataset:
        input_text = item['input']

        # Tokenize one hypothesis at a time and generate a corrected transcription.
        input_ids = tokenizer(
            input_text, return_tensors="pt", padding=True, truncation=True
        ).input_ids.to(device)
        outputs = model.generate(input_ids, max_length=512)
        decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

        predictions.append(decoded_output)

    data_df['Predictions'] = predictions
    data_df.to_csv(output_csv_path, index=False)
    print(f"Predictions saved to {output_csv_path}")

# Usage
model_path = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"  # Hugging Face model path
tokenizer_path = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"  # Same as model path (adjust if tokenizer is separate)

model, tokenizer = load_model_and_tokenizer(model_path, tokenizer_path)

input_csv = "your_input.csv"  # Replace with your input file path
output_csv = "predictions.csv"  # Replace with your desired output file path
run_inference(input_csv, output_csv, model, tokenizer)
```
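
The script reads a headerless two-column CSV: column one holds the raw ASR hypothesis, column two the reference transcription (carried along for alignment; it is not used during generation). A tiny illustrative input file, with invented placeholder sentences, can be created like this:

```python
import pandas as pd

# Two headerless columns: ASR hypothesis, reference transcription.
rows = [
    ("मुझे यह गाना बहुत पसंद हे", "मुझे यह गाना बहुत पसंद है"),
    ("আমি ভাত খাই", "আমি ভাত খাই"),
]
pd.DataFrame(rows).to_csv("your_input.csv", index=False, header=False)
```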

Note: Adjust `tokenizer_path` if your tokenizer files are in a separate location or subdirectory.
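
For large CSVs, generating one example at a time is slow. A possible speed-up, not part of the original script, is to batch hypotheses and pad them together; the batch size below is an arbitrary assumption:

```python
import torch

def run_inference_batched(texts, model, tokenizer, device, batch_size=32):
    # Greedy batched generation; batch_size is an illustrative default.
    predictions = []
    model.eval()
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            output_ids = model.generate(**enc, max_length=512)
        predictions.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return predictions
```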