	---
	language:
	- hi
	- bn
	- te
	- ta
	- gu
	- kn
	- ml
	- or
	- pa
	- as
	library_name: transformers
	base_model: google/mt5-small
	tags:
	- text-generation
	- asr-error-correction
	- indic-languages
	- multilingual
	- mt5
	datasets:
	- indicvoice
	pipeline_tag: text-generation
	license: apache-2.0
	model-index:
	- name: MT5-SMALL-indicvoice-ic-63500-mt5-small
	  results: []
	---

	# MT5-SMALL-indicvoice-ic-63500-mt5-small

	This is a fine-tuned mT5 Small model trained on Indic ASR data using Hugging Face Transformers.
	The goal of this model is to correct post-ASR transcription errors in Indic languages.

	Fine-tuning a multilingual model such as mT5 on domain-specific data, here ASR hypotheses paired with their corrected transcripts, can improve performance, especially on noisy or low-resource input. Given a high-quality fine-tuning dataset, such models can generalize across varied languages and dialects.

	### 📚 Datasets Used
	- Indicvoice

	### 🎧 Transcription Models
	- Indic Conformer (IC)

	---

	### ⚙️ Training Info

	- Trained over a period of 6 months
	- Used A100 GPUs
	- Developed as part of a research collaboration with IIT Bombay
	- Focused on improving transcription accuracy of ASR systems in Indic languages
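Transcription accuracy for ASR output is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. This card does not state which metric was used during development, so the following is only an illustrative sketch; the function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of four in the raw ASR output.
reference = "मेरा नाम राहुल है"
asr_output = "मेरा नाम राहुल हे"
print(word_error_rate(reference, asr_output))  # 0.25
print(word_error_rate(reference, reference))   # 0.0
```

Comparing the WER of the raw ASR hypothesis with the WER of the model's prediction, both against the same reference, gives a rough view of how much the correction step helps.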

	### 🚀 Usage

	```python
	import pandas as pd
	import torch
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Use a GPU when one is available; otherwise fall back to CPU.
	device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

	def load_model_and_tokenizer(model_path, tokenizer_path):
	    print("Loading model and tokenizer...")
	    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
	    model.eval()  # inference only
	    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
	    return model, tokenizer

	def run_inference(input_csv_path, output_csv_path, model, tokenizer):
	    print(f"Loading data from {input_csv_path}...")
	    # The CSV has no header row: column 0 is the ASR hypothesis,
	    # column 1 is the corrected (reference) transcript.
	    data_df = pd.read_csv(input_csv_path, header=None)
	    data_df.columns = ['Hypothesis', 'Corrected Hypothesis']

	    predictions = []

	    print("Running inference...")
	    for input_text in data_df['Hypothesis']:
	        inputs = tokenizer(input_text, return_tensors="pt", truncation=True).to(device)
	        with torch.no_grad():  # gradients are not needed for generation
	            outputs = model.generate(**inputs, max_length=512)
	        decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
	        predictions.append(decoded_output)

	    # Write the original columns plus the model's corrections.
	    data_df['Predictions'] = predictions
	    data_df.to_csv(output_csv_path, index=False)
	    print(f"Predictions saved to {output_csv_path}")

	# Usage
	model_path = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"  # Hugging Face model path
	tokenizer_path = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"  # Same as model path (adjust if tokenizer is separate)

	model, tokenizer = load_model_and_tokenizer(model_path, tokenizer_path)

	input_csv = "your_input.csv"  # Replace with your input file path
	output_csv = "predictions.csv"  # Replace with your desired output file path
	run_inference(input_csv, output_csv, model, tokenizer)
	```

	Note: Adjust the `tokenizer_path` if your tokenizer files are in a separate location or subdirectory.
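
The usage script reads a headerless CSV with exactly two columns: the ASR hypothesis first, then its corrected transcript. As a minimal sketch of preparing such a file, the snippet below writes one with Python's standard `csv` module. The helper name `write_inference_csv` and the sample sentence pairs are hypothetical placeholders, not data from the training set.

```python
import csv

def write_inference_csv(pairs, path):
    """Write (hypothesis, corrected) pairs as a headerless two-column CSV."""
    # newline="" lets the csv module control line endings;
    # UTF-8 keeps Indic scripts intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for hypothesis, corrected in pairs:
            writer.writerow([hypothesis, corrected])

# Made-up placeholder pairs for illustration only.
sample_pairs = [
    ("मेरा नाम राहुल हे", "मेरा नाम राहुल है"),
    ("आज मोसम अच्छा है", "आज मौसम अच्छा है"),
]

write_inference_csv(sample_pairs, "your_input.csv")
```

The resulting `your_input.csv` matches the `input_csv` path used in the script. The second column is not used for generation; it is only carried through to the output file alongside the predictions.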