---
language:
- hi
- bn
- te
- ta
- gu
- kn
- ml
- or
- pa
- as
library_name: transformers
base_model: google/mt5-small
tags:
- text-generation
- asr-error-correction
- indic-languages
- multilingual
- mt5
datasets:
- indicvoice
pipeline_tag: text-generation
license: apache-2.0
model-index:
- name: MT5-SMALL-indicvoice-ic-63500-mt5-small
  results: []
---

# MT5-SMALL-indicvoice-ic-63500-mt5-small

This is a fine-tuned mT5-small model (`google/mt5-small`) trained on Indic ASR output with Hugging Face Transformers. Its goal is to correct post-ASR transcription errors in Indic languages.

Fine-tuning a multilingual model such as mT5 on domain-specific data such as ASR output can improve performance, especially in noisy or low-resource settings. Given high-quality fine-tuning data, such models can generalize across varied languages and dialects.

### Datasets Used

- Indicvoice

### Transcription Models

- Indic Conformer (IC)

---

### Training Info

- Trained over a period of 6 months
- Used A100 GPUs
- Developed as part of a research collaboration with **IIT Bombay**
- Focused on improving transcription accuracy of ASR systems in Indic languages

### Usage

```python
import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use a GPU when one is available, otherwise fall back to the CPU.
# Defined before the functions below because they rely on it.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


def load_model_and_tokenizer(model_path, tokenizer_path):
    print("Loading model and tokenizer...")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
    model.eval()  # inference only: disable dropout
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    return model, tokenizer


def run_inference(input_csv_path, output_csv_path, model, tokenizer):
    print(f"Loading data from {input_csv_path}...")
    # The input CSV has no header row and exactly two columns:
    # the raw ASR hypothesis and its corrected (target) text.
    data_df = pd.read_csv(input_csv_path, header=None)
    data_df.columns = ['Hypothesis', 'Corrected Hypothesis']

    dataset = Dataset.from_pandas(data_df.rename(columns={'Hypothesis': 'input', 'Corrected Hypothesis': 'target'}))

    predictions = []

    print("Running inference...")
    for item in dataset:
        input_text = item['input']

        # Tokenize one hypothesis at a time, truncating long inputs to 512 tokens.
        input_ids = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).input_ids.to(device)
        outputs = model.generate(input_ids, max_length=512)
        decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

        predictions.append(decoded_output)

    # Write the original columns plus the model output to the output CSV.
    data_df['Predictions'] = predictions
    data_df.to_csv(output_csv_path, index=False)
    print(f"Predictions saved to {output_csv_path}")


# Usage
model_path = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"  # Hugging Face model path
tokenizer_path = "cazzz307/MT5-SMALL-indicvoice-ic-63500-mt5-small"  # Same as model path (adjust if tokenizer is separate)

model, tokenizer = load_model_and_tokenizer(model_path, tokenizer_path)

input_csv = "your_input.csv"  # Replace with your input file path
output_csv = "predictions.csv"  # Replace with your desired output file path
run_inference(input_csv, output_csv, model, tokenizer)
```

**Note**: Adjust the `tokenizer_path` if your tokenizer files are in a separate location or subdirectory.
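
The usage script reads its input CSV without a header row and expects exactly two columns: the ASR hypothesis and its corrected text. As a minimal sketch of that layout, the snippet below writes such a file with Python's standard `csv` module. The file name `sample_input.csv`, the helper `write_input_csv`, and the two Hindi rows are made-up placeholders for illustration, not samples from the training data.

```python
import csv

# Hypothetical example rows: (raw ASR hypothesis, corrected text).
# These sentences are illustrative placeholders, not taken from Indicvoice.
rows = [
    ("मै घर जा रहा हु", "मैं घर जा रहा हूँ"),
    ("आज मोसम अच्छा हे", "आज मौसम अच्छा है"),
]


def write_input_csv(path, pairs):
    """Write (hypothesis, correction) pairs as a header-less two-column CSV."""
    # newline="" lets the csv module control line endings;
    # UTF-8 keeps the Indic text intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(pairs)


write_input_csv("sample_input.csv", rows)
```

Passing `sample_input.csv` as `input_csv` should then satisfy the two-column assignment in `run_inference`.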
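
Because `predictions.csv` keeps the target column next to the model output, one way to check whether correction helps is to compare word error rate (WER) before and after. The sketch below is a plain-Python WER built on word-level edit distance. The function names are illustrative, whitespace tokenization is an assumption, and nothing here reproduces an evaluation reported for this model.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: words are separated by whitespace.
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    return errors / words if words else 0.0
```

With the script's output loaded through `pd.read_csv("predictions.csv")` into `df`, and assuming no empty cells, `corpus_wer(df["Corrected Hypothesis"], df["Hypothesis"])` gives the WER of the raw ASR output and `corpus_wer(df["Corrected Hypothesis"], df["Predictions"])` gives the WER after correction.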