BYT5-SMALL-IndicVoice-with-different-models-hypothesis-IC-W2V-other-ASR-dataset-W2V

This is a fine-tuned ByT5 Small model trained on Indic ASR data using Hugging Face Transformers. The goal of this model is to correct post-ASR transcription errors in Indic languages.

Fine-tuning a multilingual model such as ByT5 on domain-specific data such as ASR output can improve performance, especially in noisy or low-resource settings. Given a high-quality fine-tuning dataset, such models can generalize across varied languages and dialects.

📚 Datasets Used

  • IndicVoice

🎧 Transcription Models

  • Indic Conformer (IC)
  • Wav2Vec 2.0 (W2V)

🔬 Mixed Dataset Hypothesis

  • This model uses a combination of the Kathbath, Shrutilipi, and ITTM datasets.
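
The card does not say how hypotheses from the different ASR systems and corpora were combined. As a purely illustrative sketch, assuming each source yields (hypothesis, reference) text pairs, pooling them into one training table might look like this. The function `pool_pairs`, the source labels, and the sample rows are hypothetical and are not the author's pipeline.

```python
import random

def pool_pairs(sources, seed=0):
    """Merge (hypothesis, reference) pairs from several sources into one shuffled list.

    `sources` maps a source label (hypothetical, e.g. "IndicVoice/IC") to a list of pairs.
    Exact duplicate pairs are kept only once.
    """
    seen = set()
    pooled = []
    for label, pairs in sources.items():
        for hyp, ref in pairs:
            key = (hyp, ref)
            if key in seen:
                continue
            seen.add(key)
            pooled.append({"source": label, "input": hyp, "target": ref})
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(pooled)
    return pooled

# Placeholder rows for illustration only.
sources = {
    "IndicVoice/IC": [("hypothesis a", "reference a")],
    "IndicVoice/W2V": [("hypothesis b", "reference a"), ("hypothesis a", "reference a")],
}
pooled = pool_pairs(sources)
```

The `input` and `target` keys mirror the column names used in the usage script; the `source` key is an extra assumption that makes it possible to inspect results per ASR system later.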

βš™οΈ Training Info

  • Trained over a period of 6 months
  • Used A100 GPUs
  • Developed as part of a research collaboration with IIT Bombay
  • Focused on improving transcription accuracy of ASR systems in Indic languages
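
The card reports no evaluation metric. One common way to quantify transcription accuracy is word error rate (WER) or character error rate (CER), computed for the raw ASR hypothesis and again for the model's correction. The sketch below is a minimal pure-Python version under that assumption; the helper names are illustrative and it is not the evaluation code used for this model.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, r in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, h in enumerate(hyp_tokens, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate over Unicode code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Comparing `wer(reference, asr_hypothesis)` with `wer(reference, corrected_output)` over a test set shows whether correction helped. Note that `cer` counts code points, so in Indic scripts a consonant with a vowel sign counts as more than one character.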

🚀 Usage

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use a GPU when one is available; otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model_and_tokenizer(model_path, tokenizer_path):
    print("Loading model and tokenizer...")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
    model.eval()  # inference only
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    return model, tokenizer

def run_inference(input_csv_path, output_csv_path, model, tokenizer):
    # The input CSV must have no header row and exactly two columns:
    # the ASR hypothesis and its corrected text.
    print(f"Loading data from {input_csv_path}...")
    data_df = pd.read_csv(input_csv_path, header=None)
    data_df.columns = ['Hypothesis', 'Corrected Hypothesis']

    predictions = []

    print("Running inference...")
    for input_text in data_df['Hypothesis'].astype(str):
        # ByT5 tokenizes at the byte level, so max_length counts bytes, not words.
        inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=512)
        decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

        predictions.append(decoded_output)

    data_df['Predictions'] = predictions
    data_df.to_csv(output_csv_path, index=False)
    print(f"Predictions saved to {output_csv_path}")

# Usage
model_path = "cazzz307/BYT5-SMALL-IndicVoice-with-different-models-hypothesis-IC-W2V-other-ASR-dataset-W2V"  # Hugging Face model path
tokenizer_path = model_path  # Same as model path (adjust if tokenizer is separate)

model, tokenizer = load_model_and_tokenizer(model_path, tokenizer_path)

input_csv = "your_input.csv"  # Replace with your input file path
output_csv = "predictions.csv"  # Replace with your desired output file path
run_inference(input_csv, output_csv, model, tokenizer)

Note: Adjust the tokenizer_path if your tokenizer files are in a separate location or subdirectory.
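
The script above reads a headerless, two-column CSV and generates with `max_length=512`. Because ByT5 works on UTF-8 bytes rather than subword tokens, and characters in most Indic scripts take three bytes each, long utterances can exceed that budget sooner than expected. The following sketch, using only the standard library, writes an input file in the expected layout and flags hypotheses that would not fit. The helper names and the 512-token input budget are assumptions made for illustration.

```python
import csv

MAX_INPUT_TOKENS = 512  # assumed budget, mirroring max_length=512 in the usage script

def byt5_token_count(text):
    """Approximate ByT5 input length: one token per UTF-8 byte plus an end-of-sequence token."""
    return len(text.encode("utf-8")) + 1

def too_long(pairs, limit=MAX_INPUT_TOKENS):
    """Return the hypotheses whose byte-level length exceeds the limit."""
    return [hyp for hyp, _ in pairs if byt5_token_count(hyp) > limit]

def write_input_csv(path, pairs):
    """Write (hypothesis, corrected text) pairs as a headerless two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(pairs)

# Placeholder pairs for illustration only.
pairs = [
    ("नमस्ते", "नमस्ते"),
    ("क" * 200, "क" * 200),  # 200 Devanagari characters = 600 bytes
]
flagged = too_long(pairs)
```

Calling `write_input_csv("your_input.csv", pairs)` would produce a file that `run_inference` can read. Hypotheses returned by `too_long` could be split into shorter segments before correction; how best to split them is left open here.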

Model size: 300M params (Safetensors, F32)

Base model: google/byt5-small