---
language:
- hi
- bn
- te
- ta
- gu
- kn
- ml
- or
- pa
- as
library_name: transformers
base_model: google/byt5-small
tags:
- text-generation
- asr-error-correction
- indic-languages
- multilingual
- byt5
datasets:
- indicvoice
pipeline_tag: text-generation
license: apache-2.0
model-index:
- name: BYT5-SMALL-IndicVoice-W2V-IC-byt5small
  results: []
---

# BYT5-SMALL-IndicVoice-W2V-IC-byt5small

This is a fine-tuned ByT5 Small model trained on Indic ASR data using Hugging Face Transformers. Its goal is to correct post-ASR transcription errors in Indic languages.

Fine-tuning multilingual models such as ByT5 on domain-specific data such as ASR outputs improves performance, especially in noisy or low-resource settings. Backed by a high-quality fine-tuning dataset, these models generalize well across varied languages and dialects.

### 📚 Datasets Used

- IndicVoice

### 🎧 Transcription Models

- Indic Conformer (IC)
- Wav2Vec 2.0 (W2V)

---

### ⚙️ Training Info

- Trained over a period of 6 months
- Trained on A100 GPUs
- Developed as part of a research collaboration with **IIT Bombay**
- Focused on improving transcription accuracy of ASR systems in Indic languages

### 🚀 Usage

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model_and_tokenizer(model_path, tokenizer_path):
    print("Loading model and tokenizer...")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    return model, tokenizer

def run_inference(input_csv_path, output_csv_path, model, tokenizer):
    print(f"Loading data from {input_csv_path}...")
    # Expects a headerless CSV with two columns: the raw ASR hypothesis
    # and a reference corrected hypothesis.
    data_df = pd.read_csv(input_csv_path, header=None)
    data_df.columns = ['Hypothesis', 'Corrected Hypothesis']

    predictions = []
    print("Running inference...")
    for input_text in data_df['Hypothesis']:
        input_ids = tokenizer(
            input_text, return_tensors="pt", truncation=True
        ).input_ids.to(device)
        with torch.no_grad():
            outputs = model.generate(input_ids, max_length=512)
        predictions.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

    data_df['Predictions'] = predictions
    data_df.to_csv(output_csv_path, index=False)
    print(f"Predictions saved to {output_csv_path}")

# Usage
model_path = "cazzz307/BYT5-SMALL-IndicVoice-W2V-IC-byt5small"  # Hugging Face model path
tokenizer_path = model_path  # Same as model path (adjust if tokenizer is separate)

model, tokenizer = load_model_and_tokenizer(model_path, tokenizer_path)

input_csv = "your_input.csv"    # Replace with your input file path
output_csv = "predictions.csv"  # Replace with your desired output file path
run_inference(input_csv, output_csv, model, tokenizer)
```

**Note**: Adjust the `tokenizer_path` if your tokenizer files are in a separate location or subdirectory.
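
For a quick check without preparing a CSV file, a single hypothesis can be corrected directly. This is a minimal sketch: the Hindi input string is an arbitrary placeholder standing in for a noisy ASR hypothesis, and `max_length=512` simply mirrors the script above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "cazzz307/BYT5-SMALL-IndicVoice-W2V-IC-byt5small"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)
model.eval()

# Placeholder hypothesis; replace with real output from your ASR system.
hypothesis = "मुझे स्कूल जाना है"

inputs = tokenizer(hypothesis, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because ByT5 operates directly on UTF-8 bytes, no language-specific tokenizer configuration is needed; the same call works for any of the supported Indic languages.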