T5 Spelling Corrector Fine-tuned v3 - Indian Financial Text Domain

Model Description

This T5-based spelling correction model is fine-tuned on Indian financial documents containing monetary amounts, numeric text, and financial terminology written in English. It targets the OCR errors and phonetic misspellings commonly found in Indian financial documents, invoices, receipts, and banking records.

Developed by: Ayaan-Sharif
Model type: T5ForConditionalGeneration
Base model: T5-base (768 hidden size, 12 layers each in the encoder and decoder, 12 attention heads)
Language: English
Domain: Indian Financial Documents
License: MIT

Intended Use

The model is intended for post-OCR text correction of Indian financial documents, where poor scan quality produces systematic spelling errors in monetary amounts written in words. It is designed to handle:

Invoice processing - Correcting scanned invoice amounts
Banking document digitization - Cleaning up amount fields from cheques, passbooks, statements
Accounting system data entry - Preprocessing before numerical extraction
Legal/compliance documents - Ensuring accurate monetary representation in contracts and agreements
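
The "preprocessing before numerical extraction" use case can be illustrated with a minimal sketch of a downstream step that turns a corrected amount in words into an integer number of rupees. This parser is not part of the model or this repository; the function name words_to_rupees and the word tables are assumptions made for illustration, and paise and decimal amounts are deliberately not handled.

```python
# Hypothetical downstream parser (not shipped with the model): converts an
# amount written in words, using Indian denominations, into whole rupees.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Indian numbering system: 1 Lakh = 100,000 and 1 Crore = 10,000,000.
SCALES = {"thousand": 1_000, "lakh": 100_000, "crore": 10_000_000}
IGNORED = {"and", "rupee", "rupees", "only"}


def words_to_rupees(text: str) -> int:
    """Parse e.g. 'Two Crore Five Lakh' into 20500000; raise on unknown words."""
    total = 0    # sum of the groups already closed by a scale word
    current = 0  # value of the group currently being read
    for raw in text.lower().replace("-", " ").split():
        word = raw.strip(",.")
        if not word or word in IGNORED:
            continue
        # Accept simple plurals such as "lakhs" or "crores".
        if word.endswith("s") and word[:-1] in SCALES:
            word = word[:-1]
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            # A misspelled token ends up here, which is why correction runs first.
            raise ValueError(f"unrecognised amount word: {raw!r}")
    return total + current


print(words_to_rupees("One Hundred Thousand Rupees"))
```

A strict parser like this rejects "One Hunderd Thousand Rupees" outright, so running the spelling corrector beforehand is what makes the extraction step succeed.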

Usage

Quick Start

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_name = "Ayaan-Sharif/t5-spelling-corrector-finetuned-v3"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example input (financial amount with OCR errors)
text = "One Hunderd Thousand Rupees"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, num_beams=4, early_stopping=True)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Original:", text)
print("Corrected:", corrected)

Sample Output

Input: "One Hunderd Thousand Rupees"
Output: "One Hundred Thousand Rupees"
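
For more than one string at a time, the Quick Start call can be wrapped in a small batch helper. The sketch below reuses the same tokenizer and generation arguments as the Quick Start; the helper names load_corrector and correct_batch are assumptions introduced here, not part of the repository, and padding/truncation are added only so that inputs of different lengths can share one batch.

```python
from typing import List, Sequence

MODEL_NAME = "Ayaan-Sharif/t5-spelling-corrector-finetuned-v3"


def load_corrector(model_name: str = MODEL_NAME):
    """Load the tokenizer and model (hypothetical helper)."""
    # Imported lazily so correct_batch can be exercised without transformers.
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    model.eval()  # inference only: disable dropout
    return tokenizer, model


def correct_batch(texts: Sequence[str], tokenizer, model,
                  max_length: int = 50, num_beams: int = 4) -> List[str]:
    """Correct several strings in one generate() call (hypothetical helper)."""
    # Pad to the longest input so the batch forms a single tensor.
    inputs = tokenizer(list(texts), return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length,
                             num_beams=num_beams, early_stopping=True)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage (downloads the model on first run):
# tokenizer, model = load_corrector()
# print(correct_batch(["One Hunderd Thousand Rupees"], tokenizer, model))
```

Batching mainly helps throughput when cleaning many amount fields; for a single string the Quick Start snippet is equivalent.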

Limitations

Specifically trained on Indian financial terminology and monetary amounts; may not perform well on general English text
Performance depends on the type and severity of OCR/phonetic errors in financial contexts
May not handle context-dependent corrections outside the financial domain
Not evaluated on non-English languages or non-financial text
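
Because performance depends on the kind of errors in the input, it may be worth measuring the model on a small labelled sample of your own documents before relying on it. The following is a minimal sketch of such a check, assuming you have (noisy, expected) pairs; the function names and the sample pair list are hypothetical, and the corrector is passed in as a plain callable so any wrapper around the model can be plugged in.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (noisy input, expected correction)


def exact_match_rate(pairs: Iterable[Pair], correct: Callable[[str], str]) -> float:
    """Fraction of pairs where the corrector output equals the expected text."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (noisy, expected) pair")
    hits = sum(1 for noisy, expected in pairs
               if correct(noisy).strip() == expected)
    return hits / len(pairs)


def failures(pairs: Iterable[Pair],
             correct: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (noisy, expected, got) for every pair the corrector misses."""
    out = []
    for noisy, expected in pairs:
        got = correct(noisy).strip()
        if got != expected:
            out.append((noisy, expected, got))
    return out


# Illustrative pairs only; replace with samples from your own documents.
SAMPLE_PAIRS = [
    ("One Hunderd Thousand Rupees", "One Hundred Thousand Rupees"),
    ("Five laakh Rupees", "Five Lakh Rupees"),
]

# Usage, with `correct` being any function that wraps the model:
# print(exact_match_rate(SAMPLE_PAIRS, correct))
# for row in failures(SAMPLE_PAIRS, correct):
#     print(row)
```

Exact match is a strict metric; inspecting the failures list usually shows whether misses are genuine errors or only differences in capitalization or spacing.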

Training Details

Fine-tuned from T5-base checkpoint
Training data: More than 9,000 samples of Indian financial documents
Covers corrections for:
- Crore/Lakh denominations: "krore", "cor e" → "Crore"; "laakh", "lkhs" → "Lakh"
- Thousand/Hundred: "thosand", "thousnd" → "Thousand"; "hundad", "hunderd" → "Hundred"
- Complex amounts: ranging from thousands to hundreds of crores
- Currency terms: "Paise", "Rupees", proper capitalization
- Contextual markers: Preserves "[invoice no]", "(pvt ltd)", etc.
Training progression: v1 (base), v2 (Lakh/Crore patterns), v3 (comprehensive, 9,000+ samples)
Uses SentencePiece tokenizer (spiece.model)
Generation config: decoder_start_token_id=0, eos_token_id=1, pad_token_id=0
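
Since the training data is described as preserving contextual markers such as "[invoice no]" and "(pvt ltd)", a pipeline can verify that property on each output and fall back to the original text when a marker goes missing. The sketch below is a post-processing guard written for illustration; the regular expression, the function names, and the fallback policy are all assumptions, not behaviour of the model itself.

```python
import re
from collections import Counter

# Matches square-bracketed or parenthesised spans such as "[invoice no]"
# or "(pvt ltd)". Nested brackets are not handled in this sketch.
MARKER_RE = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def markers(text: str) -> Counter:
    """Count the bracketed markers in text, ignoring case."""
    return Counter(m.lower() for m in MARKER_RE.findall(text))


def markers_preserved(original: str, corrected: str) -> bool:
    """True if both strings contain exactly the same markers."""
    return markers(original) == markers(corrected)


def safe_correction(original: str, corrected: str) -> str:
    """Keep the model output only if no marker was dropped or altered."""
    return corrected if markers_preserved(original, corrected) else original


print(safe_correction("Two laakh Rupees [invoice no]",
                      "Two Lakh Rupees [invoice no]"))
```

Falling back to the uncorrected text is a conservative choice; a pipeline could instead flag such rows for manual review.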

Files in this Repository

config.json: Model configuration
model.safetensors: Model weights (SafeTensors format)
tokenizer_config.json: Tokenizer configuration
spiece.model: SentencePiece model
special_tokens_map.json: Special tokens mapping
added_tokens.json: Added tokens
generation_config.json: Generation parameters
Training artifacts: trainer_state.json, training_args.bin, optimizer.pt, etc.

Requirements

Python 3.8+
transformers >= 4.21.0
torch >= 1.9.0
safetensors
sentencepiece (required by T5Tokenizer)

Install with: pip install transformers torch safetensors sentencepiece

Citation

If you use this model, please cite:

@misc{t5-spelling-corrector-indian-financial-v3,
  title={T5 Spelling Corrector Fine-tuned v3 - Indian Financial Text Domain},
  author={Ayaan-Sharif},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Ayaan-Sharif/t5-spelling-corrector-finetuned-v3}
}

Contact

For questions or issues, please open an issue on this repository or contact the maintainer.

Model size: 0.2B params (Safetensors)
Tensor type: F32
