T5 Spelling Corrector Fine-tuned v3 - Indian Financial Text Domain

Model Description

This T5-based spelling correction model has been specifically fine-tuned on Indian financial documents containing monetary amounts, numeric text, and financial terminology written in English. The model addresses the critical challenge of correcting OCR errors and phonetic misspellings commonly found in Indian financial documents, invoices, receipts, and banking records.

  • Developed by: Ayaan-Sharif
  • Model type: T5ForConditionalGeneration
  • Base model: T5-base (768 hidden size, 12 layers, 12 heads)
  • Language: English
  • Domain: Indian Financial Documents
  • License: MIT

Intended Use

The model is intended for post-OCR text correction of Indian financial documents, where poor scan quality introduces systematic spelling errors in monetary amounts. It is designed to handle:

  • Invoice processing - Correcting scanned invoice amounts
  • Banking document digitization - Cleaning up amount fields from cheques, passbooks, statements
  • Accounting system data entry - Preprocessing before numerical extraction
  • Legal/compliance documents - Ensuring accurate monetary representation in contracts and agreements
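
As an illustration of the "preprocessing before numerical extraction" use case, the sketch below converts an already-corrected amount in words into an integer number of rupees. It is not part of this repository: the function name `words_to_rupees` and the word tables are assumptions made for this example, and it handles only singular scale words and whole rupees (no Paise).

```python
import re

# Word tables for this example only (hypothetical, not shipped with the model).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Scale words of the Indian numbering system.
SCALES = {"thousand": 1_000, "lakh": 100_000, "crore": 10_000_000}
IGNORED = {"rupees", "rupee", "and", "only"}


def words_to_rupees(text: str) -> int:
    """Convert a correctly spelled amount in words to whole rupees.

    Raises ValueError on any unknown token, so a remaining misspelling
    is reported instead of being silently dropped.
    """
    total = 0    # sum of completed groups (crore, lakh, thousand)
    current = 0  # value accumulated since the last scale word
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in IGNORED:
            continue
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"Unrecognised amount word: {token!r}")
    return total + current


print(words_to_rupees("One Hundred Thousand Rupees"))  # 100000
```

Raising on unknown words is a deliberate choice here: text that still contains a misspelling after correction is flagged rather than converted into a wrong amount.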

Usage

Quick Start

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_name = "Ayaan-Sharif/t5-spelling-corrector-finetuned-v3"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example input (financial amount with OCR errors)
text = "One Hunderd Thousand Rupees"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, num_beams=4, early_stopping=True)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Original:", text)
print("Corrected:", corrected)

Sample Output

  • Input: "One Hunderd Thousand Rupees"
  • Output: "One Hundred Thousand Rupees"
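
When several amounts need correcting at once, the Quick Start call can be wrapped in a small batching helper. The following is a minimal sketch rather than an official API of this model: `correct_batch` and its `batch_size` parameter are names invented for this example, and the generation settings simply mirror the Quick Start.

```python
from typing import Iterable, List


def correct_batch(texts: Iterable[str], tokenizer, model,
                  batch_size: int = 8, max_length: int = 50) -> List[str]:
    """Correct many strings, processing them in chunks of `batch_size`.

    `tokenizer` and `model` are expected to be the objects loaded in the
    Quick Start (T5Tokenizer and T5ForConditionalGeneration).
    """
    texts = list(texts)
    corrected: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Pad to the longest item in the chunk so it forms one tensor.
        inputs = tokenizer(chunk, return_tensors="pt",
                           padding=True, truncation=True)
        outputs = model.generate(**inputs, max_length=max_length,
                                 num_beams=4, early_stopping=True)
        corrected.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True)
        )
    return corrected
```

With the `tokenizer` and `model` from the Quick Start already loaded, a call might look like `correct_batch(["One Hunderd Thousand Rupees", "Five thosand Rupees"], tokenizer, model)`.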

Limitations

  • Specifically trained on Indian financial terminology and monetary amounts; may not perform well on general English text
  • Performance depends on the type and severity of OCR/phonetic errors in financial contexts
  • May not handle context-dependent corrections outside the financial domain
  • Not evaluated on non-English languages or non-financial text
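
Given these limitations, one possible safeguard is to accept a correction only when it stays close to the input, and otherwise keep the original text. The sketch below uses Python's standard `difflib`; the helper name `accept_correction` and the 0.8 threshold are assumptions for illustration, not values validated against this model.

```python
from difflib import SequenceMatcher


def accept_correction(original: str, corrected: str,
                      min_similarity: float = 0.8) -> str:
    """Return `corrected` only if it is similar enough to `original`.

    A spelling fix changes a few characters, so the similarity ratio
    stays high. A low ratio suggests the model rewrote the text (for
    example on out-of-domain input), and the original is kept instead.
    """
    ratio = SequenceMatcher(None, original.lower(), corrected.lower()).ratio()
    return corrected if ratio >= min_similarity else original


# A small spelling fix is accepted.
print(accept_correction("One Hunderd Thousand Rupees",
                        "One Hundred Thousand Rupees"))
```

The threshold trades off safety against recall: a higher value rejects more rewrites but may also discard legitimate corrections of heavily garbled OCR output.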

Training Details

  • Fine-tuned from T5-base checkpoint
  • Training data: More than 9,000 samples from Indian financial documents
  • Covers corrections for:
    • Crore/Lakh denominations: "krore", "cor e", "laakh", "lkhs" → "Crore", "Lakh"
    • Thousand/Hundred: "thosand", "thousnd", "hundad", "hunderd" → "Thousand", "Hundred"
    • Complex amounts: Ranges from thousands to hundreds of crores
    • Currency terms: "Paise", "Rupees", proper capitalization
    • Contextual markers: Preserves "[invoice no]", "(pvt ltd)", etc.
  • Training progression: v1 (base), v2 (Lakh/Crore patterns), v3 (comprehensive 9000+ samples)
  • Uses SentencePiece tokenizer (spiece.model)
  • Generation config: decoder_start_token_id=0, eos_token_id=1, pad_token_id=0
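
Because the training data is described as preserving contextual markers such as "[invoice no]" and "(pvt ltd)", a downstream pipeline may want to verify that behaviour on its own data. The check below is a minimal sketch under that assumption; `MARKER_PATTERN` and `markers_preserved` are hypothetical names, and the regular expression only covers non-nested square brackets and parentheses.

```python
import re

# Matches "[...]" or "(...)" spans without nesting (an assumption about
# what counts as a contextual marker).
MARKER_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def markers_preserved(original: str, corrected: str) -> bool:
    """True if both strings contain the same markers in the same order."""
    return MARKER_PATTERN.findall(original) == MARKER_PATTERN.findall(corrected)


print(markers_preserved("[invoice no] Two Laakh Rupees (pvt ltd)",
                        "[invoice no] Two Lakh Rupees (pvt ltd)"))  # True
```

A `False` result does not necessarily mean the correction is wrong (the marker itself may have contained a misspelling), but it is a useful signal for routing an item to manual review.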

Files in this Repository

  • config.json: Model configuration
  • model.safetensors: Model weights (SafeTensors format)
  • tokenizer_config.json: Tokenizer configuration
  • spiece.model: SentencePiece model
  • special_tokens_map.json: Special tokens mapping
  • added_tokens.json: Added tokens
  • generation_config.json: Generation parameters
  • Training artifacts: trainer_state.json, training_args.bin, optimizer.pt, etc.

Requirements

  • Python 3.8+
  • transformers >= 4.21.0
  • torch >= 1.9.0
  • safetensors
  • sentencepiece (required by T5Tokenizer)

Install with: pip install transformers torch safetensors sentencepiece

Citation

If you use this model, please cite:

@misc{t5-spelling-corrector-indian-financial-v3,
  title={T5 Spelling Corrector Fine-tuned v3 - Indian Financial Text Domain},
  author={Ayaan-Sharif},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Ayaan-Sharif/t5-spelling-corrector-finetuned-v3}
}

Contact

For questions or issues, please open an issue on this repository or contact the maintainer.
