A Proposal of Post-OCR Spelling Correction Using Monolingual Byte-level Language Models

Abstract

This work proposes a spelling corrector that uses monolingual byte-level language models (Monobyte) for post-OCR correction of texts produced by Handwritten Text Recognition (HTR) systems. We evaluate three Monobyte models, based on Google’s ByT5 and trained separately on English, French, and Brazilian Portuguese. The experiments use three datasets of 21st-century manuscripts: IAM, RIMES, and BRESSAY. On IAM, Monobyte achieves reductions of 2.24% in character error rate (CER) and 26.37% in word error rate (WER). On RIMES, the reductions are 13.48% (CER) and 33.34% (WER), while on BRESSAY, Monobyte reduces CER by 12.78% and WER by 40.62%. The BRESSAY results surpass those reported in previous works using a multilingual ByT5 model. Our findings demonstrate the effectiveness of byte-level tokenization on noisy text and underscore the potential of computationally efficient, monolingual models.
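For readers unfamiliar with the two metrics, the sketch below shows one common way to compute CER and WER: the Levenshtein edit distance between the reference and the recognized text, divided by the reference length in characters or words. This is an illustrative assumption, not the evaluation code from the paper, whose normalization and preprocessing may differ. The function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop an element of ref
                            curr[j - 1] + 1,      # add an element of hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical HTR output where "brown" was misread as "brovvn".
reference = "the quick brown fox"
htr_output = "the quick brovvn fox"
print(f"CER = {cer(reference, htr_output):.4f}")
print(f"WER = {wer(reference, htr_output):.4f}")
```

A single misread word affects WER far more than CER, which is one reason the two metrics are usually reported side by side.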

Paper: https://dl.acm.org/doi/10.1145/3704268.3748673
Repository: https://github.com/savi8sant8s/monobyte-spelling-corrector

Models

These are fine-tuned models based on the Monobyte architecture, specifically trained for spelling correction tasks. The models are available in the following directories:

  • models/en: English (IAM dataset)
  • models/fr: French (RIMES dataset)
  • models/pt: Brazilian Portuguese (BRESSAY dataset)
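The directories above could be loaded with the Hugging Face `transformers` library roughly as follows. This is a minimal sketch under several assumptions that this card does not confirm: that the Hub repository ID is `savi8sant8s/monobyte-spelling-corrector`, that each directory holds a checkpoint and tokenizer in `transformers` format, and that the models take the raw HTR line as input with no task prefix. The helper names `load_corrector` and `correct` are hypothetical.

```python
# Assumed Hub repository ID; adjust if the models are hosted elsewhere.
REPO_ID = "savi8sant8s/monobyte-spelling-corrector"

# Language code -> directory, as listed in the Models section.
SUBFOLDERS = {"en": "models/en", "fr": "models/fr", "pt": "models/pt"}


def load_corrector(lang):
    """Load tokenizer and model for one language (assumes transformers-format files)."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    subfolder = SUBFOLDERS[lang]
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder=subfolder)
    model = AutoModelForSeq2SeqLM.from_pretrained(REPO_ID, subfolder=subfolder)
    return tokenizer, model


def correct(text, tokenizer, model, max_new_tokens=256):
    """Run one HTR output line through the corrector (assumes no task prefix)."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint on first call):
# tokenizer, model = load_corrector("en")
# print(correct("the quick brovvn fox", tokenizer, model))
```

Because byte-level models produce roughly one token per UTF-8 byte, `max_new_tokens` should be at least as large as the expected byte length of the corrected line.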

Citation

@inproceedings{araujo2025proposal,
  author    = {Sávio Santos de Araújo and Byron Leite Dantas Bezerra and Arthur Flor de Souza Neto},
  title     = {A Proposal of Post-OCR Spelling Correction Using Monolingual Byte-level Language Models},
  booktitle = {Proceedings of the ACM Symposium on Document Engineering 2025 (DocEng '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3704268.3748673}
}