Is it possible to fine-tune olmOCR-2 on a small custom dataset?

#4
by JamesGs - opened

Hi, and thanks for the great work on olmOCR-2!

I have a question regarding fine-tuning. From the README, it seems that the provided configs are designed for large-scale training (around 270K pages per epoch). However, this line confused me a bit:

“We hope to add more options to make further finetuning your own small model more simple and easy.” (https://github.com/allenai/olmocr/tree/main/olmocr/train)

Does this mean it’s currently not possible (or not recommended) to fine-tune olmOCR-2 on a smaller dataset, e.g., a few thousand samples? Or would it still work if I adjusted the dataset paths and training parameters accordingly?
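For concreteness, here’s the kind of small dataset I’m picturing — one JSONL record per rendered page, paired with its target transcription. The field names and the path below are just my guesses, not the actual olmocr dataset schema:

```python
# Sketch of the small dataset I have in mind. The "image"/"text" field
# names and the path are assumptions, not the actual olmocr schema.
import json

def load_records(path):
    """Yield (image_path, target_text) pairs from a JSONL file."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield rec["image"], rec["text"]

records = list(load_records("/data/finance_pages/train.jsonl"))
print(f"{len(records)} training pages")  # a few thousand, not ~270K
```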

I’d like to experiment with domain-specific data (complex tables, financial statements, etc.) but on a much smaller scale. Any guidance or best practices for this would be greatly appreciated!
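To make the question concrete, this is the rough parameter-efficient run I’d try if the configs can be repointed — a sketch using transformers + peft, not the repo’s actual training entry point. The model ID, prompt, and hyperparameters are my placeholders, and olmOCR-2 checkpoints (Qwen2.5-VL-based) would need the matching model class:

```python
# Rough sketch of a small-scale LoRA run with transformers + peft --
# NOT the repo's actual training entry point. Model ID, prompt, and
# hyperparameters are placeholders; olmOCR-2 checkpoints are
# Qwen2.5-VL-based and would need the matching model class.
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import (
    AutoProcessor,
    Qwen2VLForConditionalGeneration,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

MODEL_ID = "allenai/olmOCR-7B-0225-preview"  # Qwen2-VL-based checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA keeps the trainable-parameter count small, which seems like a
# safer fit for a few thousand samples than full fine-tuning.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

class PageDataset(Dataset):
    """Wraps the (image_path, target_text) records loaded above."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        image_path, target_text = self.records[idx]
        messages = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Transcribe this page."},  # placeholder prompt
        ]}]
        prompt = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        batch = processor(text=prompt + target_text,
                          images=Image.open(image_path), return_tensors="pt")
        batch["labels"] = batch["input_ids"].clone()  # naive: no prompt masking
        return batch

def collate_single(features):
    # Batch size 1 keeps variable-size pages simple; the processor output
    # already carries the batch dimension where it is needed.
    return features[0]

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="olmocr-small-finetune",
        per_device_train_batch_size=1,   # one variable-size page at a time
        gradient_accumulation_steps=8,
        learning_rate=1e-5,              # lower than a 270K-page run would use
        num_train_epochs=2,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=PageDataset(records),  # `records` from the loader above
    data_collator=collate_single,
)
trainer.train()
```

(I went with LoRA mainly to limit catastrophic forgetting on such a small set — happy to hear if adjusting the provided full fine-tuning configs is the better route.)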

Thanks in advance!
