Image classification using a fine-tuned ViT for sorting historical documents

Goal: sort archive page images into categories so that each page can be passed on to further content-based processing

Scope: image processing, training and evaluation of the ViT model, processing of input files and directories, output of the top-N predicted classes 🏷️ (categories), summarizing predictions in a tabular format, and HF 😊 Hub support for the model

Model description 📇

🔲 Fine-tuned model repository: vit-historical-page ^1 🔗

🔳 Base model repository: Google's vit-base-patch16-224 ^2 🔗
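As a rough illustration of the Hub support mentioned in the scope, the fine-tuned model could be loaded through the `transformers` image-classification pipeline. This is a minimal sketch, not the project's own inference script: the Hub id `k4tel/vit-historical-page` is assumed from the hosting page, and the helper name `classify_page` is hypothetical.

```python
# Assumed Hub id of the fine-tuned model; adjust if the repository lives elsewhere.
MODEL_ID = "k4tel/vit-historical-page"


def classify_page(image_path, top_n=3, model_id=MODEL_ID):
    """Return the top_n predictions as a list of {"label", "score"} dicts.

    Hypothetical helper: it needs `transformers`, `torch` and `Pillow`
    installed, and network access on the first call to download the weights.
    """
    # Imported lazily so that merely defining the helper has no heavy dependencies.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=model_id)
    # `top_k` limits the output to the N highest-scoring categories.
    return classifier(image_path, top_k=top_n)
```

Calling `classify_page("page_0001.png")` (a placeholder file name) would then return up to three label/score pairs for that page.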

Data 📜

Training set of the model: 8950 images

Categories 🏷️

| Label | Ratio | Description |
|---|---|---|
| DRAW | 11.89% | drawings, maps, paintings with text |
| DRAW_L | 8.17% | drawings ... with a table legend or inside tabular layout / forms |
| LINE_HW | 5.99% | handwritten text lines inside tabular layout / forms |
| LINE_P | 6.06% | printed text lines inside tabular layout / forms |
| LINE_T | 13.39% | machine typed text lines inside tabular layout / forms |
| PHOTO | 10.21% | photos with text |
| PHOTO_L | 7.86% | photos inside tabular layout / forms or with a tabular annotation |
| TEXT | 8.58% | mixed types of printed and handwritten texts |
| TEXT_HW | 7.36% | only handwritten text |
| TEXT_P | 6.95% | only printed text |
| TEXT_T | 13.53% | only machine typed text |

Evaluation set (same proportions): 995 images
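To make the class balance concrete, the listed ratios can be turned into approximate per-category image counts. The sketch below only rounds the published percentages against the two set sizes; the resulting counts are estimates, not figures reported by the authors, and the function name is illustrative.

```python
TRAIN_SIZE = 8950
EVAL_SIZE = 995

# Category ratios (in percent) as listed in the table above.
RATIOS = {
    "DRAW": 11.89, "DRAW_L": 8.17, "LINE_HW": 5.99, "LINE_P": 6.06,
    "LINE_T": 13.39, "PHOTO": 10.21, "PHOTO_L": 7.86, "TEXT": 8.58,
    "TEXT_HW": 7.36, "TEXT_P": 6.95, "TEXT_T": 13.53,
}


def expected_counts(total, ratios):
    """Estimate images per category by rounding total * ratio."""
    return {label: round(total * pct / 100) for label, pct in ratios.items()}


train_counts = expected_counts(TRAIN_SIZE, RATIOS)
eval_counts = expected_counts(EVAL_SIZE, RATIOS)

for label in RATIOS:
    print(f"{label:8s} train~{train_counts[label]:5d} eval~{eval_counts[label]:4d}")
```

The listed ratios sum to 99.99%, so the estimated counts may be off by an image or two from the real split.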

Data preprocessing

During training, the following transforms were randomly applied with a 50% chance:

  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
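One way the list above could be assembled into a pipeline is sketched here. It assumes that each transform is gated independently with probability 0.5, which the description does not state explicitly; the helper names `with_probability` and `build_train_augmentations` are hypothetical. torchvision's `transforms.RandomApply` could serve the same purpose as the small wrapper.

```python
import random


def with_probability(transform, p=0.5, rng=random):
    """Wrap `transform` so it runs with probability p, else returns the input unchanged."""
    def wrapped(img):
        return transform(img) if rng.random() < p else img
    return wrapped


def build_train_augmentations(p=0.5):
    """Compose the six augmentations, each gated independently (an assumption)."""
    # Imported lazily: requires `torchvision` and `Pillow`.
    from PIL import ImageEnhance, ImageFilter
    from torchvision import transforms

    candidates = [
        transforms.ColorJitter(brightness=0.5),
        transforms.ColorJitter(contrast=0.5),
        transforms.ColorJitter(saturation=0.5),
        transforms.ColorJitter(hue=0.5),
        # Random sharpening or softening with a factor in [0.5, 1.5].
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)),
        # Random Gaussian blur with a radius in [0, 2] pixels.
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))),
    ]
    return transforms.Compose([with_probability(t, p) for t in candidates])
```

The returned pipeline expects PIL images, so it would sit before any tensor conversion and normalization done by the model's image processor.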

Training Hyperparameters

  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
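The hyperparameters above map directly onto `transformers.TrainingArguments`. The following sketch shows that mapping and works out the implied step counts. It assumes a single device with no gradient accumulation, a recent `transformers` release (older ones name the first argument `evaluation_strategy`), and a placeholder `output_dir`; none of these are stated in the source.

```python
import math

# Values copied from the list above.
HYPERPARAMS = {
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
    "logging_steps": 10,
    "load_best_model_at_end": True,
    "metric_for_best_model": "accuracy",
}


def build_training_args(output_dir="vit-finetune-out"):
    """Build TrainingArguments; `output_dir` is a hypothetical placeholder path."""
    # Imported lazily: requires `transformers` and `torch`.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)


# Step arithmetic for 8950 training images (single device, no accumulation assumed).
TRAIN_SIZE = 8950
steps_per_epoch = math.ceil(TRAIN_SIZE / HYPERPARAMS["per_device_train_batch_size"])
total_steps = steps_per_epoch * HYPERPARAMS["num_train_epochs"]
warmup_steps = math.ceil(total_steps * HYPERPARAMS["warmup_ratio"])
print(steps_per_epoch, total_steps, warmup_steps)
```

Because `metric_for_best_model` is "accuracy", the `Trainer` would also need a `compute_metrics` function that returns a value under the key `accuracy`.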

Results 📊

Evaluation set's accuracy (Top-3): 99.6%

TOP-3 confusion matrix - trained ViT

Evaluation set's accuracy (Top-1): 97.3%

TOP-1 confusion matrix - trained ViT
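For reference, Top-N accuracy counts a page as correct when its true label appears among the N highest-scoring categories. The sketch below is a plain-Python illustration of that definition on toy data, not the project's evaluation code; the function name and the toy scores are made up for the example.

```python
def top_k_accuracy(scores, true_labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    `scores` is a list of {label: score} dicts, one per sample.
    """
    if len(scores) != len(true_labels):
        raise ValueError("scores and true_labels must have the same length")
    hits = 0
    for per_class, truth in zip(scores, true_labels):
        # Rank labels from highest to lowest score and keep the first k.
        ranked = sorted(per_class, key=per_class.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)


# Toy data: three pages scored over three of the categories.
toy_scores = [
    {"TEXT": 0.7, "TEXT_P": 0.2, "TEXT_T": 0.1},
    {"TEXT": 0.5, "TEXT_P": 0.3, "TEXT_T": 0.2},
    {"TEXT": 0.1, "TEXT_P": 0.3, "TEXT_T": 0.6},
]
toy_truth = ["TEXT", "TEXT_P", "TEXT"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(toy_scores, toy_truth, k):.3f}")
```

Applied to the 995-image evaluation set, the reported 97.3% and 99.6% correspond to roughly 968 and 991 correctly covered pages, respectively.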

Result tables

Table columns

  • FILE - name of the file
  • PAGE - number of the page
  • CLASS-N - category label 🏷️ of the TOP-N guess
  • SCORE-N - category score 🏷️ of the TOP-N guess
  • TRUE - actual category label 🏷️
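A result table with these columns could be produced with the standard `csv` module, as sketched below. The column order (all CLASS-N columns before the SCORE-N columns), the three-decimal rounding, and the helper names are assumptions made for illustration; the project's actual output layout may differ.

```python
import csv
import io


def result_header(top_n):
    """Column names: FILE, PAGE, CLASS-1..N, SCORE-1..N, TRUE (assumed order)."""
    classes = [f"CLASS-{i}" for i in range(1, top_n + 1)]
    scores = [f"SCORE-{i}" for i in range(1, top_n + 1)]
    return ["FILE", "PAGE", *classes, *scores, "TRUE"]


def result_row(file_name, page, scores, top_n, true_label=""):
    """Build one table row from a {label: score} dict for a single page."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    row = {"FILE": file_name, "PAGE": page, "TRUE": true_label}
    for i, (label, score) in enumerate(ranked, start=1):
        row[f"CLASS-{i}"] = label
        row[f"SCORE-{i}"] = round(score, 3)
    return row


def write_results(rows, top_n, stream):
    """Write the rows as CSV to any text stream (file or StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=result_header(top_n))
    writer.writeheader()
    writer.writerows(rows)


# Example with a made-up file name and scores.
example = result_row("doc_a.pdf", 1, {"TEXT_T": 0.91, "TEXT": 0.06, "TEXT_P": 0.03},
                     top_n=2, true_label="TEXT_T")
buffer = io.StringIO()
write_results([example], top_n=2, stream=buffer)
print(buffer.getvalue())
```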

Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type: fine-tuned ViT ^2 with a 224x224 input resolution

©️ 2022 UFAL & ATRIUM

Model size: 85.8M params (Safetensors, tensor type F32)