Teacher Model: Vision-Language Model for Transliteration of Modi Script to Devanagari
Introduction
This repository hosts the official teacher model weights as described in the paper:
Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari
Paper link: arXiv:2503.13060
Accepted at ICDAR 2025
Our model introduces a novel Vision-Language framework built on the gemma-3-12b-it base model to automatically transliterate the historic Modi script into modern Devanagari, supporting research and digital preservation of rare manuscripts.
Model Description
- Architecture: Vision-Language Model (VLM) based on gemma-3-12b-it
- Task: End-to-end transliteration of scanned Modi script images into Devanagari text.
- Teacher Model: This release contains the weights of the teacher model used for training and evaluation in the referenced paper, distributed as a PEFT adapter on top of the base model (see the snippet after this list).
- Dataset: Fine-tuned and evaluated on the Historic Modi-Devanagari VLM dataset, introduced in the paper.
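Because the release is loaded as a PEFT adapter on top of gemma-3-12b-it (see the usage code below), the pairing can be sanity-checked before downloading the full base model. A minimal sketch, assuming the repository id historyHulk/ModiTrans-12B-Gemma-Teacher from the usage example:

```python
from peft import PeftConfig

# Read only the adapter's metadata; no model weights are downloaded or loaded.
cfg = PeftConfig.from_pretrained("historyHulk/ModiTrans-12B-Gemma-Teacher")
print(cfg.peft_type)                # adapter type (e.g. LORA)
print(cfg.base_model_name_or_path)  # expected: google/gemma-3-12b-it
```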
Installation
pip3 install pillow
pip3 install torch torchvision
pip3 install transformers peft accelerate
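The usage example below runs the 12B model on a CUDA device, so a GPU-enabled PyTorch build is required. A quick, model-agnostic environment check:

```python
import torch, transformers, peft

# Verify library versions and GPU visibility before loading the 12B model.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| peft:", peft.__version__)
```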
How to Use
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch
import torch.nn.functional as F
from peft import PeftModel
device = "cuda:0"
model_id = "google/gemma-3-12b-it"
peft_model_path = "historyHulk/ModiTrans-12B-Gemma-Teacher"
# Load the Gemma 3 base model in bfloat16 on the target device.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device
)

# Attach the ModiTrans teacher adapter released in this repository.
model = PeftModel.from_pretrained(
    model,
    peft_model_path,
    device_map=device,
    torch_dtype=torch.bfloat16
)
# Load a Modi script image, preprocessed as in the dataset, and resize it to 1024x512.
image = Image.open("<Modi Script Image Preprocessed as in Dataset>").convert("RGB").resize((1024, 512))
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Translitrate the following Modi script to Devnagri script."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text"},
        ],
    },
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
input_len = inputs["input_ids"].shape[-1]
pixel_values = inputs['pixel_values']
pixel_values = pixel_values.to(dtype=model.dtype, device=model.device)
model.eval()

# Sampling loop: re-run the model on the growing sequence, sample the next
# token from the softmax distribution, and stop at EOS or a 350-token budget.
with torch.no_grad():
    input_ids = inputs["input_ids"]
    attention_masks = inputs["attention_mask"]

    while True:
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_masks,
            pixel_values=pixel_values,
        )
        logits = outputs.logits[:, -1, :]                     # logits at the last position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        attention_masks = torch.cat([attention_masks, torch.ones_like(next_token)], dim=-1)
        if next_token.item() == processor.tokenizer.eos_token_id or input_ids.shape[1] >= 350:
            break

# Strip the prompt tokens and decode only the newly generated Devanagari text.
generation = input_ids[:, input_len:][0]
generated_text = processor.decode(generation, skip_special_tokens=True)
print(generated_text)
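The loop above makes the decoding procedure explicit. As a shorter alternative (a sketch, not part of the original release), the adapter can optionally be merged into the base weights and decoding delegated to generate, which uses KV caching; do_sample=True roughly matches the multinomial sampling above, while do_sample=False gives greedy decoding.

```python
# Optional: fold the PEFT adapter into the base weights for faster inference.
model = model.merge_and_unload()

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=350,
        do_sample=True,  # set False for deterministic greedy decoding
    )

# generate() returns the prompt plus the continuation; decode only the new tokens.
print(processor.decode(output_ids[0, input_len:], skip_special_tokens=True))
```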
Citation
If you use this model in your research or publications, please cite the following paper:
@article{kausadikar2025historic,
title={Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari},
author={Kausadikar, Harshal and Kale, Tanvi and Susladkar, Onkar and Mittal, Sparsh},
journal={arXiv preprint arXiv:2503.13060},
year={2025}
}