Teacher Model: Vision-Language Model for Transliteration of Modi Script to Devanagari
Introduction
This repository hosts the official teacher model weights as described in the paper:
Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari
Paper link: arXiv:2503.13060
Accepted at ICDAR 2025
Our model introduces a novel Vision-Language framework built on the gemma-3-12b-it base model to automatically transliterate the historic Modi script into modern Devanagari, supporting research and digital preservation of rare manuscripts.
Model Description
- Architecture: Vision-Language Model (VLM) based on gemma-3-12b-it
- Task: End-to-end transliteration of scanned Modi script images into Devanagari text.
- Teacher Model: This release contains the weights of the teacher model used for training and evaluation in the referenced paper, distributed as a PEFT adapter on top of the base model (see the snippet after this list).
- Dataset: Fine-tuned and evaluated on the Historic Modi-Devanagari VLM dataset, introduced in the paper.
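Because the release is loaded as a PEFT adapter on top of gemma-3-12b-it (see the usage code below), the pairing can be sanity-checked before downloading the full base model. A minimal sketch, assuming the repository id historyHulk/ModiTrans-12B-Gemma-Teacher from the usage example:

```python
from peft import PeftConfig

# Read only the adapter's metadata; no model weights are downloaded or loaded.
cfg = PeftConfig.from_pretrained("historyHulk/ModiTrans-12B-Gemma-Teacher")
print(cfg.peft_type)                # adapter type (e.g. LORA)
print(cfg.base_model_name_or_path)  # expected: google/gemma-3-12b-it
```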
Installation
pip3 install pillow
pip3 install torch torchvision
pip3 install transformers peft accelerate
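The usage example below runs the 12B model on a CUDA device, so a GPU-enabled PyTorch build is required. A quick, model-agnostic environment check:

```python
import torch, transformers, peft

# Verify library versions and GPU visibility before loading the 12B model.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| peft:", peft.__version__)
```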
How to Use
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch
import torch.nn.functional as F
from peft import PeftModel
device = "cuda:0"
model_id = "google/gemma-3-12b-it"
peft_model_path = "historyHulk/ModiTrans-12B-Gemma-Teacher"
# Load the Gemma 3 base model in bfloat16 on the target device.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device
)

# Attach the ModiTrans teacher adapter released in this repository.
model = PeftModel.from_pretrained(
    model,
    peft_model_path,
    device_map=device,
    torch_dtype=torch.bfloat16
)
# Load a Modi script image, preprocessed as in the dataset, and resize it to 1024x512.
image = Image.open("<Modi Script Image Preprocessed as in Dataset>").convert("RGB").resize((1024, 512))
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Translitrate the following Modi script to Devnagri script."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text"},
        ],
    },
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
input_len = inputs["input_ids"].shape[-1]
pixel_values = inputs['pixel_values']
pixel_values = pixel_values.to(dtype=model.dtype, device=model.device)
model.eval()

# Sampling loop: re-run the model on the growing sequence, sample the next
# token from the softmax distribution, and stop at EOS or a 350-token budget.
with torch.no_grad():
    input_ids = inputs["input_ids"]
    attention_masks = inputs["attention_mask"]

    while True:
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_masks,
            pixel_values=pixel_values,
        )
        logits = outputs.logits[:, -1, :]                     # logits at the last position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        attention_masks = torch.cat([attention_masks, torch.ones_like(next_token)], dim=-1)
        if next_token.item() == processor.tokenizer.eos_token_id or input_ids.shape[1] >= 350:
            break

# Strip the prompt tokens and decode only the newly generated Devanagari text.
generation = input_ids[:, input_len:][0]
generated_text = processor.decode(generation, skip_special_tokens=True)
print(generated_text)
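The loop above makes the decoding procedure explicit. As a shorter alternative (a sketch, not part of the original release), the adapter can optionally be merged into the base weights and decoding delegated to generate, which uses KV caching; do_sample=True roughly matches the multinomial sampling above, while do_sample=False gives greedy decoding.

```python
# Optional: fold the PEFT adapter into the base weights for faster inference.
model = model.merge_and_unload()

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=350,
        do_sample=True,  # set False for deterministic greedy decoding
    )

# generate() returns the prompt plus the continuation; decode only the new tokens.
print(processor.decode(output_ids[0, input_len:], skip_special_tokens=True))
```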
Citation
If you use this model in your research or publications, please cite the following paper:
@article{kausadikar2025historic,
title={Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari},
author={Kausadikar, Harshal and Kale, Tanvi and Susladkar, Onkar and Mittal, Sparsh},
journal={arXiv preprint arXiv:2503.13060},
year={2025}
}