🧾 Persian Legal Text Simplification with ParsT5 + Unlimiformer

This model is part of the first benchmark for Persian legal text simplification. It fine-tunes the ParsT5-base encoder-decoder model with the Unlimiformer extension, allowing long legal documents to be processed efficiently, without truncation.

🔗 Project GitHub: mrjoneidi/Simplification-Legal-Texts


🧠 Model Description

  • Base Model: Ahmad/parsT5-base (~248M parameters, F32, Safetensors)
  • Extended with: Unlimiformer for long-input attention
  • Training Data: 5,000+ Persian judicial rulings, with simplified reference texts generated via GPT-4o prompts (see the sketch below)
  • Max Input Length: ~16,000 tokens (with Unlimiformer)
  • Fine-tuned on: the simplified dataset generated from those judicial rulings
  • Task: Text simplification (formal legal Persian → public-friendly Persian)
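
The simplified training targets were produced by prompting GPT-4o. The exact prompt and generation settings are not reproduced in this card, so the following is a purely illustrative sketch; the system prompt, the variable names, and the absence of any post-filtering are all assumptions.

# Hypothetical sketch of the GPT-4o reference-generation step.
# The project's actual prompt and parameters are not shown here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ruling = "..."  # one Persian judicial ruling from the corpus

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Rewrite the following Persian legal ruling in plain, public-friendly Persian.",
        },
        {"role": "user", "content": ruling},
    ],
)
simplified_reference = response.choices[0].message.content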

✨ Example Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("mrjoneidi/parsT5-legal-simplification")
tokenizer = AutoTokenizer.from_pretrained("mrjoneidi/parsT5-legal-simplification")

# A formal ruling: "Having reviewed the available evidence, the court rules to dismiss the claim..."
input_text = "دادگاه با بررسی مدارک موجود، حکم به رد دعوی صادر می‌نماید..."

# Plain Transformers usage truncates at the tokenizer's maximum length;
# see the Unlimiformer sketch below for inputs beyond that limit.
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)

output_ids = model.generate(**inputs, max_new_tokens=512)
simplified = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(simplified)
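
The snippet above truncates at the tokenizer's maximum length. To exploit the ~16,000-token context, the model is wrapped with Unlimiformer at inference time. A minimal sketch, assuming the Unlimiformer repo (abertsch72/unlimiformer) is on the Python path; the convert_model keyword arguments vary by version, and long_document stands in for your own input text:

# Sketch only: wraps the model's attention with Unlimiformer's kNN retrieval
# so generation can attend over inputs far beyond the base model's limit.
from unlimiformer import Unlimiformer

model = Unlimiformer.convert_model(model)

long_inputs = tokenizer(long_document, return_tensors="pt")  # no truncation needed
output_ids = model.generate(**long_inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))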

🛠 Training Details

  • Platform: Kaggle P100 GPU
  • Optimizer: AdamW (best performance among AdamW, LAMB, and SGD)
  • Configurations:
    • 1 vs. 3 unfrozen encoder-decoder blocks
    • Best results with the 3-block configuration (see the sketch below)
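
The block-freezing setup can be reproduced in plain PyTorch. A minimal sketch, assuming standard T5 module names (encoder.block / decoder.block); the learning rate and the trainable LM head are illustrative assumptions, not stated in this card:

# Freeze everything, then unfreeze the last N encoder and decoder blocks.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Ahmad/parsT5-base")

for param in model.parameters():
    param.requires_grad = False

N = 3  # the 3-block configuration performed best
for block in list(model.encoder.block[-N:]) + list(model.decoder.block[-N:]):
    for param in block.parameters():
        param.requires_grad = True

# Keeping the LM head trainable is an assumption.
for param in model.lm_head.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # illustrative value
)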

More details are available on the project GitHub.
