🌱 PlantDeBERTa: A Domain-Adapted Language Model for Plant Stress and Response NER

Model Description

PlantDeBERTa is a DeBERTa-based transformer model fine-tuned for Named Entity Recognition (NER) in the plant sciences, with a focus on lentil (Lens culinaris) stress-response literature. This model is part of a broader effort to enable structured knowledge extraction and ontology-aligned information retrieval in agricultural and biological NLP.

The model was trained on a custom-annotated corpus grounded in the Crop Ontology and enriched with part-of-speech (POS) tags and heuristic post-processing. It supports fine-grained tagging of biological entities and responses across molecular, physiological, biochemical, and agronomic categories.

Base model: DeBERTa base
Downstream task: Token Classification (NER)
Domain: Plant biology, crop stress literature
Dataset: Custom annotated corpus from scientific literature and plant stress databases

🧠 Intended Use

This model is designed for:

High-precision NER for crop stress-response studies
Knowledge graph population in plant biology
Semantic indexing of agricultural literature
Plant trait mining and ontology enrichment
Supporting digital breeding tools and phenomics

📊 Performance

The model was evaluated on a domain-specific validation set and achieved the following:

Metric	Score
Accuracy	88.38%
Weighted Precision	96.10%
Weighted Recall	95.22%
Weighted F1	95.49%
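
For readers unfamiliar with the "weighted" scores above, the sketch below shows how such an average is conventionally computed: per-label precision, recall, and F1 are averaged with each label weighted by its number of gold tokens (its support). This assumes the usual support-weighted definition, as in scikit-learn's `average="weighted"`. The `weighted_prf` helper and the toy label sequences are illustrative, not the authors' evaluation code or data.

```python
from collections import Counter

def weighted_prf(gold, pred):
    """Support-weighted precision, recall and F1 over token labels."""
    assert len(gold) == len(pred) and gold
    support = Counter(gold)        # gold tokens per label
    pred_count = Counter(pred)     # predicted tokens per label
    true_pos = Counter(g for g, p in zip(gold, pred) if g == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        r = true_pos[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / len(gold)     # share of gold tokens carrying this label
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Toy sequences (hypothetical, for illustration only)
gold = ["O", "B_PLANT", "O", "B_ABIOTIC_STRESS", "I_ABIOTIC_STRESS", "O"]
pred = ["O", "B_PLANT", "O", "B_ABIOTIC_STRESS", "O", "O"]
precision, recall, f1 = weighted_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```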

📦 How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("PHENOMA/PlantDeBERTa")
tokenizer = AutoTokenizer.from_pretrained("PHENOMA/PlantDeBERTa")

# Build a token-classification (NER) pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

text = "Most of their known homologs coded for glycine-rich, cold and drought-regulated proteins, dormancy-associated proteins, proline-rich proteins (PRPs), and other membrane proteins."
results = ner_pipeline(text)

# Each result is a dict holding the predicted label, its score, and the token text
print(results)
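
The pipeline call above returns one prediction per token rather than whole entities. The sketch below merges such predictions into entity spans. It assumes labels carry the `B_`/`I_` prefixes listed in the label set, and that each prediction has `entity`, `index`, `start`, and `end` keys, as the `ner` pipeline normally produces with a fast tokenizer. The `group_entities` helper and the sample predictions are hypothetical and hand-written; they are not real model output.

```python
def group_entities(text, predictions):
    """Merge token-level predictions into spans using B_/I_ prefixes."""
    spans = []
    current = None       # span currently being extended
    last_index = None    # token index of the last piece added to it
    for pred in predictions:
        prefix, _, etype = pred["entity"].partition("_")
        if prefix == "B" and etype:
            # A B_ tag always opens a new span
            current = {"type": etype, "start": pred["start"], "end": pred["end"]}
            spans.append(current)
            last_index = pred["index"]
        elif (prefix == "I" and current is not None
              and current["type"] == etype
              and pred["index"] == last_index + 1):
            # An I_ tag extends the open span only if it is adjacent and same type
            current["end"] = pred["end"]
            last_index = pred["index"]
        else:
            # Unprefixed labels (e.g. REDUCE) and stray I_ tags close the span
            current = None
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans

# Hand-written predictions in the pipeline's output shape (assumption)
sample_text = "Drought stress reduced lentil yield."
sample_predictions = [
    {"entity": "B_ABIOTIC_STRESS", "index": 1, "start": 0, "end": 7},
    {"entity": "I_ABIOTIC_STRESS", "index": 2, "start": 8, "end": 14},
    {"entity": "REDUCE", "index": 3, "start": 15, "end": 22},
    {"entity": "B_PLANT", "index": 4, "start": 23, "end": 29},
    {"entity": "B_AGRONOMIC_RESPONSES", "index": 5, "start": 30, "end": 35},
]
spans = group_entities(sample_text, sample_predictions)
for span in spans:
    print(span["type"], "->", span["text"])
```

A hand-rolled grouping step is shown because the pipeline's built-in `aggregation_strategy` option is written around hyphenated `B-`/`I-` prefixes, so underscore-prefixed labels may not group as expected. Unprefixed labels such as `REDUCE` are skipped in this sketch.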

🧪 Training Details

Optimizer: AdamW
Learning Rates: Grid-searched over [1e-5, 5e-5]
Batch Sizes: [8, 16, 32]
Epochs: [3, 5, 10]
Loss Function: Custom weighted CrossEntropyLoss to handle class imbalance, with special weights applied to reduce overemphasis on the 'O' (outside) label
Evaluation Metric: Weighted F1-score (primary for model selection)
Trainer: Custom Trainer subclass with weighted loss
Device: GPU (CUDA)
Tokenization: Subword-aligned using Hugging Face’s AutoTokenizer with post-processing for B-/I- consistency
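
To illustrate the weighted loss listed above, here is a dependency-free sketch of class-weighted cross-entropy in which the 'O' label is down-weighted. It mirrors the mean reduction of PyTorch's `torch.nn.CrossEntropyLoss(weight=...)`, which divides by the sum of the target weights. The weight values, the three-label setup, and the function name are assumptions for illustration; the card does not report the actual weights used.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Mean class-weighted cross-entropy, normalised by the sum of target weights."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_z - row[target]          # negative log-likelihood of the gold label
        weight = class_weights[target]
        total_loss += weight * nll
        total_weight += weight
    return total_loss / total_weight

# Label order (hypothetical): 0 = O, 1 = B_PLANT, 2 = I_PLANT
uniform_weights = [1.0, 1.0, 1.0]
o_downweighted = [0.2, 1.0, 1.0]   # illustrative values only

# Two tokens, both scored as confidently 'O'; the second is really B_PLANT
logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
targets = [0, 1]

uniform_loss = weighted_cross_entropy(logits, targets, uniform_weights)
weighted_loss = weighted_cross_entropy(logits, targets, o_downweighted)
print(f"uniform={uniform_loss:.4f} weighted={weighted_loss:.4f}")
```

With 'O' down-weighted, the missed entity token dominates the average, so the weighted loss is larger than the uniform one for the same predictions.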

🧬 Label Set

This model uses a custom label set, for example:

B_PLANT, I_PLANT
B_BIOTIC_STRESS, I_BIOTIC_STRESS
B_ABIOTIC_STRESS, I_ABIOTIC_STRESS
B_AGRONOMIC_RESPONSES, I_AGRONOMIC_RESPONSES
B_PHYSIOLOGICAL_RESPONSES, I_PHYSIOLOGICAL_RESPONSES
B_BIOCHEMICAL_RESPONSES, I_BIOCHEMICAL_RESPONSES
B_MOLECULAR_RESPONSES, I_MOLECULAR_RESPONSES
INCREASE, INDUCES_CHANGES, REGULATES, DECREASE
REDUCE, INFLUENCES, AFFECTS, AFFECTED, AUGMENT
O (outside)
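
The B/I consistency post-processing mentioned under Training Details is not described in detail. One common convention, sketched below under that caveat, promotes any `I_X` tag that does not continue an `X` entity to `B_X`. The `repair_bio` helper is hypothetical and may differ from the authors' rules.

```python
def repair_bio(tags):
    """Promote any I_X tag that does not continue an X entity to B_X."""
    repaired = []
    prev_type = None  # entity type of the previous tag, if it was B_/I_
    for tag in tags:
        prefix, _, etype = tag.partition("_")
        if prefix == "I" and etype and prev_type != etype:
            # Orphan continuation: start a new entity instead
            tag = "B_" + etype
            prefix = "B"
        # Only B_/I_ tags keep an entity open; O and unprefixed labels close it
        prev_type = etype if prefix in ("B", "I") and etype else None
        repaired.append(tag)
    return repaired

raw = ["O", "I_PLANT", "I_PLANT", "O", "B_ABIOTIC_STRESS",
       "I_BIOTIC_STRESS", "REDUCE", "I_PLANT"]
fixed = repair_bio(raw)
print(fixed)
```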

📚 Dataset

PlantDeBERTa was trained on a corpus of 142 annotated abstracts on lentil stress responses, curated from sources including ScienceDirect, SpringerLink, and Scopus. Annotations were performed by plant science experts to ensure semantic and ontological consistency.

Annotation schema: Aligned with Crop Ontology
POS tagging and rule-based corrections applied
Inter-annotator agreement: κ = 0.78
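
As a reminder of what the agreement figure measures, the sketch below computes Cohen's kappa for two annotators labelling the same tokens. It assumes κ here denotes Cohen's kappa; the card does not state the variant or the unit of comparison. The label sequences are invented for illustration and are not drawn from the corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    # Undefined when expected == 1 (both annotators always use one identical label)
    return (observed - expected) / (1 - expected)

# Invented annotations for ten tokens
annotator_a = ["O", "O", "PLANT", "PLANT", "STRESS", "O", "O", "STRESS", "PLANT", "O"]
annotator_b = ["O", "O", "PLANT", "O", "STRESS", "O", "O", "STRESS", "PLANT", "PLANT"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```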

🔬 Limitations

Domain-specific: May underperform on general-purpose text

English only: no multilingual support

Entity boundaries may be imprecise for novel expressions

🤝 Citation and License

This model is released under the MIT License.

If you use this model in your research or application, please consider citing:

@misc{khey2025plantdebertaopensourcelanguage,
      title={PlantDeBERTa: An Open Source Language Model for Plant Science}, 
      author={Hiba Khey and Amine Lakhder and Salma Rouichi and Imane El Ghabi and Kamal Hejjaoui and Younes En-nahli and Fahd Kalloubi and Moez Amri},
      year={2025},
      eprint={2506.08897},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08897}, 
}

✨ Contact & Contributions

Developed by: Hiba KHEY and Amine Lakhder
Contact: [email protected], [email protected]

Questions or issues? Open an issue or email us.
