PlantDeBERTa: A Domain-Adapted Language Model for Plant Stress and Response NER
Model Description
PlantDeBERTa is a DeBERTa-based transformer model fine-tuned for Named Entity Recognition (NER) in the plant sciences, with a focus on lentil (Lens culinaris) stress-response literature. This model is part of a broader effort to enable structured knowledge extraction and ontology-aligned information retrieval in agricultural and biological NLP.
The model was trained on a custom-annotated corpus grounded in the Crop Ontology and enriched with part-of-speech (POS) tags and heuristic post-processing. It supports high-resolution tagging of diverse biological entities and responses across molecular, physiological, biochemical, and agronomic categories.
- Base model: DeBERTa base
- Downstream task: Token Classification (NER)
- Domain: Plant biology, crop stress literature
- Dataset: Custom annotated corpus from scientific literature and plant stress databases
Intended Use
This model is designed for:
- High-precision NER for crop stress-response studies
- Knowledge graph population in plant biology
- Semantic indexing of agricultural literature
- Plant trait mining and ontology enrichment
- Supporting digital breeding tools and phenomics
Performance
The model was evaluated on a domain-specific validation set and achieved the following:
Metric | Score |
---|---|
Accuracy | 88.38% |
Weighted Precision | 96.10% |
Weighted Recall | 95.22% |
Weighted F1 | 95.49% |
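For readers unfamiliar with support-weighted metrics, the following is a minimal sketch of how weighted precision, recall, and F1 can be computed from flat tag sequences. It assumes token-level scoring; the card does not say whether the reported scores are token-level or entity-level, and the helper name `weighted_prf` and the toy tags are illustrative only.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over flat label sequences."""
    support = Counter(y_true)          # number of gold tokens per label
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[label] / total  # each class counts by its gold frequency
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy gold and predicted tags (illustrative only, not model output).
gold = ["B_PLANT", "O", "B_ABIOTIC_STRESS", "I_ABIOTIC_STRESS", "O", "O"]
pred = ["B_PLANT", "O", "B_ABIOTIC_STRESS", "O", "O", "O"]
p, r, f = weighted_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because each class is weighted by its gold frequency, frequent labels dominate the averages, which is worth keeping in mind when comparing these figures with per-class results.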
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("PHENOMA/PlantDeBERTa")
tokenizer = AutoTokenizer.from_pretrained("PHENOMA/PlantDeBERTa")

# Build a token-classification (NER) pipeline.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

text = "Most of their known homologs coded for glycine-rich, cold and drought-regulated proteins, dormancy-associated proteins, proline-rich proteins (PRPs), and other membrane proteins."

# Each result is a dict describing one tagged token.
results = ner_pipeline(text)
print(results)
```
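The pipeline returns one prediction per token, so multi-token entities usually need to be merged into spans. Below is a minimal sketch of such a merge. It assumes the token dicts carry the `entity`, `start`, and `end` keys that the `ner` pipeline emits without aggregation, and that labels follow the `B_`/`I_` underscore convention listed in this card. The helper `merge_entities` and the mock predictions are hypothetical, not real model output.

```python
def merge_entities(text, token_preds):
    """Merge B_/I_ tagged tokens (pipeline-style dicts) into character-level spans."""
    spans = []
    for tok in token_preds:
        label = tok["entity"]
        if label == "O":
            continue
        prefix, _, etype = label.partition("_")
        if prefix not in ("B", "I") or not etype:
            # Labels such as REGULATES carry no B_/I_ prefix; treat each as its own span.
            prefix, etype = "B", label
        last = spans[-1] if spans else None
        continues = (
            prefix == "I"
            and last is not None
            and last["type"] == etype
            and tok["start"] - last["end"] <= 1  # adjacent, or separated by one character
        )
        if continues:
            last["end"] = tok["end"]
        else:
            spans.append({"type": etype, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans

# Mock token predictions (assumed for illustration, not produced by the model).
sample_text = "Drought stress reduced lentil yield."
mock_preds = [
    {"entity": "B_ABIOTIC_STRESS", "start": 0, "end": 7},
    {"entity": "I_ABIOTIC_STRESS", "start": 8, "end": 14},
    {"entity": "REDUCE", "start": 15, "end": 22},
    {"entity": "B_PLANT", "start": 23, "end": 29},
    {"entity": "B_AGRONOMIC_RESPONSES", "start": 30, "end": 35},
]
spans = merge_entities(sample_text, mock_preds)
for span in spans:
    print(span["type"], "->", span["text"])
```

With real output, `merge_entities(text, results)` would be the corresponding call. Subword tokens inside one word may need extra handling depending on how the tokenizer reports offsets.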
Training Details
- Optimizer: AdamW
- Learning Rates: Grid-searched over `[1e-5, 5e-5]`
- Batch Sizes: `[8, 16, 32]`
- Epochs: `[3, 5, 10]`
- Loss Function: Custom weighted `CrossEntropyLoss` to handle class imbalance
- Special weights applied to reduce overemphasis on the `'O'` (outside) label
- Evaluation Metric: Weighted F1-score (primary for model selection)
- Trainer: Custom `Trainer` subclass with weighted loss
- Device: GPU (CUDA)
- Tokenization: Subword-aligned using Hugging Face's `AutoTokenizer`, with post-processing for B-/I- consistency
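To make the weighted loss concrete, here is a minimal pure-Python sketch of a class-weighted cross-entropy in which the `O` label is down-weighted. The actual weights used for PlantDeBERTa are not given in this card, so the inverse-frequency scheme, the `o_scale` factor, and the label counts below are all assumptions for illustration.

```python
import math

def class_weights(label_counts, o_label="O", o_scale=0.5):
    """Inverse-frequency weights, with an extra down-scaling of the 'O' label."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    weights = {lab: total / (n_classes * cnt) for lab, cnt in label_counts.items()}
    weights[o_label] *= o_scale  # hypothetical factor; not taken from the card
    return weights

def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalised by the summed target weights."""
    num = sum(-weights[t] * math.log(p[t]) for p, t in zip(probs, targets))
    den = sum(weights[t] for t in targets)
    return num / den

# Hypothetical label counts and predicted distributions for two tokens.
counts = {"O": 900, "B_PLANT": 60, "I_PLANT": 40}
weights = class_weights(counts)
probs = [
    {"O": 0.9, "B_PLANT": 0.05, "I_PLANT": 0.05},
    {"O": 0.6, "B_PLANT": 0.3, "I_PLANT": 0.1},
]
targets = ["O", "B_PLANT"]
loss = weighted_cross_entropy(probs, targets, weights)
uniform = weighted_cross_entropy(probs, targets, {lab: 1.0 for lab in counts})
print(f"weighted={loss:.4f} unweighted={uniform:.4f}")
```

The normalisation by summed target weights mirrors how PyTorch's `CrossEntropyLoss` behaves with a `weight` tensor and mean reduction. In a `Trainer` subclass, the same weights would typically be passed to that loss inside an overridden `compute_loss`.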
Label Set
This model supports a custom label set, e.g.:
- `B_PLANT`, `I_PLANT`
- `B_BIOTIC_STRESS`, `I_BIOTIC_STRESS`
- `B_ABIOTIC_STRESS`, `I_ABIOTIC_STRESS`
- `B_AGRONOMIC_RESPONSES`, `I_AGRONOMIC_RESPONSES`
- `B_PHYSIOLOGICAL_RESPONSES`, `I_PHYSIOLOGICAL_RESPONSES`
- `B_BIOCHEMICAL_RESPONSES`, `I_BIOCHEMICAL_RESPONSES`
- `B_MOLECULAR_RESPONSES`, `I_MOLECULAR_RESPONSES`
- `INCREASE`, `INDUCES_CHANGES`, `REGULATES`, `DECREASE`
- `REDUCE`, `INFLUENCES`, `AFFECTS`, `AFFECTED`, `AUGMENT`
- `O` (outside)
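The training notes mention post-processing for B-/I- consistency without describing the rules. One common convention, sketched below as an assumption rather than the authors' method, is to promote any `I_X` tag that does not follow a `B_X` or `I_X` of the same type to `B_X`. The function name `repair_bio` is hypothetical.

```python
def repair_bio(tags):
    """Make a B_/I_ tag sequence consistent: an I_X tag must follow B_X or I_X."""
    fixed = []
    prev_type = None  # entity type of the previous tag, or None outside an entity
    for tag in tags:
        prefix, _, etype = tag.partition("_")
        if prefix == "I" and etype and prev_type != etype:
            tag = "B_" + etype  # orphan continuation becomes the start of a new entity
        if prefix in ("B", "I") and etype:
            prev_type = etype
        else:
            prev_type = None  # 'O' and relation labels such as REGULATES end an entity
        fixed.append(tag)
    return fixed

# Illustrative tag sequence containing orphan I_ tags.
raw = ["O", "I_PLANT", "I_PLANT", "REGULATES", "I_ABIOTIC_STRESS", "B_PLANT", "I_ABIOTIC_STRESS"]
repaired = repair_bio(raw)
print(repaired)
```

Labels without a `B_`/`I_` prefix pass through unchanged, so the relation-style labels in the list above are unaffected.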
Dataset
PlantDeBERTa was trained on a corpus of 142 annotated abstracts related to lentil stress responses, curated from ScienceDirect, SpringerLink, Scopus, etc. Annotations were performed by plant science experts, ensuring semantic and ontological consistency.
- Annotation schema: Aligned with Crop Ontology
- POS tagging and rule-based corrections applied
- Inter-annotator agreement: κ = 0.78
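As a reference for how an agreement score like κ is derived, here is a minimal sketch of Cohen's kappa for two annotators. The card does not state which kappa variant or how many annotators were involved, so the two-annotator setup and the toy labels are assumptions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations (illustrative only).
annotator_a = ["O", "O", "B_PLANT", "O", "B_PLANT", "O"]
annotator_b = ["O", "O", "B_PLANT", "O", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in NER because the `O` label dominates.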
Limitations
- Domain-specific: may underperform on general-purpose text
- English only: no multilingual support
- Entity boundaries may be imprecise for novel expressions
Citation and License
This model is released under the MIT License.
If you use this model in your research or application, please consider citing:
```bibtex
@misc{khey2025plantdebertaopensourcelanguage,
  title={PlantDeBERTa: An Open Source Language Model for Plant Science},
  author={Hiba Khey and Amine Lakhder and Salma Rouichi and Imane El Ghabi and Kamal Hejjaoui and Younes En-nahli and Fahd Kalloubi and Moez Amri},
  year={2025},
  eprint={2506.08897},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.08897},
}
```
Contact & Contributions
- Developed by: Hiba KHEY and Amine Lakhder
- Contact: [email protected], [email protected]
Questions or issues? Open an issue or email us.
We welcome contributions and collaborations related to plant NER, knowledge graphs, or domain-specific NLP!
Model tree for PHENOMA/PlantDeBERTa: base model microsoft/deberta-v3-base