🌱 PlantDeBERTa: A Domain-Adapted Language Model for Plant Stress and Response NER

Model Description

PlantDeBERTa is a DeBERTa-based transformer model fine-tuned for Named Entity Recognition (NER) in the plant sciences, with a focus on lentil (Lens culinaris) stress-response literature. This model is part of a broader effort to enable structured knowledge extraction and ontology-aligned information retrieval in agricultural and biological NLP.

The model was trained on a custom-annotated corpus grounded in the Crop Ontology and enriched with part-of-speech (POS) tags and heuristic post-processing. It supports high-resolution tagging of diverse biological entities and responses across molecular, physiological, biochemical, and agronomic categories.

  • Base model: DeBERTa base
  • Downstream task: Token Classification (NER)
  • Domain: Plant biology, crop stress literature
  • Dataset: Custom annotated corpus from scientific literature and plant stress databases

🧠 Intended Use

This model is designed for:

  • High-precision NER for crop stress-response studies
  • Knowledge graph population in plant biology
  • Semantic indexing of agricultural literature
  • Plant trait mining and ontology enrichment
  • Supporting digital breeding tools and phenomics

📊 Performance

The model was evaluated on a domain-specific validation set and achieved the following:

Metric               Score
------------------   ------
Accuracy             88.38%
Weighted Precision   96.10%
Weighted Recall      95.22%
Weighted F1          95.49%
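The weighted metrics above average per-class scores weighted by support (the number of gold tokens per label), so frequent labels dominate. A minimal, dependency-free sketch of weighted F1 (equivalent in formulation to scikit-learn's `average="weighted"`; the helper name is ours):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1 over token labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for lab in support:
        tp = sum(t == p == lab for t, p in zip(y_true, y_pred))
        fp = sum(p == lab and t != lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[lab] * f1
    return score / total

# Perfect predictions yield a weighted F1 of 1.0
print(weighted_f1(["O", "O", "B_PLANT"], ["O", "O", "B_PLANT"]))  # → 1.0
```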

📦 How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("PHENOMA/PlantDeBERTa")
tokenizer = AutoTokenizer.from_pretrained("PHENOMA/PlantDeBERTa")

# "ner" is an alias for the token-classification pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

text = "Most of their known homologs coded for glycine-rich, cold and drought-regulated proteins, dormancy-associated proteins, proline-rich proteins (PRPs), and other membrane proteins."
results = ner_pipeline(text)

print(results)
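The pipeline returns subword-level tags, and the heuristic post-processing mentioned earlier enforces B_/I_ consistency on the output. A minimal sketch of such a rule (`repair_bio` is a hypothetical helper; tag strings follow the model's B_/I_ convention):

```python
def repair_bio(tags):
    """Heuristic repair: an I_ tag must continue a same-type B_/I_ span;
    otherwise promote it to B_. Illustrative post-processing rule only."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I_"):
            etype = tag[2:]
            if prev not in (f"B_{etype}", f"I_{etype}"):
                tag = f"B_{etype}"
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_bio(["O", "I_PLANT", "I_PLANT", "O"]))
# → ['O', 'B_PLANT', 'I_PLANT', 'O']
```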

🧪 Training Details

  • Optimizer: AdamW
  • Learning Rates: Grid-searched over [1e-5, 5e-5]
  • Batch Sizes: [8, 16, 32]
  • Epochs: [3, 5, 10]
  • Loss Function: Custom weighted CrossEntropyLoss to handle class imbalance
  • Special weights applied to reduce overemphasis on the 'O' (outside) label
  • Evaluation Metric: Weighted F1-score (primary for model selection)
  • Trainer: Custom Trainer subclass with weighted loss
  • Device: GPU (CUDA)
  • Tokenization: Subword-aligned using Hugging Face’s AutoTokenizer with post-processing for B-/I- consistency
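The weighted loss described above can be sketched in plain PyTorch. The specific weight values here are illustrative assumptions (the paper does not publish them), and in training this loss would sit inside the custom Trainer subclass's `compute_loss` override:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_labels = 24              # size of the label inventory (assumed here)
weights = torch.ones(num_labels)
weights[0] = 0.1             # illustrative: down-weight 'O', assumed at index 0

# Weighted cross-entropy; -100 marks padding/special subword positions
loss_fct = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

logits = torch.randn(5, num_labels)        # logits for 5 tokens
labels = torch.tensor([0, 1, 2, 3, -100])  # last position is ignored
loss = loss_fct(logits, labels)
```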

🧬 Label Set

The model predicts the following custom label set:

  • B_PLANT, I_PLANT
  • B_BIOTIC_STRESS, I_BIOTIC_STRESS
  • B_ABIOTIC_STRESS, I_ABIOTIC_STRESS
  • B_AGRONOMIC_RESPONSES, I_AGRONOMIC_RESPONSES
  • B_PHYSIOLOGICAL_RESPONSES, I_PHYSIOLOGICAL_RESPONSES
  • B_BIOCHEMICAL_RESPONSES, I_BIOCHEMICAL_RESPONSES
  • B_MOLECULAR_RESPONSES, I_MOLECULAR_RESPONSES
  • INCREASE, INDUCES_CHANGES, REGULATES, DECREASE
  • REDUCE, INFLUENCES, AFFECTS, AFFECTED, AUGMENT
  • O (outside)
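For custom training or evaluation, the inventory above maps to integer ids via `id2label`/`label2id`. A sketch assembling the mappings (the ordering here is an assumption; the canonical mapping lives in the model's `config.json` on the Hub):

```python
entity_types = [
    "PLANT", "BIOTIC_STRESS", "ABIOTIC_STRESS", "AGRONOMIC_RESPONSES",
    "PHYSIOLOGICAL_RESPONSES", "BIOCHEMICAL_RESPONSES", "MOLECULAR_RESPONSES",
]
relation_labels = [
    "INCREASE", "INDUCES_CHANGES", "REGULATES", "DECREASE",
    "REDUCE", "INFLUENCES", "AFFECTS", "AFFECTED", "AUGMENT",
]

labels = ["O"]                       # outside label first (assumed order)
for t in entity_types:
    labels += [f"B_{t}", f"I_{t}"]   # BIO pair per entity type
labels += relation_labels

id2label = dict(enumerate(labels))
label2id = {lab: i for i, lab in enumerate(labels)}

print(len(labels))  # → 24
```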

📚 Dataset

PlantDeBERTa was trained on a corpus of 142 annotated abstracts related to lentil stress responses, curated from ScienceDirect, SpringerLink, and Scopus, among other sources. Annotations were performed by plant science experts, ensuring semantic and ontological consistency.

  • Annotation schema: Aligned with Crop Ontology

  • POS tagging and rule-based corrections applied

  • Inter-annotator agreement: κ = 0.78
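Cohen's κ corrects observed agreement p_o for chance agreement p_e: κ = (p_o − p_e) / (1 − p_e). A minimal sketch for two annotators' token-label sequences (illustrative only; the reported 0.78 comes from the authors' annotation study):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(ann_a)
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n          # observed
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_e = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)         # chance
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["A", "B", "A"], ["A", "B", "A"]))  # → 1.0
```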


🔬 Limitations

  • Domain-specific: may underperform on general-purpose text
  • English only; no multilingual support
  • Entity boundaries may be fuzzy for novel expressions


🤝 Citation and License

This model is released under the MIT License.

If you use this model in your research or application, please consider citing:

@misc{khey2025plantdebertaopensourcelanguage,
      title={PlantDeBERTa: An Open Source Language Model for Plant Science}, 
      author={Hiba Khey and Amine Lakhder and Salma Rouichi and Imane El Ghabi and Kamal Hejjaoui and Younes En-nahli and Fahd Kalloubi and Moez Amri},
      year={2025},
      eprint={2506.08897},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08897}, 
}

✨ Contact & Contributions

Questions or issues? Open an issue or email us.

We welcome contributions and collaborations related to plant NER, knowledge graphs, or domain-specific NLP!
