bioner_medmentions_st21pv_finegrain

This is a named entity recognition model fine-tuned from the microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext model. It predicts spans with 91 possible labels. Because several label names contain commas, the labels are separated by semicolons here: Acquired Abnormality; Amino Acid Sequence; Amino Acid, Peptide, or Protein; Amphibian; Anatomical Abnormality; Anatomical Structure; Animal; Antibiotic; Bacterium; Biologic Function; Biologically Active Substance; Biomedical Occupation or Discipline; Biomedical or Dental Material; Bird; Body Location or Region; Body Part, Organ, or Organ Component; Body Space or Junction; Body Substance; Body System; Cell; Cell Component; Cell Function; Cell or Molecular Dysfunction; Chemical; Chemical Viewed Functionally; Chemical Viewed Structurally; Classification; Clinical Attribute; Congenital Abnormality; Diagnostic Procedure; Disease or Syndrome; Drug Delivery Device; Element, Ion, or Isotope; Embryonic Structure; Enzyme; Eukaryote; Experimental Model of Disease; Finding; Fish; Food; Fully Formed Anatomical Structure; Fungus; Gene or Genome; Genetic Function; Geographic Area; Hazardous or Poisonous Substance; Health Care Activity; Health Care Related Organization; Hormone; Human; Idea or Concept; Immunologic Factor; Indicator, Reagent, or Diagnostic Aid; Injury or Poisoning; Inorganic Chemical; Intellectual Product; Laboratory Procedure; Laboratory or Test Result; Mammal; Medical Device; Mental Process; Mental or Behavioral Dysfunction; Molecular Biology Research Technique; Molecular Function; Molecular Sequence; Neoplastic Process; Nucleic Acid, Nucleoside, or Nucleotide; Nucleotide Sequence; Organ or Tissue Function; Organic Chemical; Organism Function; Organization; Pathologic Function; Pharmacologic Substance; Physiologic Function; Plant; Population Group; Professional Society; Professional or Occupational Group; Receptor; Regulation or Law; Reptile; Research Activity; Self-help or Relief Organization; Sign or Symptom; Spatial Concept; Therapeutic or Preventive Procedure; Tissue; Vertebrate; Virus; and Vitamin.

The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.

Example Usage

The code below loads the model and applies it to the provided text. It uses the "max" aggregation strategy to merge individual token predictions into larger multi-token entities where needed.

from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification", 
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv_finegrain",
                        aggregation_strategy="max")

# Apply it to some text and print the predicted entities
entities = ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
print(entities)

# Output:
# [ {"entity_group": "Gene or Genome", "score": 0.96229, "word": "egfr", "start": 0, "end": 4},
#   {"entity_group": "Genetic Function", "score": 0.91988, "word": "t790m mutations", "start": 5, "end": 20},
#   {"entity_group": "Neoplastic Process", "score": 0.99883, "word": "nsclc", "start": 70, "end": 75},
#   {"entity_group": "Pharmacologic Substance", "score": 0.99931, "word": "erlotinib", "start": 95, "end": 104} ]
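
Because the base model is uncased, the "word" field in the output is lowercased. The sketch below shows one possible way to post-process the pipeline output: it drops low-confidence entities and recovers the original-case text from the "start" and "end" character offsets. The function name select_entities, the 0.9 threshold, and the sample entities are illustrative assumptions (the entities are hand-written in the pipeline's output format, not real model output).

```python
def select_entities(text, entities, min_score=0.9):
    """Keep confident entities and recover their original-case text."""
    selected = []
    for ent in entities:
        if ent["score"] < min_score:
            continue  # drop predictions below the (assumed) threshold
        selected.append({
            "label": ent["entity_group"],
            # Slice the input so the original casing is preserved
            "text": text[ent["start"]:ent["end"]],
            "score": round(float(ent["score"]), 3),
        })
    return selected


text = "EGFR mutations affect response to erlotinib."
# Hand-written entities in the pipeline's output format (not real model output)
sample_entities = [
    {"entity_group": "Gene or Genome", "score": 0.96, "word": "egfr", "start": 0, "end": 4},
    {"entity_group": "Finding", "score": 0.41, "word": "response", "start": 22, "end": 30},
    {"entity_group": "Pharmacologic Substance", "score": 0.99, "word": "erlotinib", "start": 34, "end": 43},
]

confident = select_entities(text, sample_entities)
for ent in confident:
    print(ent["label"], "->", ent["text"])
```

A suitable threshold depends on the label: the per-label performance reported below varies widely, so a single cut-off is only a starting point.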

Dataset Info

Source: The ST21pv version of MedMentions was downloaded from: https://github.com/chanzuckerberg/MedMentions/tree/master/st21pv

The dataset should be cited with: Mohan, Sunil, and Donghui Li. "MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts." Automated Knowledge Base Construction (AKBC), 2019, https://openreview.net/forum?id=SylxCx5pTQ. DOI: 10.24432/C5G59C

An overview of semantic types can be found at: https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html

Preprocessing: The training, validation and test splits were maintained from the original dataset. Concept identifiers (CUIs) were used to map each annotation to its associated UMLS entry to recover semantic types (from the MRSTY.RRF UMLS file). Semantic types provided in MedMentions were not used. Annotations were mapped to specific semantic type names using the Semantic Groups file available at: https://www.nlm.nih.gov/research/umls/knowledge_sources/semantic_network/index.html. This contrasts with the non-finegrained version, which mapped annotations to the broader semantic groups. The preprocessing script for this dataset is prepare_medmentions.py with the --finegrain flag.
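
The CUI-to-semantic-type lookup described above can be sketched as follows. This is a minimal illustration, not the code in prepare_medmentions.py: it assumes the standard pipe-delimited MRSTY.RRF layout (CUI|TUI|STN|STY|ATUI|CVF|), and the function name load_semantic_types and the sample rows (with placeholder CUIs and ATUIs) are hypothetical. How the actual script handles a CUI with several semantic types is not stated here, so the sketch simply keeps all of them.

```python
from collections import defaultdict


def load_semantic_types(mrsty_lines):
    """Map each CUI to the sorted list of its semantic type names (STY)."""
    types_for_cui = defaultdict(set)
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        cui, _tui, _stn, sty = fields[:4]
        types_for_cui[cui].add(sty)
    return {cui: sorted(types) for cui, types in types_for_cui.items()}


# Illustrative rows in MRSTY.RRF layout; the CUIs and ATUIs are placeholders
sample_rows = [
    "C0000001|T047|B2.2.1.2.1|Disease or Syndrome|AT00000001||\n",
    "C0000002|T121|A1.4.1.1.1|Pharmacologic Substance|AT00000002||\n",
    "C0000002|T109|A1.4.1.2.1|Organic Chemical|AT00000003||\n",
]

types_by_cui = load_semantic_types(sample_rows)
print(types_by_cui["C0000002"])
```

With the real file, the same function could be fed an open file handle, for example `load_semantic_types(open("MRSTY.RRF", encoding="utf-8"))`, where the path is wherever the UMLS release was unpacked.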

Performance

The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include token-level performance (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Acquired Abnormality | 0.273 | 0.240 | 0.255 | 50 |
| Amino Acid Sequence | 0.303 | 0.357 | 0.328 | 84 |
| Amino Acid, Peptide, or Protein | 0.258 | 0.277 | 0.267 | 166 |
| Anatomical Abnormality | 0.190 | 0.145 | 0.164 | 76 |
| Anatomical Structure | 0.160 | 0.308 | 0.211 | 13 |
| Animal | 0.605 | 0.742 | 0.667 | 93 |
| Antibiotic | 0.835 | 0.784 | 0.808 | 148 |
| Bacterium | 0.752 | 0.732 | 0.742 | 448 |
| Biologic Function | 0.341 | 0.369 | 0.355 | 157 |
| Biologically Active Substance | 0.586 | 0.640 | 0.612 | 2080 |
| Biomedical Occupation or Discipline | 0.437 | 0.444 | 0.441 | 196 |
| Biomedical or Dental Material | 0.369 | 0.452 | 0.406 | 197 |
| Bird | 0.810 | 0.819 | 0.814 | 83 |
| Body Location or Region | 0.386 | 0.461 | 0.420 | 232 |
| Body Part, Organ, or Organ Component | 0.565 | 0.604 | 0.584 | 1092 |
| Body Space or Junction | 0.272 | 0.311 | 0.290 | 90 |
| Body Substance | 0.542 | 0.731 | 0.622 | 212 |
| Body System | 0.554 | 0.511 | 0.532 | 90 |
| Cell | 0.680 | 0.737 | 0.708 | 924 |
| Cell Component | 0.589 | 0.637 | 0.612 | 311 |
| Cell Function | 0.484 | 0.599 | 0.535 | 499 |
| Cell or Molecular Dysfunction | 0.607 | 0.657 | 0.631 | 99 |
| Chemical | 0.348 | 0.333 | 0.340 | 72 |
| Chemical Viewed Functionally | 0.286 | 0.432 | 0.344 | 37 |
| Chemical Viewed Structurally | 0.443 | 0.427 | 0.435 | 82 |
| Classification | 0.520 | 0.544 | 0.532 | 309 |
| Clinical Attribute | 0.596 | 0.625 | 0.610 | 323 |
| Congenital Abnormality | 0.438 | 0.443 | 0.440 | 79 |
| Diagnostic Procedure | 0.672 | 0.648 | 0.660 | 735 |
| Disease or Syndrome | 0.757 | 0.774 | 0.766 | 2199 |
| Element, Ion, or Isotope | 0.713 | 0.657 | 0.684 | 385 |
| Embryonic Structure | 0.587 | 0.509 | 0.545 | 53 |
| Enzyme | 0.766 | 0.761 | 0.763 | 681 |
| Eukaryote | 0.745 | 0.793 | 0.768 | 397 |
| Experimental Model of Disease | 0.286 | 0.356 | 0.317 | 45 |
| Finding | 0.391 | 0.388 | 0.389 | 2759 |
| Fish | 1.000 | 0.947 | 0.973 | 19 |
| Food | 0.556 | 0.455 | 0.501 | 336 |
| Fully Formed Anatomical Structure | 0.000 | 0.000 | 0.000 | 1 |
| Fungus | 0.819 | 0.798 | 0.809 | 119 |
| Gene or Genome | 0.551 | 0.539 | 0.545 | 912 |
| Genetic Function | 0.598 | 0.646 | 0.621 | 652 |
| Geographic Area | 0.673 | 0.712 | 0.692 | 598 |
| Hazardous or Poisonous Substance | 0.513 | 0.522 | 0.518 | 293 |
| Health Care Activity | 0.487 | 0.458 | 0.472 | 1061 |
| Health Care Related Organization | 0.531 | 0.642 | 0.581 | 296 |
| Hormone | 0.806 | 0.746 | 0.775 | 189 |
| Human | 0.799 | 0.880 | 0.837 | 158 |
| Idea or Concept | 0.000 | 0.000 | 0.000 | 1 |
| Immunologic Factor | 0.674 | 0.606 | 0.638 | 434 |
| Indicator, Reagent, or Diagnostic Aid | 0.427 | 0.451 | 0.439 | 182 |
| Injury or Poisoning | 0.617 | 0.703 | 0.657 | 357 |
| Inorganic Chemical | 0.611 | 0.680 | 0.643 | 256 |
| Intellectual Product | 0.495 | 0.485 | 0.490 | 2075 |
| Laboratory Procedure | 0.445 | 0.452 | 0.448 | 908 |
| Laboratory or Test Result | 0.183 | 0.196 | 0.190 | 112 |
| Mammal | 0.778 | 0.838 | 0.807 | 456 |
| Medical Device | 0.434 | 0.437 | 0.435 | 355 |
| Mental Process | 0.546 | 0.546 | 0.546 | 740 |
| Mental or Behavioral Dysfunction | 0.710 | 0.774 | 0.741 | 518 |
| Molecular Biology Research Technique | 0.500 | 0.539 | 0.519 | 206 |
| Molecular Function | 0.504 | 0.555 | 0.528 | 719 |
| Molecular Sequence | 0.417 | 0.556 | 0.476 | 9 |
| Neoplastic Process | 0.761 | 0.745 | 0.753 | 918 |
| Nucleic Acid, Nucleoside, or Nucleotide | 0.331 | 0.450 | 0.381 | 109 |
| Nucleotide Sequence | 0.320 | 0.491 | 0.387 | 110 |
| Organ or Tissue Function | 0.482 | 0.425 | 0.452 | 247 |
| Organic Chemical | 0.396 | 0.464 | 0.427 | 511 |
| Organism Function | 0.462 | 0.518 | 0.488 | 471 |
| Organization | 0.270 | 0.442 | 0.335 | 77 |
| Pathologic Function | 0.541 | 0.541 | 0.541 | 669 |
| Pharmacologic Substance | 0.577 | 0.623 | 0.599 | 1258 |
| Physiologic Function | 0.283 | 0.286 | 0.284 | 182 |
| Plant | 0.603 | 0.618 | 0.610 | 403 |
| Population Group | 0.710 | 0.711 | 0.711 | 1263 |
| Professional Society | 0.000 | 0.000 | 0.000 | 7 |
| Professional or Occupational Group | 0.599 | 0.725 | 0.656 | 360 |
| Receptor | 0.614 | 0.686 | 0.648 | 271 |
| Regulation or Law | 0.182 | 0.125 | 0.148 | 16 |
| Reptile | 1.000 | 0.318 | 0.483 | 22 |
| Research Activity | 0.559 | 0.540 | 0.549 | 1653 |
| Self-help or Relief Organization | 0.000 | 0.000 | 0.000 | 2 |
| Sign or Symptom | 0.629 | 0.647 | 0.638 | 340 |
| Spatial Concept | 0.474 | 0.483 | 0.479 | 1282 |
| Therapeutic or Preventive Procedure | 0.609 | 0.617 | 0.613 | 2036 |
| Tissue | 0.562 | 0.525 | 0.543 | 259 |
| Vertebrate | 0.000 | 0.000 | 0.000 | 1 |
| Virus | 0.678 | 0.831 | 0.747 | 172 |
| Vitamin | 0.706 | 0.511 | 0.593 | 47 |
| macro avg | 0.507 | 0.525 | 0.512 | 40144 |
| weighted avg | 0.571 | 0.589 | 0.578 | 40144 |
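
To make the relationship between IOB2 token labels and span-level scores concrete, the sketch below decodes IOB2 tag sequences into labelled spans and scores them by exact match. It is an illustration only: the function names are hypothetical, the tag sequences are hand-written, and exact-match scoring is an assumption about how span-level metrics of this kind are typically computed, not a description of the evaluation code in the repo.

```python
def iob2_to_spans(tags):
    """Convert IOB2 tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the label matches
        if tag.startswith("I-") and label == tag[2:]:
            continue
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag with no matching B- is treated as a span start
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def span_scores(gold_tags, pred_tags):
    """Exact-match span precision, recall and F1 for one tag sequence."""
    gold = set(iob2_to_spans(gold_tags))
    pred = set(iob2_to_spans(pred_tags))
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hand-written example: the prediction truncates the second entity
gold = ["B-Gene or Genome", "O", "B-Neoplastic Process", "I-Neoplastic Process", "O"]
pred = ["B-Gene or Genome", "O", "B-Neoplastic Process", "O", "O"]

print(iob2_to_spans(gold))
print(span_scores(gold, pred))
```

Under exact matching, a prediction that covers only part of an entity counts as both a false positive and a false negative, which is one reason span-level scores can be lower than token-level ones.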

Hyperparameters

Hyperparameter tuning was done with optuna and the hyperparameter_search functionality. 100 trials were run. Early stopping was applied during training. The best performing model was selected using the macro F1 performance on the validation set. The selected hyperparameters are in the table below.

| Hyperparameter | Value |
|---|---|
| epochs | 25.0 |
| learning_rate | 7.86794379743531e-05 |
| per_device_train_batch_size | 16 |
| weight_decay | 0.06816454557507429 |
| warmup_ratio | 0.07903396276412193 |
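
For readers who want to set up a similar search, the sketch below shows what a search-space function for the tuned hyperparameters might look like. The ranges and batch-size choices are assumptions chosen for illustration and are not the ones used to train this model. It also assumes the search runs through the hyperparameter_search method of the transformers Trainer with the optuna backend, which this card does not state explicitly. LowestValueTrial is a hypothetical stand-in so the sketch runs without optuna installed.

```python
def hp_space(trial):
    """Hypothetical search space; the ranges are assumptions, not the author's."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.1),
    }


class LowestValueTrial:
    """Stand-in for an optuna trial: always picks the lowest or first option."""

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


sample = hp_space(LowestValueTrial())
print(sample)

# With a configured transformers Trainer (created with model_init=...),
# a 100-trial search maximising the objective could be launched as:
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space, backend="optuna", n_trials=100, direction="maximize"
# )
```

Sampling the learning rate on a log scale is a common choice because useful values span several orders of magnitude.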