bioner_medmentions_st21pv_finegrain
This is a named entity recognition model fine-tuned from the microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext model. It predicts spans with 91 possible labels. Because several label names contain commas, the labels are separated by semicolons here: Acquired Abnormality; Amino Acid Sequence; Amino Acid, Peptide, or Protein; Amphibian; Anatomical Abnormality; Anatomical Structure; Animal; Antibiotic; Bacterium; Biologic Function; Biologically Active Substance; Biomedical Occupation or Discipline; Biomedical or Dental Material; Bird; Body Location or Region; Body Part, Organ, or Organ Component; Body Space or Junction; Body Substance; Body System; Cell; Cell Component; Cell Function; Cell or Molecular Dysfunction; Chemical; Chemical Viewed Functionally; Chemical Viewed Structurally; Classification; Clinical Attribute; Congenital Abnormality; Diagnostic Procedure; Disease or Syndrome; Drug Delivery Device; Element, Ion, or Isotope; Embryonic Structure; Enzyme; Eukaryote; Experimental Model of Disease; Finding; Fish; Food; Fully Formed Anatomical Structure; Fungus; Gene or Genome; Genetic Function; Geographic Area; Hazardous or Poisonous Substance; Health Care Activity; Health Care Related Organization; Hormone; Human; Idea or Concept; Immunologic Factor; Indicator, Reagent, or Diagnostic Aid; Injury or Poisoning; Inorganic Chemical; Intellectual Product; Laboratory Procedure; Laboratory or Test Result; Mammal; Medical Device; Mental Process; Mental or Behavioral Dysfunction; Molecular Biology Research Technique; Molecular Function; Molecular Sequence; Neoplastic Process; Nucleic Acid, Nucleoside, or Nucleotide; Nucleotide Sequence; Organ or Tissue Function; Organic Chemical; Organism Function; Organization; Pathologic Function; Pharmacologic Substance; Physiologic Function; Plant; Population Group; Professional Society; Professional or Occupational Group; Receptor; Regulation or Law; Reptile; Research Activity; Self-help or Relief Organization; Sign or Symptom; Spatial Concept; Therapeutic or Preventive Procedure; Tissue; Vertebrate; Virus; and Vitamin.
The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.
Example Usage
The code below will load up the model and apply it to the provided text. It uses a simple aggregation strategy to post-process the individual tokens into larger multi-token entities where needed.
from transformers import pipeline
# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification",
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv_finegrain",
                        aggregation_strategy="max")
# Apply it to some text
ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
# Output:
# [ {"entity_group": "Gene or Genome", "score": 0.96229, "word": "egfr", "start": 0, "end": 4},
# {"entity_group": "Genetic Function", "score": 0.91988, "word": "t790m mutations", "start": 5, "end": 20},
# {"entity_group": "Neoplastic Process", "score": 0.99883, "word": "nsclc", "start": 51, "end": 56},
# {"entity_group": "Pharmacologic Substance", "score": 0.99931, "word": "erlotinib", "start": 76, "end": 85} ]
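The pipeline returns a list of dictionaries, so downstream filtering is plain Python. The sketch below groups the entities from the example output by label and drops low-confidence predictions. The helper name `group_entities` and the 0.95 threshold are assumptions chosen for illustration and are not part of the model or its repository.

```python
from collections import defaultdict

# Entities copied from the example output above (scores and offsets as reported there).
entities = [
    {"entity_group": "Gene or Genome", "score": 0.96229, "word": "egfr", "start": 0, "end": 4},
    {"entity_group": "Genetic Function", "score": 0.91988, "word": "t790m mutations", "start": 5, "end": 20},
    {"entity_group": "Neoplastic Process", "score": 0.99883, "word": "nsclc", "start": 51, "end": 56},
    {"entity_group": "Pharmacologic Substance", "score": 0.99931, "word": "erlotinib", "start": 76, "end": 85},
]


def group_entities(entities, min_score=0.0):
    """Group entity text by label, keeping only predictions scoring at least min_score.

    Hypothetical helper: the name and the threshold are illustrative choices.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


by_label = group_entities(entities, min_score=0.95)
for label, words in by_label.items():
    print(f"{label}: {words}")
```

A suitable threshold depends on the application. Given the per-label precision reported in the Performance section, a higher cut-off may be worth considering for the weaker labels.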
Dataset Info
Source: The ST21pv version of MedMentions was downloaded from: https://github.com/chanzuckerberg/MedMentions/tree/master/st21pv
The dataset should be cited with: Mohan, Sunil, and Donghui Li. "MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts." Automated Knowledge Base Construction (AKBC), 2019, https://openreview.net/forum?id=SylxCx5pTQ. DOI: 10.24432/C5G59C
An overview of semantic types can be found at: https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
Preprocessing: The training, validation and test splits were maintained from the original dataset. Concept identifiers (CUIs) were used to map each annotation to its associated UMLS entry to recover semantic types (from the MRSTY.RRF UMLS file). Semantic types provided in MedMentions were not used. Annotations were mapped to specific semantic type names using the Semantic Groups file available at: https://www.nlm.nih.gov/research/umls/knowledge_sources/semantic_network/index.html. This contrasts with the non-finegrained version, which maps annotations to the broader semantic groups. The preprocessing script for this dataset is prepare_medmentions.py with the --finegrain flag.
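As a rough illustration of the CUI lookup described above, the sketch below reads rows in the pipe-delimited MRSTY.RRF layout (CUI, TUI, STN, STY, ATUI, CVF) and builds a CUI-to-semantic-type mapping. It is not the code from prepare_medmentions.py: the function name `load_semantic_types` is hypothetical, and the sample rows use made-up identifiers so the example runs without a UMLS licence.

```python
import io
from collections import defaultdict

# Illustrative rows in the MRSTY.RRF column order: CUI|TUI|STN|STY|ATUI|CVF|
# The CUIs and ATUIs below are invented for this example.
SAMPLE_MRSTY = (
    "C9000001|T047|B2.2.1.2.1|Disease or Syndrome|AT00000001||\n"
    "C9000002|T121|A1.4.1.1.1|Pharmacologic Substance|AT00000002||\n"
    "C9000002|T109|A1.4.1.2.1|Organic Chemical|AT00000003||\n"
)


def load_semantic_types(handle):
    """Map each CUI to the list of semantic type names (STY column) it carries."""
    cui_to_types = defaultdict(list)
    for line in handle:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        cui, semantic_type = fields[0], fields[3]
        cui_to_types[cui].append(semantic_type)
    return dict(cui_to_types)


cui_to_types = load_semantic_types(io.StringIO(SAMPLE_MRSTY))
print(cui_to_types["C9000002"])
```

A concept can carry more than one semantic type, as the second sample CUI shows. How the actual preprocessing script resolves such cases is not stated here, so this sketch simply keeps all of them.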
Performance
The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include the performance at token level (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).
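To make the difference between token-level and span-level evaluation concrete, here is a minimal sketch of decoding IOB2 tags into labelled spans. The function `iob2_to_spans` is a hypothetical helper written for this illustration, not the evaluation code used for the results; it treats an I- tag that does not continue an open span of the same label as outside any entity.

```python
def iob2_to_spans(tags):
    """Convert a sequence of IOB2 tags into (start, end, label) spans.

    Token indices are used, with `end` exclusive.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues_span = tag.startswith("I-") and tag[2:] == label
        if not continues_span:
            # Anything other than a matching I- tag closes the open span, if any.
            if label is not None:
                spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


tags = ["B-Gene or Genome", "B-Genetic Function", "I-Genetic Function", "O", "B-Neoplastic Process"]
print(iob2_to_spans(tags))
```

A span-level score counts a prediction as correct only when the whole span and its label match, which is why it is usually stricter than the token-level figures.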
Label | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Acquired Abnormality | 0.273 | 0.240 | 0.255 | 50 |
Amino Acid Sequence | 0.303 | 0.357 | 0.328 | 84 |
Amino Acid, Peptide, or Protein | 0.258 | 0.277 | 0.267 | 166 |
Anatomical Abnormality | 0.190 | 0.145 | 0.164 | 76 |
Anatomical Structure | 0.160 | 0.308 | 0.211 | 13 |
Animal | 0.605 | 0.742 | 0.667 | 93 |
Antibiotic | 0.835 | 0.784 | 0.808 | 148 |
Bacterium | 0.752 | 0.732 | 0.742 | 448 |
Biologic Function | 0.341 | 0.369 | 0.355 | 157 |
Biologically Active Substance | 0.586 | 0.640 | 0.612 | 2080 |
Biomedical Occupation or Discipline | 0.437 | 0.444 | 0.441 | 196 |
Biomedical or Dental Material | 0.369 | 0.452 | 0.406 | 197 |
Bird | 0.810 | 0.819 | 0.814 | 83 |
Body Location or Region | 0.386 | 0.461 | 0.420 | 232 |
Body Part, Organ, or Organ Component | 0.565 | 0.604 | 0.584 | 1092 |
Body Space or Junction | 0.272 | 0.311 | 0.290 | 90 |
Body Substance | 0.542 | 0.731 | 0.622 | 212 |
Body System | 0.554 | 0.511 | 0.532 | 90 |
Cell | 0.680 | 0.737 | 0.708 | 924 |
Cell Component | 0.589 | 0.637 | 0.612 | 311 |
Cell Function | 0.484 | 0.599 | 0.535 | 499 |
Cell or Molecular Dysfunction | 0.607 | 0.657 | 0.631 | 99 |
Chemical | 0.348 | 0.333 | 0.340 | 72 |
Chemical Viewed Functionally | 0.286 | 0.432 | 0.344 | 37 |
Chemical Viewed Structurally | 0.443 | 0.427 | 0.435 | 82 |
Classification | 0.520 | 0.544 | 0.532 | 309 |
Clinical Attribute | 0.596 | 0.625 | 0.610 | 323 |
Congenital Abnormality | 0.438 | 0.443 | 0.440 | 79 |
Diagnostic Procedure | 0.672 | 0.648 | 0.660 | 735 |
Disease or Syndrome | 0.757 | 0.774 | 0.766 | 2199 |
Element, Ion, or Isotope | 0.713 | 0.657 | 0.684 | 385 |
Embryonic Structure | 0.587 | 0.509 | 0.545 | 53 |
Enzyme | 0.766 | 0.761 | 0.763 | 681 |
Eukaryote | 0.745 | 0.793 | 0.768 | 397 |
Experimental Model of Disease | 0.286 | 0.356 | 0.317 | 45 |
Finding | 0.391 | 0.388 | 0.389 | 2759 |
Fish | 1.000 | 0.947 | 0.973 | 19 |
Food | 0.556 | 0.455 | 0.501 | 336 |
Fully Formed Anatomical Structure | 0.000 | 0.000 | 0.000 | 1 |
Fungus | 0.819 | 0.798 | 0.809 | 119 |
Gene or Genome | 0.551 | 0.539 | 0.545 | 912 |
Genetic Function | 0.598 | 0.646 | 0.621 | 652 |
Geographic Area | 0.673 | 0.712 | 0.692 | 598 |
Hazardous or Poisonous Substance | 0.513 | 0.522 | 0.518 | 293 |
Health Care Activity | 0.487 | 0.458 | 0.472 | 1061 |
Health Care Related Organization | 0.531 | 0.642 | 0.581 | 296 |
Hormone | 0.806 | 0.746 | 0.775 | 189 |
Human | 0.799 | 0.880 | 0.837 | 158 |
Idea or Concept | 0.000 | 0.000 | 0.000 | 1 |
Immunologic Factor | 0.674 | 0.606 | 0.638 | 434 |
Indicator, Reagent, or Diagnostic Aid | 0.427 | 0.451 | 0.439 | 182 |
Injury or Poisoning | 0.617 | 0.703 | 0.657 | 357 |
Inorganic Chemical | 0.611 | 0.680 | 0.643 | 256 |
Intellectual Product | 0.495 | 0.485 | 0.490 | 2075 |
Laboratory Procedure | 0.445 | 0.452 | 0.448 | 908 |
Laboratory or Test Result | 0.183 | 0.196 | 0.190 | 112 |
Mammal | 0.778 | 0.838 | 0.807 | 456 |
Medical Device | 0.434 | 0.437 | 0.435 | 355 |
Mental Process | 0.546 | 0.546 | 0.546 | 740 |
Mental or Behavioral Dysfunction | 0.710 | 0.774 | 0.741 | 518 |
Molecular Biology Research Technique | 0.500 | 0.539 | 0.519 | 206 |
Molecular Function | 0.504 | 0.555 | 0.528 | 719 |
Molecular Sequence | 0.417 | 0.556 | 0.476 | 9 |
Neoplastic Process | 0.761 | 0.745 | 0.753 | 918 |
Nucleic Acid, Nucleoside, or Nucleotide | 0.331 | 0.450 | 0.381 | 109 |
Nucleotide Sequence | 0.320 | 0.491 | 0.387 | 110 |
Organ or Tissue Function | 0.482 | 0.425 | 0.452 | 247 |
Organic Chemical | 0.396 | 0.464 | 0.427 | 511 |
Organism Function | 0.462 | 0.518 | 0.488 | 471 |
Organization | 0.270 | 0.442 | 0.335 | 77 |
Pathologic Function | 0.541 | 0.541 | 0.541 | 669 |
Pharmacologic Substance | 0.577 | 0.623 | 0.599 | 1258 |
Physiologic Function | 0.283 | 0.286 | 0.284 | 182 |
Plant | 0.603 | 0.618 | 0.610 | 403 |
Population Group | 0.710 | 0.711 | 0.711 | 1263 |
Professional Society | 0.000 | 0.000 | 0.000 | 7 |
Professional or Occupational Group | 0.599 | 0.725 | 0.656 | 360 |
Receptor | 0.614 | 0.686 | 0.648 | 271 |
Regulation or Law | 0.182 | 0.125 | 0.148 | 16 |
Reptile | 1.000 | 0.318 | 0.483 | 22 |
Research Activity | 0.559 | 0.540 | 0.549 | 1653 |
Self-help or Relief Organization | 0.000 | 0.000 | 0.000 | 2 |
Sign or Symptom | 0.629 | 0.647 | 0.638 | 340 |
Spatial Concept | 0.474 | 0.483 | 0.479 | 1282 |
Therapeutic or Preventive Procedure | 0.609 | 0.617 | 0.613 | 2036 |
Tissue | 0.562 | 0.525 | 0.543 | 259 |
Vertebrate | 0.000 | 0.000 | 0.000 | 1 |
Virus | 0.678 | 0.831 | 0.747 | 172 |
Vitamin | 0.706 | 0.511 | 0.593 | 47 |
macro avg | 0.507 | 0.525 | 0.512 | 40144 |
weighted avg | 0.571 | 0.589 | 0.578 | 40144 |
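The two summary rows differ because a macro average weights every label equally, while a weighted average weights each label by its support. The sketch below shows the arithmetic on three rows taken from the table, assuming the usual definitions of both averages. Because it uses only a subset of labels, its results will not match the summary rows above.

```python
# (label, F1-score, support) for three rows copied from the table above.
rows = [
    ("Fish", 0.973, 19),
    ("Finding", 0.389, 2759),
    ("Disease or Syndrome", 0.766, 2199),
]

# Macro average: unweighted mean of the per-label scores.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: mean of the per-label scores weighted by support.
total_support = sum(support for _, _, support in rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"macro F1 (subset):    {macro_f1:.3f}")
print(f"weighted F1 (subset): {weighted_f1:.3f}")
```

In this subset the rare but well-predicted Fish label lifts the macro average, while the frequent and harder Finding label pulls the weighted average down.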
Hyperparameters
Hyperparameter tuning was done with optuna and the hyperparameter_search functionality. 100 trials were run. Early stopping was applied during training. The best performing model was selected using the macro F1 performance on the validation set. The selected hyperparameters are in the table below.
Hyperparameter | Value |
---|---|
epochs | 25.0 |
learning_rate | 7.86794379743531e-05 |
per_device_train_batch_size | 16 |
weight_decay | 0.06816454557507429 |
warmup_ratio | 0.07903396276412193 |
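For readers who want to run a similar search, the sketch below shows the general shape of a search-space function that the Hugging Face Trainer's hyperparameter_search accepts when using the Optuna backend. The ranges and batch-size choices are assumptions made for illustration, and the table's "epochs" entry is assumed to correspond to the Trainer argument num_train_epochs. The search space actually used for this model is defined in the training repository.

```python
def hp_space(trial):
    """Hypothetical search space; every range below is an illustrative guess."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 5, 30),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.1),
    }


# With a configured transformers.Trainer (not constructed here), a search of
# 100 trials maximising the validation metric could then be started with:
#   trainer.hyperparameter_search(
#       hp_space=hp_space, backend="optuna", n_trials=100, direction="maximize"
#   )
```

The function only describes the space. Optuna supplies the `trial` object and the Trainer re-runs training with each sampled configuration.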