bioner_ncbi_disease

This is a named entity recognition (NER) model fine-tuned from the microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext model. It predicts spans with one of two labels: DiseaseClass and SpecificDisease.

The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.

Example Usage

The code below loads the model and applies it to the provided text. It uses the pipeline's "max" aggregation strategy to merge individual tokens into larger multi-token entities where needed.

from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification", 
                        model="Glasgow-AI4BioMed/bioner_ncbi_disease",
                        aggregation_strategy="max")

# Apply it to some text
ner_pipeline("Tuberculous is an infectious disease.")

# Output:
# [ {"entity_group": "SpecificDisease", "score": 0.99873, "word": "tuberculous", "start": 0, "end": 11},
#   {"entity_group": "DiseaseClass", "score": 0.99418, "word": "infectious disease", "start": 18, "end": 36} ]
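Because the base model is uncased, the "word" field in the output is lowercased. The "start" and "end" character offsets can be used to recover the original surface text instead. The following is a minimal sketch that works on a pipeline result like the one above; the helper name group_entities and the min_score threshold are illustrative assumptions, not part of the model or the transformers API.

```python
from collections import defaultdict


def group_entities(text, entities, min_score=0.5):
    """Group predicted mentions by label, using character offsets.

    `entities` is a list of dicts shaped like the pipeline output above.
    `min_score` is a hypothetical confidence cut-off chosen for illustration.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            # Slice the original text so the original casing is preserved.
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


text = "Tuberculous is an infectious disease."
# Entities copied from the example output above.
entities = [
    {"entity_group": "SpecificDisease", "score": 0.99873,
     "word": "tuberculous", "start": 0, "end": 11},
    {"entity_group": "DiseaseClass", "score": 0.99418,
     "word": "infectious disease", "start": 18, "end": 36},
]

mentions = group_entities(text, entities)
print(mentions)
```

In practice, `entities` would be the return value of `ner_pipeline(text)` rather than a hard-coded list.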

Dataset Info

Source: The NCBI Disease dataset was downloaded from: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/

The dataset should be cited with: Doğan, Rezarta Islamaj, Robert Leaman, and Zhiyong Lu. "NCBI disease corpus: a resource for disease name recognition and concept normalization." Journal of biomedical informatics 47 (2014): 1-10. DOI: 10.1016/j.jbi.2013.12.006

Preprocessing: The training/validation/test split was maintained from the original dataset. The annotations were filtered down to only 'DiseaseClass' and 'SpecificDisease'. The preprocessing script for this dataset is prepare_ncbi_disease.py.
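The label filtering step can be pictured with the sketch below. It is not taken from prepare_ncbi_disease.py: the annotation record layout (dicts with "start", "end" and "label" keys) and the function name are assumptions made for illustration, and the example records are invented. The original corpus also contains other annotation classes, such as Modifier and CompositeMention, which this filter would drop.

```python
# Labels retained for this model, as stated above.
KEPT_LABELS = {"DiseaseClass", "SpecificDisease"}


def filter_annotations(annotations, kept_labels=KEPT_LABELS):
    """Keep only annotations whose label is in `kept_labels`.

    The record layout is a hypothetical one, chosen for this sketch.
    """
    return [ann for ann in annotations if ann["label"] in kept_labels]


# Invented example records, for illustration only.
example = [
    {"start": 0, "end": 11, "label": "SpecificDisease"},
    {"start": 18, "end": 36, "label": "DiseaseClass"},
    {"start": 40, "end": 46, "label": "Modifier"},
]

filtered = filter_annotations(example)
print([ann["label"] for ann in filtered])
```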

Performance

The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include the performance at token level (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).

Label	Precision	Recall	F1-score	Support
DiseaseClass	0.592	0.769	0.669	121
SpecificDisease	0.816	0.809	0.813	555
macro avg	0.704	0.789	0.741	676
weighted avg	0.776	0.802	0.787	676
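As a sanity check on how the two average rows relate to the per-label rows, the sketch below recomputes them from the rounded values in the table. The macro average is the unweighted mean over labels, and the weighted average weights each label by its support. The function names are illustrative.

```python
# Per-label rows copied from the table: (precision, recall, f1, support).
rows = {
    "DiseaseClass": (0.592, 0.769, 0.669, 121),
    "SpecificDisease": (0.816, 0.809, 0.813, 555),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and F1 across labels."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows.values()) / n for i in range(3))


def weighted_average(rows):
    """Mean of precision, recall and F1 weighted by each label's support."""
    total = sum(r[3] for r in rows.values())
    return tuple(sum(r[i] * r[3] for r in rows.values()) / total for i in range(3))


macro = macro_average(rows)
weighted = weighted_average(rows)
print("macro avg   ", [round(v, 3) for v in macro])
print("weighted avg", [round(v, 3) for v in weighted])
```

Rounded to three decimals, these reproduce the "macro avg" and "weighted avg" rows. The lower macro scores reflect the weaker DiseaseClass results, which count equally with SpecificDisease despite the much smaller support.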

Hyperparameters

Hyperparameter tuning was done with Optuna and the hyperparameter_search functionality, running 100 trials. Early stopping was applied during training. The best-performing model was selected using the macro F1 performance on the validation set. The selected hyperparameters are in the table below.

Hyperparameter	Value
epochs	9.0
learning_rate	4.2369194386745274e-05
per_device_train_batch_size	8
weight_decay	0.11095292966544487
warmup_ratio	0.009641097927077978
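For anyone wanting to reuse these values, the sketch below collects them as keyword arguments for transformers.TrainingArguments. It is an assumption that "epochs" in the table corresponds to num_train_epochs; the output directory name is hypothetical, and the actual training script in the bioner repository may set further options (for example, early stopping) that are not shown here.

```python
# Selected values copied from the table above.
selected_hyperparameters = {
    "num_train_epochs": 9,  # listed as "epochs" in the table (assumed mapping)
    "learning_rate": 4.2369194386745274e-05,
    "per_device_train_batch_size": 8,
    "weight_decay": 0.11095292966544487,
    "warmup_ratio": 0.009641097927077978,
}


def build_training_arguments(output_dir="bioner_ncbi_disease_run"):
    """Build TrainingArguments from the selected values.

    `output_dir` is a hypothetical path. The import is kept local so the
    dictionary above can be inspected without transformers installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **selected_hyperparameters)


print(selected_hyperparameters)
```

Argument names can change between transformers releases, so check them against the installed version before training.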

Model size: 109M params (Safetensors format, tensor type F32)
