distilbert-medication-ner

This model is a fine-tuned version of distilbert-base-cased, trained on synthetic medication data generated by Synthea.

More details on how this model was trained can be found on GitHub.

Model Description

A fine-tuned NER model that recognizes five entity types (DRUG, DOSAGE, ROUTE, BRAND, QUANTITY) in medication strings such as:

Ibuprofen 100 MG Oral Tablet
1 ML medroxyprogesterone acetate 150 MG/ML Injection
Acetaminophen 325 MG / Oxycodone Hydrochloride 10 MG Oral Tablet [Percocet]

The model was trained and evaluated on small, manually annotated datasets (train_n_samples=309, eval_n_samples=335) and achieved the following evaluation metrics:

Precision: 0.998
Recall: 0.983
F1: 0.991
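
As a quick sanity check, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded figures above. The helper name `f1_score` is illustrative only and is not taken from the model's training code, and it assumes the reported values are overall (micro-averaged) scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported above.
f1 = f1_score(0.998, 0.983)
print(f"{f1:.4f}")  # 0.9904
```

The recomputed value is about 0.990. The small gap from the reported 0.991 is within what rounding precision and recall to three decimals can explain.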

Usage

Load model:

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "jackleejm/distilbert-medication-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

Set up a pipeline and run inference:

from transformers import pipeline

ner_pipeline = pipeline(
  task="token-classification",
  model=model,
  tokenizer=tokenizer,
  aggregation_strategy="simple",
  device_map="auto",
)

texts = ["Acetaminophen 325 MG Oral Tablet"]  # avoid shadowing the built-in input()
results = ner_pipeline(texts)

print(results)

# Outputs
[
  [
    {
      "word": "Acetaminophen",
      "score": np.float32(0.99948627),
      "entity_group": "DRUG",
      "start": 0,
      "end": 13
    },
    {
      "word": "325 MG",
      "score": np.float32(0.99882394),
      "entity_group": "DOSAGE",
      "start": 14,
      "end": 20
    },
    {
      "word": "Oral Tablet",
      "score": np.float32(0.9994621),
      "entity_group": "ROUTE",
      "start": 21,
      "end": 32
    }
  ]
]
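
For downstream use it is often convenient to group the aggregated spans by entity type. The following is a minimal sketch, assuming the output structure shown above (one list of dicts per input string, each with `entity_group`, `word`, and `score` keys). The helper `group_entities` and its `min_score` threshold are hypothetical and are not part of this model or of Transformers.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Map each entity_group to the list of words predicted for it.

    `entities` is the list of dicts produced for ONE input string.
    Spans scoring below `min_score` are dropped.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if float(ent["score"]) >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Sample copied from the output above, with scores written as plain floats.
sample = [
    {"word": "Acetaminophen", "score": 0.99948627, "entity_group": "DRUG", "start": 0, "end": 13},
    {"word": "325 MG", "score": 0.99882394, "entity_group": "DOSAGE", "start": 14, "end": 20},
    {"word": "Oral Tablet", "score": 0.9994621, "entity_group": "ROUTE", "start": 21, "end": 32},
]

print(group_entities(sample))
# {'DRUG': ['Acetaminophen'], 'DOSAGE': ['325 MG'], 'ROUTE': ['Oral Tablet']}
```

With the pipeline output from the example, the helper would be applied once per input string, for instance `[group_entities(r) for r in results]`.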

Training Procedure

Training Hyperparameters

learning_rate: 2e-5
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 20
weight_decay: 0.01
evaluation_strategy: "steps"
eval_steps: 50
load_best_model_at_end: True
metric_for_best_model: "f1"
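
To put these settings in perspective, the sketch below estimates how many optimizer steps and evaluation runs they imply. It assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept. None of these is stated above. The sample count of 309 comes from the Model Description.

```python
import math

train_n_samples = 309            # from the Model Description
per_device_train_batch_size = 16
num_train_epochs = 20
eval_steps = 50

# Assumptions (not stated in the card): one device, gradient accumulation of 1,
# and the last partial batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_n_samples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_evaluations = total_steps // eval_steps

print(steps_per_epoch, total_steps, num_evaluations)  # 20 400 8
```

Under these assumptions, evaluation runs eight times over the course of training.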

Framework versions

Transformers 4.49.0
Pytorch 2.6.0
Datasets 3.3.2
Tokenizers 0.21.0
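
To compare a local environment against the versions listed above, something like the following could be used. It is a sketch that relies only on the standard library. The helper `installed_version` is illustrative, and the distribution names are assumed to be the usual PyPI ones (PyTorch is published as `torch`).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by PyPI distribution name.
listed = {
    "transformers": "4.49.0",
    "torch": "2.6.0",
    "datasets": "3.3.2",
    "tokenizers": "0.21.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name, expected in listed.items():
    found = installed_version(name)
    # startswith() tolerates local build suffixes such as "2.6.0+cu124".
    status = "ok" if found is not None and found.startswith(expected) else "differs"
    print(f"{name}: listed {expected}, installed {found} ({status})")
```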