metadata
datasets:
  - semaj83/ctmatch
language:
  - en
metrics:
  - f1
pipeline_tag: text-classification
tags:
  - medical
widget:
  - text: >-
      Patient is a 45-year-old man with a history of anaplastic astrocytoma of
      the spine complicated by severe lower extremity weakness and urinary
      retention s/p Foley catheter, high-dose steroids, hypertension, and
      chronic pain. Therapy included field radiation t10-l1 followed by 11
      cycles of temozolomide 7 days on and 7 days off. This was followed by
      CPT-11 Weekly x4 with Avastin Q2 weeks/ 2 weeks rest and repeat cycle.
      [SEP] eligible ages (years): 18.0-99.0, Low-Grade Astrocytoma, Nos
      Histologically or cytologically confirmed low-grade astrocytoma that has
      progressed, recurred, or persisted after initial therapy, including
      radiotherapy Previously treated with at least 1 prior standard therapy
      (e.g., radiotherapy, chemotherapy, immunotherapy, or cytodifferentiating
      agent)
  - text: >-
      Patient is a 45-year-old man with a history of anaplastic astrocytoma of
      the spine complicated by severe lower extremity weakness and urinary
      retention s/p Foley catheter, high-dose steroids, hypertension, and
      chronic pain. Therapy included field radiation t10-l1 followed by 11
      cycles of temozolomide 7 days on and 7 days off. This was followed by
      CPT-11 Weekly x4 with Avastin Q2 weeks/ 2 weeks rest and repeat cycle.
      [SEP] eligible ages (years): 21.0-80.0, Muscle Spasticity Healthy Adult
      patients with selective corticospinal tract dysfunction Minimum age 21
      years; maximum age 80 years Moderate severity of weakness (greater than or
      equal to MRC Grade 4) Adult normal volunteers Severe weakness with
      inability to maintain voluntary contractions Significant sensory
      impairment For TMS studies only: pregnancy, implanted devices such as
      pacemakers, medication pumps or defibrillators, metal in the cranium
      except the mouth, intracardiac lines, history of seizures

Model Card for semaj83/scibert_finetuned_ctmatch

This model classifies "<topic> [SEP] <clinical trial document>" pairs into three classes: 0 (not relevant), 1 (partially relevant), and 2 (relevant).

Model Details

Fine-tuned from 'allenai/scibert_scivocab_uncased' on labelled (topic, document, relevance label) triples. These triples were processed using ctproc and collated from the openly available TREC22 Precision Medicine and CSIRO datasets, available here: https://huggingface.co/datasets/semaj83/ctmatch_classification

Model Description

A transformer model with a linear sequence classification head, trained with cross-entropy loss on ~30k triples and evaluated using F1.

  • Developed by: James Kelly
  • Model type: SequenceClassification
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: allenai/scibert_scivocab_uncased

Model Sources

Uses

Direct Use

[More Information Needed]

Downstream Use

The ctmatch IR pipeline, which matches a large set of clinical trial documents to a text topic.

Bias, Risks, and Limitations

Please see the dataset sources for information on the patient descriptions (topics), which were constructed by medical professionals for these datasets. The related dataset contains no personal health information about real individuals. Links to the sources are given on the dataset page on the Hub.

The classifier performs much better at deciding whether a pair is 0 (not relevant) than at differentiating between 1 (partially relevant) and 2 (relevant), though this is still an important clinical task.

How to Get Started with the Model

Use the code below to get started with the model.


from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("semaj83/scibert_finetuned_ctmatch")

model = AutoModelForSequenceClassification.from_pretrained("semaj83/scibert_finetuned_ctmatch")
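Once the tokenizer and model are loaded, scoring one pair could look like the sketch below. It assumes the input is a single string in the "<topic> [SEP] <clinical trial document>" format used by the widget examples, and that class ids map to labels as described at the top of this card. The helper names (build_input, label_from_logits, classify) are illustrative and not part of ctmatch.

```python
import math
from typing import Sequence

# Class ids and names as described in this model card.
ID2LABEL = {0: "not relevant", 1: "partially relevant", 2: "relevant"}


def build_input(topic: str, trial_doc: str) -> str:
    """Join a topic and a trial document as '<topic> [SEP] <document>'."""
    return f"{topic.strip()} [SEP] {trial_doc.strip()}"


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits: Sequence[float]) -> tuple[str, float]:
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


def classify(topic: str, trial_doc: str, tokenizer, model) -> tuple[str, float]:
    """Score one pair with an already loaded tokenizer and model."""
    inputs = tokenizer(
        build_input(topic, trial_doc),
        truncation=True,
        max_length=512,  # same limit as max_sequence_length used in training
        return_tensors="pt",
    )
    # Take the first (only) row of logits and convert it to plain floats.
    logits = model(**inputs).logits[0].detach().tolist()
    return label_from_logits(logits)
```

With the objects loaded above, classify(topic, trial_doc, tokenizer, model) returns a (label, probability) pair. Long trial documents are truncated at 512 tokens, so content near the end of a document may not influence the prediction.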

Training Details

See the notebook in the ctmatch repo.

Training Data

https://huggingface.co/datasets/semaj83/ctmatch

Preprocessing

If you use the labelled ctmatch dataset, the tokenizer alone is sufficient. If you start from raw topics and/or clinical trial documents, you may need ctproc or another method to extract the relevant fields and preprocess the text.

Training Hyperparameters

  • max_sequence_length=512
  • batch_size=8
  • padding='max_length'
  • truncation=True
  • learning_rate=2e-5
  • train_epochs=5
  • weight_decay=0.01
  • warmup_steps=500
  • seed=42
  • splits={"train":0.8, "val":0.1}
  • use_trainer=True
  • fp16=True
  • early_stopping=True
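Since use_trainer=True suggests the Hugging Face Trainer, these settings could be expressed roughly as below. This is a sketch, not the author's training script: the mapping of names such as batch_size and train_epochs onto TrainingArguments keywords is an assumption, the output directory is hypothetical, and the splits and early stopping settings are left out because the card does not say how they were implemented.

```python
# Tokenizer call arguments, taken directly from the list above.
TOKENIZER_KWARGS = {
    "max_length": 512,        # max_sequence_length
    "padding": "max_length",
    "truncation": True,
}

# Assumed mapping onto transformers.TrainingArguments keyword names.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 8,  # batch_size (assumed to be per device)
    "learning_rate": 2e-5,
    "num_train_epochs": 5,             # train_epochs
    "weight_decay": 0.01,
    "warmup_steps": 500,
    "seed": 42,
    "fp16": True,                      # mixed precision, typically requires a GPU
}

# Usage, not executed here ("ctmatch_out" is a hypothetical directory):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="ctmatch_out", **TRAINING_KWARGS)
#   batch = tokenizer(texts, **TOKENIZER_KWARGS)
```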

Evaluation

scikit-learn classification report on a random test split:


                precision    recall  f1-score   support

           0       0.88      0.93      0.90      5430
           1       0.56      0.56      0.56      1331
           2       0.65      0.49      0.56      1178

    accuracy                           0.80      7939
   macro avg       0.70      0.66      0.67      7939
weighted avg       0.79      0.80      0.79      7939
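The two average rows can be recomputed from the per-class rows, which makes the gap between them easier to read. The sketch below copies the rounded values from the report, so its results agree with the table only to two decimals. The macro average weights the three classes equally, while the weighted average is dominated by class 0, which accounts for 5430 of the 7939 test pairs.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
REPORT = {
    0: (0.88, 0.93, 0.90, 5430),
    1: (0.56, 0.56, 0.56, 1331),
    2: (0.65, 0.49, 0.56, 1178),
}
SUPPORT = 3  # index of the support column
METRICS = {"precision": 0, "recall": 1, "f1-score": 2}


def macro_avg(column: int) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[column] for row in REPORT.values()) / len(REPORT)


def weighted_avg(column: int) -> float:
    """Mean over classes weighted by support (number of test pairs)."""
    total = sum(row[SUPPORT] for row in REPORT.values())
    return sum(row[column] * row[SUPPORT] for row in REPORT.values()) / total


for name, column in METRICS.items():
    print(f"{name:>9}: macro={macro_avg(column):.2f} weighted={weighted_avg(column):.2f}")
```

The support-weighted recall is, by definition, the overall accuracy, which is why both appear as 0.80 in the table.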

Model Card Authors

James Kelly

Model Card Contact

[email protected]