	---
	datasets:
	- semaj83/ctmatch
	language:
	- en
	metrics:
	- f1
	pipeline_tag: text-classification
	tags:
	- medical
	widget:
	- text: "Patient is a 45-year-old man with a history of anaplastic astrocytoma of the spine complicated by severe lower extremity weakness and urinary retention s/p Foley catheter, high-dose steroids, hypertension, and chronic pain. Therapy included field radiation t10-l1 followed by 11 cycles of temozolomide 7 days on and 7 days off. This was followed by CPT-11 Weekly x4 with Avastin Q2 weeks/ 2 weeks rest and repeat cycle. [SEP] eligible ages (years): 18.0-99.0, Low-Grade Astrocytoma, Nos Histologically or cytologically confirmed low-grade astrocytoma that has progressed, recurred, or persisted after initial therapy, including radiotherapy Previously treated with at least 1 prior standard therapy (e.g., radiotherapy, chemotherapy, immunotherapy, or cytodifferentiating agent)"
	- text: "Patient is a 45-year-old man with a history of anaplastic astrocytoma of the spine complicated by severe lower extremity weakness and urinary retention s/p Foley catheter, high-dose steroids, hypertension, and chronic pain. Therapy included field radiation t10-l1 followed by 11 cycles of temozolomide 7 days on and 7 days off. This was followed by CPT-11 Weekly x4 with Avastin Q2 weeks/ 2 weeks rest and repeat cycle. [SEP] eligible ages (years): 21.0-80.0, Muscle Spasticity Healthy Adult patients with selective corticospinal tract dysfunction Minimum age 21 years; maximum age 80 years Moderate severity of weakness (greater than or equal to MRC Grade 4) Adult normal volunteers Severe weakness with inability to maintain voluntary contractions Significant sensory impairment For TMS studies only: pregnancy, implanted devices such as pacemakers, medication pumps or defibrillators, metal in the cranium except the mouth, intracardiac lines, history of seizures"
	---

	# Model Card for semaj83/scibert_finetuned_ctmatch


	This model classifies "\<topic\> [SEP] \<clinical trial document\>" pairs into three classes: 0 (not relevant), 1 (partially relevant), and 2 (relevant).

	## Model Details

	Fine-tuned from `allenai/scibert_scivocab_uncased` on labelled (topic, document, relevance label) triples.
	These triples were processed using ctproc and collated from the openly available TREC22 Precision Medicine and CSIRO datasets. They are available here:
	https://huggingface.co/datasets/semaj83/ctmatch_classification

	### Model Description

	Transformer model with a linear sequence classification head, trained with cross-entropy loss on ~30k triples and evaluated using F1.



	- Developed by: James Kelly
	- Model type: SequenceClassification
	- Language(s) (NLP): English
	- License: MIT
	- Finetuned from model: `allenai/scibert_scivocab_uncased`

	### Model Sources

	- Repository: https://github.com/semajyllek/ctmatch
	- Paper [optional]: [More Information Needed]

	## Uses


	### Direct Use



	[More Information Needed]

	### Downstream Use

	The ctmatch IR pipeline, which matches a large set of clinical trial documents to a text topic.



	## Bias, Risks, and Limitations

	Please see the dataset sources for information on the patient descriptions (topics), which were constructed by medical professionals for these datasets.
	The related dataset contains no personal health information about real individuals.
	Links to the sources are given at the dataset's location on the Hub.

	The classifier performs much better at deciding whether a pair is 0 (not relevant) than at differentiating between 1 (partially relevant) and 2 (relevant),
	though this is still an important clinical task.


	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Load the fine-tuned tokenizer and the 3-class sequence classifier from the Hub.
	tokenizer = AutoTokenizer.from_pretrained("semaj83/scibert_finetuned_ctmatch")
	model = AutoModelForSequenceClassification.from_pretrained("semaj83/scibert_finetuned_ctmatch")
	```
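The model scores a single string in which the topic and the trial text are joined by a literal `[SEP]`, as in the widget examples. The following is a minimal inference sketch, assuming `torch` and `transformers` are installed and the Hub is reachable. The helper names `build_pair`, `label_from_logits`, and `classify` are illustrative and not part of ctmatch; the 512-token limit is taken from the training hyperparameters listed in this card.

```python
MODEL_ID = "semaj83/scibert_finetuned_ctmatch"

# Class indices as described in this card (the strings are illustrative).
LABELS = {0: "not relevant", 1: "partially relevant", 2: "relevant"}


def build_pair(topic: str, document: str) -> str:
    """Join a topic and a trial document as '<topic> [SEP] <document>'."""
    return f"{topic.strip()} [SEP] {document.strip()}"


def label_from_logits(logits) -> int:
    """Return the index of the largest logit (argmax) for one example."""
    values = list(logits)
    return max(range(len(values)), key=values.__getitem__)


def classify(topic: str, document: str) -> str:
    """Download the model (network access required) and classify one pair."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(
        build_pair(topic, document),
        truncation=True,   # cut inputs longer than the training length
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return LABELS[label_from_logits(logits.tolist())]
```

Calling `classify(topic, document)` with a patient description and the text of a trial returns one of the three label strings. For ranking many trials against one topic, loading the tokenizer and model once outside the function would avoid repeated downloads.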

	## Training Details

	See the training notebook in the ctmatch repository (https://github.com/semajyllek/ctmatch).



	### Training Data

	https://huggingface.co/datasets/semaj83/ctmatch


	#### Preprocessing

	If you use the labelled ctmatch dataset, the tokenizer alone is sufficient. If you use raw topics and/or clinical trial documents,
	you may need ctproc or another method to extract the relevant fields and preprocess the text.


	#### Training Hyperparameters


	- `max_sequence_length=512`
	- `batch_size=8`
	- `padding='max_length'`
	- `truncation=True`
	- `learning_rate=2e-5`
	- `train_epochs=5`
	- `weight_decay=0.01`
	- `warmup_steps=500`
	- `seed=42`
	- `splits={"train":0.8, "val":0.1}`
	- `use_trainer=True`
	- `fp16=True`
	- `early_stopping=True`
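For readers who want to reproduce a similar run with the Hugging Face `Trainer`, the hyperparameters above can be mapped onto `TrainingArguments` fields roughly as follows. This is a sketch under assumptions, not the author's training script: it assumes `batch_size` is the per-device training batch size and that the fraction left over by `splits` is the test split. The `output_dir` value and the function name are hypothetical, and early stopping is not shown.

```python
# Hyperparameters from this card, renamed to TrainingArguments field names.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 8,  # card: batch_size (assumed per device)
    "learning_rate": 2e-5,
    "num_train_epochs": 5,             # card: train_epochs
    "weight_decay": 0.01,
    "warmup_steps": 500,
    "seed": 42,
    "fp16": True,                      # mixed precision; typically requires a GPU
}

# Tokenizer call settings from the same list.
TOKENIZER_KWARGS = {"max_length": 512, "padding": "max_length", "truncation": True}

# Listed split fractions; the remainder is assumed to be the test split.
SPLITS = {"train": 0.8, "val": 0.1}
TEST_FRACTION = round(1.0 - sum(SPLITS.values()), 10)


def build_training_args(output_dir: str = "ctmatch_out"):
    """Create TrainingArguments (lazy import keeps the dicts usable alone)."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The tokenizer settings would be passed at tokenization time, for example `tokenizer(text, **TOKENIZER_KWARGS)`; the notebook in the ctmatch repository remains the authoritative reference for the actual training setup.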


	## Evaluation

	scikit-learn classification report on a random test split:

	```
	              precision    recall  f1-score   support

	           0       0.88      0.93      0.90      5430
	           1       0.56      0.56      0.56      1331
	           2       0.65      0.49      0.56      1178

	    accuracy                           0.80      7939
	   macro avg       0.70      0.66      0.67      7939
	weighted avg       0.79      0.80      0.79      7939
	```
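To make the two average rows easier to read, the sketch below recomputes them from the per-class rows of the report. The numbers are copied from the table; the function names are illustrative. The macro average treats the three classes equally, while the weighted average is dominated by class 0, which accounts for 5430 of the 7939 test pairs.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
ROWS = {
    0: (0.88, 0.93, 0.90, 5430),
    1: (0.56, 0.56, 0.56, 1331),
    2: (0.65, 0.49, 0.56, 1178),
}


def macro_avg(column: int) -> float:
    """Unweighted mean of one metric column across the classes."""
    return sum(row[column] for row in ROWS.values()) / len(ROWS)


def weighted_avg(column: int) -> float:
    """Mean of one metric column weighted by each class's support."""
    total = sum(row[3] for row in ROWS.values())
    return sum(row[column] * row[3] for row in ROWS.values()) / total


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Columns 0, 1, 2 are precision, recall, and f1.
for name, fn in (("macro avg", macro_avg), ("weighted avg", weighted_avg)):
    print(name, [round(fn(col), 2) for col in range(3)])
```

Rounded to two decimals, these reproduce the `macro avg` and `weighted avg` rows. The support-weighted recall also matches the reported accuracy of 0.80, as expected for single-label classification.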


	## Model Card Authors

	James Kelly

	## Model Card Contact

	[email protected]