id: MedRoBERTa.nl_CardioNER.128xtokenWindow
name: MedRoBERTa.nl_CardioNER.128xtokenWindow
description: >-
  MedRoBERTa.nl with a finetuned head (dense layers [1024,512,256]) for a
  multilabel NER task with a token window of 128
license: gpl-3.0
language: nl
tags:
  - lexical semantic
  - span classification
  - science
  - biology
  - clinical ner
  - biomedical
  - ner,medical
  - bionlp
base_model:
  - CLTL/MedRoBERTa.nl
pipeline_tag: token-classification
datasets:
  - DT4H/CardioCCC
  - UMCU/cardioccc_dutch

Model Card for Cardioner.nl 128

This is a CLTL/MedRoBERTa.nl base model with finetuned heads for span classification. We used IOB tagging, which facilitates the aggregation of predictions over sequences. This specific model was trained on a batch of about 500 span-labeled documents.
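To illustrate why IOB tags make aggregation straightforward, here is a minimal sketch that merges token-level tags into spans. The tag strings (`B-medication`, `I-disease`, ...) and the example sentence are assumptions for illustration; the model's actual label names may differ.

```python
def iob_to_spans(tokens, tags):
    """Merge token-level IOB tags into (label, first_idx, last_idx, text) spans."""
    spans = []
    current = None  # [label, first_idx, last_idx] of the span being built
    for i, tag in enumerate(tags):
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        if tag.startswith("I-") and current is not None and current[0] == label:
            current[2] = i  # continue the open span
            continue
        if current is not None:
            spans.append(current)  # close the open span
        # A B- tag (or a stray I- tag) opens a new span; O opens nothing.
        current = [label, i, i] if label is not None else None
    if current is not None:
        spans.append(current)
    return [(lab, s, e, " ".join(tokens[s:e + 1])) for lab, s, e in spans]


# Hypothetical example sentence and tags
tokens = ["Patiënt", "gebruikt", "metoprolol", "wegens", "atrium", "fibrilleren"]
tags = ["O", "O", "B-medication", "O", "B-disease", "I-disease"]
spans = iob_to_spans(tokens, tags)
```

In practice the `aggregation_strategy` option of the transformers pipeline performs this kind of grouping for you.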

This version was trained with context windows of 128 tokens. For the chunking we used a paragraph-based splitter.
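As a rough illustration of paragraph-based chunking, the sketch below greedily packs consecutive paragraphs into chunks of at most 128 tokens. This is not the splitter used for training: the function name `pack_paragraphs`, the greedy strategy, and the whitespace token count (a stand-in for the model tokenizer) are all assumptions.

```python
def pack_paragraphs(paragraphs, max_tokens=128, count_tokens=lambda s: len(s.split())):
    """Greedily pack consecutive paragraphs into chunks of at most max_tokens.

    count_tokens is a whitespace-based stand-in; a real setup would count
    tokens with the model tokenizer. A single paragraph longer than
    max_tokens is kept as one oversized chunk and would need further splitting.
    """
    chunks, current, current_len = [], [], 0
    for par in paragraphs:
        n = count_tokens(par)
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))  # close the full chunk
            current, current_len = [], 0
        current.append(par)
        current_len += n
    if current:
        chunks.append("\n".join(current))
    return chunks


# Three hypothetical paragraphs of 60 whitespace tokens each
paragraphs = [" ".join(["woord"] * 60) for _ in range(3)]
chunks = pack_paragraphs(paragraphs)
```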

Training was performed with 10-fold cross-validation, with SLERP (chained) averaging of the best epochs per fold.
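For readers unfamiliar with SLERP (spherical linear interpolation), the following sketch shows the idea on flat weight vectors. How the folds were actually chained is not specified here; the running-average scheme in `chained_slerp` (merging the k-th vector with interpolation factor 1/k) is an assumption, as are the function names.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        # Degenerate input: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if math.sin(omega) < eps:
        # (Anti)parallel vectors: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * omega) / math.sin(omega)
    w_b = math.sin(t * omega) / math.sin(omega)
    return [w_a * x + w_b * y for x, y in zip(a, b)]


def chained_slerp(weight_vectors):
    """Assumed chaining scheme: fold k is merged into the running result with t = 1/k."""
    merged = weight_vectors[0]
    for k, weights in enumerate(weight_vectors[1:], start=2):
        merged = slerp(merged, weights, 1.0 / k)
    return merged


# Two toy "fold" weight vectors at a right angle
merged = chained_slerp([[1.0, 0.0], [0.0, 1.0]])
```

Unlike a plain arithmetic mean, the interpolated vector in this toy example keeps unit norm.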

NOTE: the base weights are exactly the same as those of the original MedRoBERTa.nl. We added an expressive head with about 1.4 million parameters, which was trained on the CardioCCC NER dataset.
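As a sanity check on the head size, the sketch below counts the weights and biases of a stack of dense layers [1024, 512, 256]. The input width of 768 (the usual RoBERTa-base hidden size) and the 9 output labels (B-/I- for four classes plus O) are assumptions, and any normalization layers are ignored.

```python
def dense_head_param_count(input_dim, hidden_dims, num_labels):
    """Count weights plus biases of a fully connected stack."""
    dims = [input_dim, *hidden_dims, num_labels]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


hidden_size = 768  # assumption: RoBERTa-base hidden size
num_labels = 9     # assumption: B-/I- tags for 4 classes, plus O
n_params = dense_head_param_count(hidden_size, [1024, 512, 256], num_labels)
print(n_params)
```

Under these assumptions the count comes to roughly 1.45 million, in the same range as the figure quoted above.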

Expected input and output

The input should be a string with Dutch clinical text related to cardiology.

CardioNER.nl_128 is a multiclass span classification model. The classes that can be predicted are:

  • procedure,
  • medication,
  • disease,
  • symptom.

Extracting span classification from CardioNER.nl_128xtokenWindow

The following script converts a string of fewer than 128 tokens to a list of span predictions.

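from transformers import pipeline

# `model` is the Hub repository ID (or local path) of this model.
# `SOME_TEXT` is a Dutch clinical text of fewer than 128 tokens.
le_pipe = pipeline('ner',
                   model=model,
                   tokenizer=model,
                   aggregation_strategy="simple",
                   trust_remote_code=True,
                   device=-1)  # -1 runs on CPU

named_ents = le_pipe(SOME_TEXT)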
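With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below groups such a list per class. The example entities and scores are made up for illustration (they are not real model output), and `group_by_class` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_class(named_ents, min_score=0.0):
    """Group pipeline output per class as (start, end, word) tuples."""
    grouped = defaultdict(list)
    for ent in named_ents:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append((ent["start"], ent["end"], ent["word"]))
    return dict(grouped)


# Made-up example in the shape of the pipeline output
text = "Patiënt gebruikt metoprolol wegens atriumfibrilleren en dyspneu"
example_ents = [
    {"entity_group": "medication", "score": 0.98, "word": "metoprolol", "start": 17, "end": 27},
    {"entity_group": "disease", "score": 0.91, "word": "atriumfibrilleren", "start": 35, "end": 52},
    {"entity_group": "symptom", "score": 0.40, "word": "dyspneu", "start": 56, "end": 63},
]
by_class = group_by_class(example_ents, min_score=0.5)
```

The `start` and `end` values are character offsets into the input string, so `text[start:end]` recovers the original span.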

To process a string of arbitrary length, you can split the string into sentences or paragraphs using e.g. pysbd or spaCy (sentencizer) and iteratively parse the resulting list with the span-classification pipe. You can also use the stride option built into the transformers pipeline, although this is limited to non-overlapping strides, requires a fast tokenizer, and does not work with aggregation_strategy=None:

named_ents = le_pipe(SOME_TEXT, stride=256)
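The split-and-iterate approach could look roughly like the sketch below, which splits on blank lines and shifts the character offsets back to the full text. The helper names and the blank-line splitting rule are assumptions; `fake_pipe` is only a stand-in so the example runs without the model, and in practice you would pass `le_pipe` instead.

```python
import re


def split_paragraphs(text):
    """Yield (offset, paragraph) for runs of non-empty lines."""
    for match in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        yield match.start(), match.group()


def predict_long_text(text, pipe):
    """Run pipe per paragraph and map span offsets back to the full text."""
    ents = []
    for offset, paragraph in split_paragraphs(text):
        for ent in pipe(paragraph):
            ent = dict(ent)  # copy, so the pipe output is not modified
            ent["start"] += offset
            ent["end"] += offset
            ents.append(ent)
    return ents


def fake_pipe(segment):
    """Stand-in for le_pipe: tags the literal word 'metoprolol'."""
    i = segment.find("metoprolol")
    if i < 0:
        return []
    return [{"entity_group": "medication", "score": 1.0,
             "word": "metoprolol", "start": i, "end": i + len("metoprolol")}]


long_text = "Anamnese: geen klachten.\n\nMedicatie: metoprolol 50 mg."
long_ents = predict_long_text(long_text, fake_pipe)
```

Paragraphs that exceed the 128-token window would still need a finer split, for example into sentences.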

Data description

CardioCCC: manually labeled cardiology discharge letters, annotated with the classes procedure, medication, disease and symptom.

Acknowledgement

This is part of the DT4H project.

Doi and reference

For more details about training/eval and other scripts, see the CardioNER GitHub repo. For more information on the background, see the Datatools4Heart Hugging Face page and website.