license: gpl-3.0
datasets:
  - oscar-corpus/OSCAR-2301
language:
  - nl
base_model:
  - DTAI-KULeuven/robbert-2023-dutch-base
pipeline_tag: text-classification
tags:
  - medical

We used GPT-4.1-nano, via PubScience, to label generic texts from OSCAR as medical or non-medical. Of the 400,000 texts labeled, about 40,000 (roughly 10%) were labeled medical (the positive class). We then trained a sequence classifier on 80,000 samples with a 50/50 class ratio, i.e. 40,000 samples per class.
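The 50/50 training set described above implies downsampling the majority (non-medical) class. The following is a minimal sketch of one way to do that, not the actual training code: the function name `balanced_sample`, the record layout (`text`, `label`), and the toy pool are assumptions made for illustration.

```python
import random

def balanced_sample(records, n_total, seed=0):
    """Draw n_total records with an equal number per binary label (hypothetical helper)."""
    rng = random.Random(seed)
    per_class = n_total // 2
    positives = [r for r in records if r["label"] == 1]
    negatives = [r for r in records if r["label"] == 0]
    if len(positives) < per_class or len(negatives) < per_class:
        raise ValueError("not enough examples in one of the classes")
    # Sample without replacement from each class, then mix the two halves.
    sample = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sample)
    return sample

# Toy stand-in for the labeled pool: 10% positive, mirroring the ratio above
# at a much smaller scale (4,000 records instead of 400,000).
pool = [{"text": f"doc {i}", "label": int(i % 10 == 0)} for i in range(4000)]
train = balanced_sample(pool, n_total=800)
```

At full scale the same call would use `n_total=80000` on the 400,000 labeled texts, assuming the labels are stored as 0/1.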

The model can be used, for example, to approximately identify medical texts in general-purpose corpora.
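As a rough illustration of that use case, the sketch below filters a corpus by a per-text medical score. The names `filter_medical` and `toy_score` and the 0.5 threshold are assumptions; `toy_score` is a keyword-based stand-in so the example runs without downloading anything, and in practice it would be replaced by a function that returns this classifier's probability for the medical class.

```python
def filter_medical(texts, score_fn, threshold=0.5):
    """Keep texts whose medical score is at least `threshold` (hypothetical helper)."""
    return [t for t in texts if score_fn(t) >= threshold]

def toy_score(text):
    # Stand-in scorer, NOT the model: 1.0 if a medical keyword occurs, else 0.0.
    keywords = ("diagnose", "medicatie")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.0

corpus = [
    "De arts stelde de diagnose en schreef medicatie voor.",
    "Het weer in Utrecht is vandaag zonnig.",
]
medical = filter_medical(corpus, toy_score)
```

Because the training labels come from an LLM rather than human annotation, the threshold is worth tuning on a small hand-checked sample before filtering a large corpus.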