---
license: gpl-3.0
datasets:
- oscar-corpus/OSCAR-2301
language:
- nl
base_model:
- DTAI-KULeuven/robbert-2023-dutch-base
pipeline_tag: text-classification
tags:
- medical
---

We used GPT-4.1 nano to classify generic texts from OSCAR as medical or non-medical, using [PubScience](https://github.com/bramiozo/PubScience/tree/main/pubscience/label). We labeled 400,000 texts, of which about 40,000 (roughly 10%) were labeled as positive, i.e. medical.

We then fine-tuned a sequence classifier, starting from [DTAI-KULeuven/robbert-2023-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-base), on 80,000 samples with a 50/50 class ratio.
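
The balanced training set described above can be built by downsampling the majority class. The following is a minimal sketch of that step, not the actual training code: the function name `balanced_sample`, the record layout (`{"text": ..., "label": 0 or 1}`), and the fixed seed are assumptions made for illustration.

```python
import random


def balanced_sample(records, n_per_class, seed=0):
    """Draw n_per_class medical (label 1) and non-medical (label 0) records.

    `records` is assumed to be a list of dicts with a "label" key holding 0 or 1.
    """
    rng = random.Random(seed)
    positives = [r for r in records if r["label"] == 1]
    negatives = [r for r in records if r["label"] == 0]
    if len(positives) < n_per_class or len(negatives) < n_per_class:
        raise ValueError("not enough records in one of the classes")
    # Sample without replacement from each class, then shuffle the union.
    sample = rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)
    rng.shuffle(sample)
    return sample
```

With about 40,000 positives among the 400,000 labeled texts, calling `balanced_sample(records, 40_000)` would yield an 80,000-sample set with a 50/50 class ratio.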

This model can be used, for example, to approximately identify medical texts in general corpora.
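
A possible way to filter a corpus with the classifier is sketched below. Several details are assumptions rather than facts from this card: the helper names `filter_medical` and `load_classifier`, the label name `LABEL_1` for the medical class (the default naming in `transformers` when no custom label mapping is stored; check the model's `config.json`), and the 0.5 threshold. The model ID is left as a parameter because the repository name is not stated here.

```python
def filter_medical(texts, classify, medical_label="LABEL_1", threshold=0.5):
    """Keep the texts that `classify` marks as medical.

    `classify` is any callable that maps a list of texts to a list of
    {"label": str, "score": float} dicts, one per input text.
    """
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == medical_label and pred["score"] >= threshold
    ]


def load_classifier(model_id):
    """Wrap a transformers text-classification pipeline as a `classify` callable.

    `model_id` is the Hugging Face repository name of this model (not given here).
    """
    from transformers import pipeline  # imported lazily; requires `transformers`

    clf = pipeline("text-classification", model=model_id)
    # truncation=True cuts inputs that exceed the model's maximum length.
    return lambda texts: clf(texts, truncation=True)
```

Usage would then look like `filter_medical(corpus_texts, load_classifier(model_id))`, where `model_id` is this model's repository name. Since the training labels come from an LLM rather than human annotators, the output is best treated as an approximate filter.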