UMCU/DutchMedicalTextDetector_v1_HEADONLY

	---
	license: gpl-3.0
	datasets:
	- oscar-corpus/OSCAR-2301
	language:
	- nl
	base_model:
	- DTAI-KULeuven/robbert-2023-dutch-base
	pipeline_tag: text-classification
	tags:
	- medical
	---


	We used GPT-4.1-nano to label generic texts from OSCAR as medical or non-medical, using [PubScience](https://github.com/bramiozo/PubScience/tree/main/pubscience/label). In total we labeled 400,000 texts, of which about 40,000 were labeled as positive (medical).
	We then trained a sequence classifier on 80,000 samples with a 50/50 class ratio.
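
A 50/50 training set of 80,000 samples means roughly 40,000 texts per class, so nearly all positives are used and the negatives are downsampled. The snippet below is a minimal sketch of such balanced sampling on toy data; the function name `balanced_sample` and the seed are illustrative assumptions, not the code that was used to build the training set.

```python
import random


def balanced_sample(texts, labels, per_class, seed=0):
    """Draw `per_class` texts from every class and return shuffled (text, label) pairs.

    Hypothetical helper for illustration; not part of PubScience.
    """
    rng = random.Random(seed)

    # Group the texts by their label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    sample = []
    for label, items in by_class.items():
        if len(items) < per_class:
            raise ValueError(f"class {label!r} has only {len(items)} items")
        # Sample without replacement within each class.
        sample.extend((text, label) for text in rng.sample(items, per_class))

    rng.shuffle(sample)
    return sample


# Toy data with a 10% positive rate, similar in proportion to the labeled pool.
texts = [f"doc {i}" for i in range(1000)]
labels = [1 if i % 10 == 0 else 0 for i in range(1000)]

subset = balanced_sample(texts, labels, per_class=100)
print(len(subset), sum(label for _, label in subset))
```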

	The model can be used, for example, to approximately identify medical texts in general-purpose corpora.
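
As a rough usage sketch, the filtering step could look like the code below. It assumes that this repository loads through the standard `transformers` text-classification pipeline and that the positive class is named `LABEL_1`; neither is confirmed by this card, so check the model's `config.json` (`id2label`) before relying on it. `load_detector` and `keep_medical` are hypothetical helper names.

```python
def load_detector(model_id="UMCU/DutchMedicalTextDetector_v1_HEADONLY"):
    """Return a callable mapping a list of texts to a list of {"label", "score"} dicts.

    Assumption: the repository is loadable with the standard pipeline API.
    """
    from transformers import pipeline  # imported lazily; needs `pip install transformers`

    pipe = pipeline("text-classification", model=model_id)
    # Truncate long documents to the model's maximum input length.
    return lambda texts: pipe(texts, truncation=True)


def keep_medical(texts, classify, positive_label="LABEL_1", threshold=0.5):
    """Keep the texts whose top prediction is `positive_label` with score >= threshold.

    `positive_label` is an assumed label name, not taken from the model card.
    """
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == positive_label and pred["score"] >= threshold
    ]


# Offline demonstration with a stand-in classifier, so the snippet runs without
# downloading the model. Replace `fake_classify` with `load_detector()` in practice.
def fake_classify(texts):
    return [
        {"label": "LABEL_1" if "patiënt" in text else "LABEL_0", "score": 0.9}
        for text in texts
    ]


corpus = [
    "De patiënt kreeg antibiotica voorgeschreven.",
    "De trein naar Utrecht heeft vertraging.",
]
medical = keep_medical(corpus, fake_classify)
print(medical)
```

Because the training labels come from an LLM rather than from human annotation, the threshold is best tuned on a small hand-checked sample of the target corpus.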