Multilingual Symptom Extraction (English + Amharic)

Model Description

This is a fine-tuned XLM-R-base model for extracting symptoms from patient-generated text in English and Amharic. It was developed as part of an MSc thesis at the Technical University of Munich to support AI-powered symptom extraction platforms in multilingual healthcare settings.
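
A minimal usage sketch, assuming the checkpoint loads as a standard token-classification model through the Hugging Face `transformers` library (the card does not state this explicitly). The repository id `kechemale/eng-am-symptom-ner` comes from this model page; the helper names `load_extractor` and `summarize`, the example sentence, and the label name in the test data are illustrative assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple

MODEL_ID = "kechemale/eng-am-symptom-ner"  # repository id shown on this model page


def load_extractor() -> Callable[[str], List[Dict[str, Any]]]:
    """Build a token-classification pipeline (assumes transformers and torch are installed)."""
    from transformers import pipeline  # lazy import: only needed when the model is loaded

    # aggregation_strategy="simple" merges sub-word pieces into whole entity spans.
    return pipeline("token-classification", model=MODEL_ID, aggregation_strategy="simple")


def summarize(entities: List[Dict[str, Any]]) -> List[Tuple[str, str, float]]:
    """Reduce aggregated pipeline output to (text, label, rounded score) triples."""
    return [
        (e["word"].strip(), e["entity_group"], round(float(e["score"]), 3))
        for e in entities
    ]


# Example call (downloads the model on first use):
# extractor = load_extractor()
# print(summarize(extractor("I have had a headache and a dry cough for three days.")))
```

The label names the model actually emits are not documented here, so inspect the raw pipeline output before relying on a specific label string.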

The model performs named entity recognition (NER): it identifies symptom mentions in unstructured text and labels them with the BIO (Beginning, Inside, Outside) tagging scheme.
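
To illustrate how BIO tags map back to symptom mentions, here is a small sketch that groups tagged tokens into spans. The label name `SYMPTOM`, the example sentence, and the function `bio_to_spans` are assumptions for illustration; the model's actual label set may differ.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity mentions."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # Start a new mention; a stray I- tag is treated like B-.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [token], tag_label
        elif prefix == "I":
            # Continue the mention that is currently open.
            current.append(token)
        else:
            # "O" closes any open mention.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], ""
    if current:
        spans.append((label, " ".join(current)))
    return spans


tokens = ["I", "have", "a", "severe", "headache", "and", "nausea"]
tags = ["O", "O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM"]
print(bio_to_spans(tokens, tags))
# → [('SYMPTOM', 'severe headache'), ('SYMPTOM', 'nausea')]
```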

Intended Uses & Limitations

Intended uses:

  • Automatic symptom extraction from patient-generated texts in English and Amharic.
  • Research in multilingual biomedical NLP.
  • Integration into AI diagnostic platforms for low-resource languages.

Limitations:

  • Only trained on the datasets described in the thesis; performance on other domains may vary.
  • Not validated for real-world clinical deployment without further testing and ethical approval.

Training Data

  • Unlabeled English, Amharic, and Tigrinya corpora collected from diverse sources for domain adaptation.
  • Fine-tuning used a combination of publicly available datasets and synthetically generated data, which was then labeled.
  • Preprocessed with tokenization, normalization, and BIO tagging in Doccano.
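
As a rough sketch of the BIO tagging step, the snippet below converts one character-span annotation into token-level BIO tags. It assumes a Doccano-style record of the form `{"text": ..., "label": [[start, end, label]]}` and plain whitespace tokenization; the thesis pipeline may use a different export format and tokenizer, and `spans_to_bio` is a hypothetical helper name.

```python
import re
from typing import Any, Dict, List, Tuple


def spans_to_bio(record: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Turn character-offset spans into (token, BIO tag) pairs."""
    text = record["text"]
    spans = sorted(record["label"])  # each span is [start, end, label]
    pairs: List[Tuple[str, str]] = []
    for match in re.finditer(r"\S+", text):  # assumption: whitespace tokenization
        tag = "O"
        for start, end, label in spans:
            if match.start() < end and match.end() > start:  # token overlaps the span
                # The token covering the span start gets B-, later tokens get I-.
                tag = ("B-" if match.start() <= start else "I-") + label
                break
        pairs.append((match.group(), tag))
    return pairs


record = {
    "text": "I have a severe headache and nausea",
    "label": [[9, 24, "SYMPTOM"], [29, 35, "SYMPTOM"]],
}
for token, tag in spans_to_bio(record):
    print(token, tag)
```

Whitespace tokenization leaves punctuation attached to words, so a real pipeline would typically align spans against the model tokenizer's offsets instead.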

Evaluation

  • Metrics: Precision, Recall, F1-score for symptom extraction.
  • Results and detailed evaluation are described in the MSc thesis:
    Negash Desalegn. (2025). Bridging the Linguistic Gap in Healthcare: A Multilingual AI Approach for Symptom Extraction in Low-Resource Languages. Technical University of Munich.
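
For readers who want to reproduce the listed metrics on their own data, the sketch below computes precision, recall, and F1 under exact-match, entity-level scoring. This matching criterion is an assumption; the thesis may score at the token level or with partial matches. The spans and the name `prf1` are illustrative only.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label)


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative spans, not results from the thesis.
gold = {(3, 5, "SYMPTOM"), (6, 7, "SYMPTOM"), (9, 10, "SYMPTOM")}
pred = {(3, 5, "SYMPTOM"), (6, 7, "SYMPTOM")}
precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```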