|
--- |
|
library_name: transformers |
|
tags: |
|
- unsloth |
|
- trl |
|
- sft |
|
- med |
|
- mistral |
|
- quaero |
|
--- |
|
|
|
# Mistral 7B fine-tuned on Quaero for Named Entity Recognition (Generative) |
|
|
|
This model is a 16-bit merged version of [unsloth/mistral-7b-instruct-v0.3](https://huggingface.co/unsloth/mistral-7b-instruct-v0.3), fine-tuned on the [Quaero French medical dataset](https://quaerofrenchmed.limsi.fr/) using a generative approach to Named Entity Recognition (NER). |
|
|
|
## Task |
|
|
|
The model was trained to extract entities from French biomedical sentences (medlines) using a structured, prompt-based format. |
|
|
|
| Tag | Description | |
|
| ------ | ----------------------------------------------------------- | |
|
| `DISO` | **Diseases** or health-related conditions | |
|
| `ANAT` | **Anatomical parts** (organs, tissues, body regions, etc.) | |
|
| `PROC` | **Medical or surgical procedures** | |
|
| `DEVI` | **Medical devices or instruments** | |
|
| `CHEM` | **Chemical substances or medications** | |
|
| `LIVB` | **Living beings** (e.g. humans, animals, bacteria, viruses) | |
|
| `GEOG` | **Geographical locations** (e.g. countries, regions) | |
|
| `OBJC` | **Physical objects** not covered by other categories | |
|
| `PHEN` | **Biological processes** (e.g. inflammation, mutation) | |
|
| `PHYS` | **Physiological functions** (e.g. respiration, vision) | |
|
|
|
I use `<>` as a separator and the output format is : |
|
|
|
``` |
|
TAG_1 entity_1 <> TAG_2 entity_2 <> ... <> TAG_n entity_n |
|
``` |
|
|
|
## Dataset |
|
|
|
The original dataset is Quaero French Medical Corpus and I converted it to a JSON format for generative instruction-style training. |
|
|
|
|
|
```json |
|
{ |
|
"input": "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée chez des patients hypertendus et diabétiques non insulinodépendants.", |
|
"output": "DISO tolérance <> CHEM prazosine <> LIVB patients <> DISO hypertendus <> DISO diabétiques non insulinodépendants" |
|
} |
|
``` |
|
|
|
The QUAERO French Medical corpus features **overlapping entity spans**, including nested structures, for instance : |
|
```json |
|
{ |
|
"input": "Cancer du pancréas", |
|
"output": "DISO Cancer <> DISO Cancer du pancréas <> ANAT pancréas" |
|
} |
|
``` |
|
|
|
## Evaluation |
|
|
|
Evaluation was performed on the test split by comparing the predicted entity set against the ground truth annotations using exact (type, entity) matching. |
|
|
|
| Metric | Score | |
|
| --------- | ------ | |
|
| Precision | 0.6883 | |
|
| Recall | 0.7143 | |
|
| F1 Score | 0.7011 | |
|
|
|
|
|
## Other formats |
|
|
|
This model is also available in the following formats: |
|
|
|
- **LoRA Adapter** |
|
→ [yqnis/mistral-7b-quaero-lora](https://huggingface.co/yqnis/mistral-7b-quaero-lora) |
|
|
|
- **GGUF Q5_K_M** |
|
→ [yqnis/mistral-7b-quaero-gguf](https://huggingface.co/yqnis/mistral-7b-quaero-gguf) |
|
|
|
|
|
This mistral model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library. |