---
library_name: transformers
tags:
- unsloth
- trl
- sft
- med
- mistral
---

# Mistral 7B Instruct v0.3 fine-tuned on Quaero for Named Entity Recognition (Generative)

This model is a 16-bit merged version of [unsloth/mistral-7b-instruct-v0.3](https://huggingface.co/unsloth/mistral-7b-instruct-v0.3), fine-tuned on the [Quaero French medical dataset](https://quaerofrenchmed.limsi.fr/) using a generative approach to Named Entity Recognition (NER).
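
A minimal usage sketch with `transformers` is shown below. The repo id and the instruction wording are placeholders (the exact prompt used during fine-tuning is not documented on this card), so adapt both to your setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholder: replace with this repository's id on the Hub.
model_id = "yqnis/mistral-7b-quaero"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative instruction; the exact prompt used for fine-tuning may differ.
sentence = "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée."
messages = [{"role": "user", "content": f"Extract the medical entities from this sentence:\n{sentence}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```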

## Task

The model was trained to extract entities from French biomedical sentences (MEDLINE titles) using a structured, prompt-based format. It covers the following entity types:

| Tag    | Description                                                 |
| ------ | ----------------------------------------------------------- |
| `DISO` | **Diseases** or health-related conditions                   |
| `ANAT` | **Anatomical parts** (organs, tissues, body regions, etc.)  |
| `PROC` | **Medical or surgical procedures**                          |
| `DEVI` | **Medical devices or instruments**                          |
| `CHEM` | **Chemical substances or medications**                      |
| `LIVB` | **Living beings** (e.g. humans, animals, bacteria, viruses) |
| `GEOG` | **Geographical locations** (e.g. countries, regions)        |
| `OBJC` | **Physical objects** not covered by other categories        |
| `PHEN` | **Biological processes** (e.g. inflammation, mutation)      |
| `PHYS` | **Physiological functions** (e.g. respiration, vision)      |

I use `<>` as a separator, and the output format is:

```
TAG_1 entity_1 <> TAG_2 entity_2 <> ... <> TAG_n entity_n
```
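
A small helper for turning that string back into (tag, entity) pairs might look like the sketch below; `KNOWN_TAGS` simply mirrors the table above:

```python
KNOWN_TAGS = {"DISO", "ANAT", "PROC", "DEVI", "CHEM", "LIVB", "GEOG", "OBJC", "PHEN", "PHYS"}

def parse_prediction(text: str) -> list[tuple[str, str]]:
    """Split 'TAG_1 entity_1 <> TAG_2 entity_2 <> ...' into (tag, entity) pairs."""
    pairs = []
    for chunk in text.split("<>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        tag, _, entity = chunk.partition(" ")
        if tag in KNOWN_TAGS and entity:
            pairs.append((tag, entity.strip()))
    return pairs

# parse_prediction("DISO Cancer <> ANAT pancréas")
# -> [("DISO", "Cancer"), ("ANAT", "pancréas")]
```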

## Dataset

The original dataset is the QUAERO French Medical Corpus, which I converted to a JSON format for generative instruction-style training.


```json
{
  "input": "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée chez des patients hypertendus et diabétiques non insulinodépendants.",
  "output": "DISO tolérance <> CHEM prazosine <> LIVB patients <> DISO hypertendus <> DISO diabétiques non insulinodépendants"
}
```

The QUAERO French Medical Corpus features **overlapping entity spans**, including nested structures, for instance:
```json
{
  "input": "Cancer du pancréas",
  "output": "DISO Cancer <> DISO Cancer du pancréas <> ANAT pancréas"
}
```
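
For instruction-style SFT (e.g. with TRL's `SFTTrainer`), each JSON record can be rendered into a single training text. The sketch below assumes the JSON layout shown above and a local file name of my choosing; the instruction wording is an assumption, not the exact prompt used to train this model:

```python
from datasets import load_dataset

# Hypothetical local file produced by the Quaero -> JSON conversion described above.
dataset = load_dataset("json", data_files="quaero_train.json", split="train")

def format_example(example: dict) -> str:
    # Illustrative template; the actual prompt used for this model is not documented here.
    return (
        "### Instruction:\nExtract the medical entities from the sentence below, "
        "as 'TAG entity' pairs separated by '<>'.\n"
        f"### Input:\n{example['input']}\n"
        f"### Response:\n{example['output']}"
    )

print(format_example(dataset[0]))
```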

## Evaluation

Evaluation was performed on the test split by comparing the predicted entity set against the ground truth annotations using exact (type, entity) matching.

| Metric    | Score  |
| --------- | ------ |
| Precision | 0.6883 |
| Recall    | 0.7143 |
| F1 Score  | 0.7011 |
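
A sketch of that exact-match scoring, assuming predictions and references have already been parsed into (tag, entity) pairs (e.g. with the parsing helper shown earlier) and treating each sentence's annotations as a set:

```python
def score(predicted: list[list[tuple[str, str]]],
          gold: list[list[tuple[str, str]]]) -> dict[str, float]:
    """Micro-averaged precision/recall/F1 over exact (tag, entity) matches."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        tp += len(pred_set & ref_set)
        fp += len(pred_set - ref_set)
        fn += len(ref_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# score([[("DISO", "Cancer")]], [[("DISO", "Cancer"), ("ANAT", "pancréas")]])
# -> {"precision": 1.0, "recall": 0.5, "f1": 0.666...}
```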


## Other formats

This model is also available in the following formats:

- **LoRA Adapter**  
  → [yqnis/mistral-7b-quaero-lora](https://huggingface.co/yqnis/mistral-7b-quaero-lora)

- **GGUF Q8_0**  
  → [yqnis/mistral-7b-quaero-gguf](https://huggingface.co/yqnis/mistral-7b-quaero-gguf)
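
If you prefer the adapter, it can be applied on top of the base model with `peft`; a minimal sketch, using the repo ids from the links above:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/mistral-7b-instruct-v0.3", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "yqnis/mistral-7b-quaero-lora")
tokenizer = AutoTokenizer.from_pretrained("unsloth/mistral-7b-instruct-v0.3")
```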


This Mistral model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.