---
library_name: transformers
tags:
- unsloth
- trl
- sft
---

# LLaMA 3 8B fine-tuned on Quaero for Named Entity Recognition (Generative)

This model is a 16-bit merged version of [unsloth/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct), fine-tuned on the [Quaero French medical dataset](https://quaerofrenchmed.limsi.fr/) using a generative approach to Named Entity Recognition (NER).
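
The merged weights load like any other 16-bit `transformers` checkpoint. Below is a minimal inference sketch; the repository id, prompt wording, and generation settings are placeholders for illustration, not the exact setup used during training.

```python
# Minimal inference sketch (hypothetical repo id and illustrative prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yqnis/llama3-8b-quaero"  # placeholder: replace with this repository's id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

sentence = (
    "Etude de l'efficacité et de la tolérance de la prazosine à libération "
    "prolongée chez des patients hypertendus et diabétiques non insulinodépendants."
)

# Illustrative instruction; the exact template used for fine-tuning may differ.
messages = [
    {
        "role": "user",
        "content": (
            "Extract the medical entities from the following French sentence and "
            "return them as 'TAG entity' pairs separated by '<>'.\n\n" + sentence
        ),
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```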

## Task

The model was trained to extract entities from French biomedical sentences (MEDLINE titles) using a structured, prompt-based format.

| Tag    | Description                                                 |
| ------ | ----------------------------------------------------------- |
| `DISO` | **Diseases** or health-related conditions                   |
| `ANAT` | **Anatomical parts** (organs, tissues, body regions, etc.)  |
| `PROC` | **Medical or surgical procedures**                          |
| `DEVI` | **Medical devices or instruments**                          |
| `CHEM` | **Chemical substances or medications**                      |
| `LIVB` | **Living beings** (e.g. humans, animals, bacteria, viruses) |
| `GEOG` | **Geographical locations** (e.g. countries, regions)        |
| `OBJC` | **Physical objects** not covered by other categories        |
| `PHEN` | **Biological processes** (e.g. inflammation, mutation)      |
| `PHYS` | **Physiological functions** (e.g. respiration, vision)      |

I use `<>` as a separator, and the output format is:

```
TAG_1 entity_1 <> TAG_2 entity_2 <> ... <> TAG_n entity_n
```
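
The generated string can be mapped back to structured predictions by splitting on the separator. A small parsing sketch (the helper name is illustrative, not part of the released code):

```python
def parse_entities(generation: str) -> list[tuple[str, str]]:
    """Split a model generation on '<>' and return (TAG, entity) pairs."""
    pairs = []
    for chunk in generation.split("<>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        tag, _, entity = chunk.partition(" ")  # tag is the first token, entity is the rest
        pairs.append((tag, entity.strip()))
    return pairs


print(parse_entities("DISO tolérance <> CHEM prazosine <> LIVB patients"))
# [('DISO', 'tolérance'), ('CHEM', 'prazosine'), ('LIVB', 'patients')]
```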

## Dataset

The original dataset is the QUAERO French Medical Corpus, which I converted to a JSON format for generative, instruction-style training.


```json
{
  "input": "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée chez des patients hypertendus et diabétiques non insulinodépendants.",
  "output": "DISO tolérance <> CHEM prazosine <> LIVB patients <> DISO hypertendus <> DISO diabétiques non insulinodépendants"
}
```

The QUAERO French Medical Corpus features **overlapping entity spans**, including nested structures, for instance:
```json
{
  "input": "Cancer du pancréas",
  "output": "DISO Cancer <> DISO Cancer du pancréas <> ANAT pancréas"
}
```
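
Records in this shape can be loaded with the `datasets` library and rendered into instruction-style text for SFT. The file name and prompt template below are assumptions for illustration, not the exact ones used for training.

```python
from datasets import load_dataset

# Hypothetical file name; point this at the converted JSON/JSONL file.
dataset = load_dataset("json", data_files="quaero_train.jsonl", split="train")

def to_text(example):
    # Illustrative SFT formatting; the actual prompt template may differ.
    return {
        "text": (
            "### Instruction:\n"
            "Extract the medical entities as 'TAG entity' pairs separated by '<>'.\n"
            f"### Input:\n{example['input']}\n"
            f"### Output:\n{example['output']}"
        )
    }

dataset = dataset.map(to_text)
print(dataset[0]["text"])
```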

## Evaluation

Evaluation was performed on the test split by comparing the predicted entity set against the ground truth annotations using exact (type, entity) matching.

| Metric    | Score  |
| --------- | ------ |
| Precision | 0.6827 |
| Recall    | 0.7121 |
| F1 Score  | 0.6971 |
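
The scoring described above can be reproduced along these lines; this is a sketch of set-based exact (type, entity) matching, not the exact evaluation script.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision/recall/F1 over exact (TAG, entity) matches.

    `gold` and `pred` are lists (one item per test sentence) of (TAG, entity)
    pairs, compared as sets so duplicates within a sentence count once.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp += len(g_set & p_set)
        fp += len(p_set - g_set)
        fn += len(g_set - p_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```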


## Other formats

This model is also available in the following formats:

- **LoRA Adapter**  
  → [yqnis/llama3-8b-quaero-lora](https://huggingface.co/yqnis/llama3-8b-quaero-lora)

- **GGUF Q8_0**  
  → [yqnis/llama3-8b-quaero-gguf](https://huggingface.co/yqnis/llama3-8b-quaero-gguf)


This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.