---
library_name: transformers
tags:
- unsloth
- trl
- sft
- med
- mistral
- quaero
- lora
---

# Mistral 7B fine-tuned on Quaero for Named Entity Recognition (Generative)

This is a **LoRA adapter** version of [unsloth/mistral-7b-instruct-v0.3](https://huggingface.co/unsloth/mistral-7b-instruct-v0.3), fine-tuned on the [Quaero French medical dataset](https://quaerofrenchmed.limsi.fr/) using a generative approach to Named Entity Recognition (NER).
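
Since this repository hosts only the LoRA adapter, it has to be attached to the base model at load time. Below is a minimal usage sketch with `transformers` and `peft`; the adapter id shown is a placeholder for this repository, and the exact instruction template used during training is not reproduced here.

```python
# Minimal sketch (not shipped with this repo): attach the LoRA adapter to the base model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "unsloth/mistral-7b-instruct-v0.3"
ADAPTER_ID = "yqnis/mistral-7b-quaero-lora"  # placeholder: use this adapter repo's actual Hub id

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_ID)

# The training prompt format is not reproduced here; this only shows how to run
# generation once the adapter is loaded.
text = "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée chez des patients hypertendus et diabétiques non insulinodépendants."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```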

## Task

The model was trained to extract entities from French biomedical sentences (MEDLINE titles) using a structured, prompt-based format.

| Tag    | Description                                                 |
| ------ | ----------------------------------------------------------- |
| `DISO` | **Diseases** or health-related conditions                   |
| `ANAT` | **Anatomical parts** (organs, tissues, body regions, etc.)  |
| `PROC` | **Medical or surgical procedures**                          |
| `DEVI` | **Medical devices or instruments**                          |
| `CHEM` | **Chemical substances or medications**                      |
| `LIVB` | **Living beings** (e.g. humans, animals, bacteria, viruses) |
| `GEOG` | **Geographical locations** (e.g. countries, regions)        |
| `OBJC` | **Physical objects** not covered by other categories        |
| `PHEN` | **Biological processes** (e.g. inflammation, mutation)      |
| `PHYS` | **Physiological functions** (e.g. respiration, vision)      |

I use `<>` as a separator, and the output format is:

```
TAG_1 entity_1 <> TAG_2 entity_2 <> ... <> TAG_n entity_n
```
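
Since the model emits this flat string, the generation can be parsed back into `(tag, entity)` pairs by splitting on the separator. Here is a small illustrative helper (not part of the released code):

```python
def parse_entities(generation: str) -> list[tuple[str, str]]:
    """Split 'TAG_1 entity_1 <> TAG_2 entity_2 <> ...' into (tag, entity) pairs."""
    pairs = []
    for chunk in generation.split("<>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        tag, _, entity = chunk.partition(" ")  # first token is the tag, the rest is the entity text
        pairs.append((tag, entity.strip()))
    return pairs

print(parse_entities("DISO tolérance <> CHEM prazosine <> LIVB patients"))
# [('DISO', 'tolérance'), ('CHEM', 'prazosine'), ('LIVB', 'patients')]
```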

## Dataset

The original dataset is the QUAERO French Medical Corpus, which I converted to a JSON format for generative instruction-style training.


```json
{
  "input": "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée chez des patients hypertendus et diabétiques non insulinodépendants.",
  "output": "DISO tolérance <> CHEM prazosine <> LIVB patients <> DISO hypertendus <> DISO diabétiques non insulinodépendants"
}
```

The QUAERO French Medical Corpus features **overlapping entity spans**, including nested structures, for instance:
```json
{
  "input": "Cancer du pancréas",
  "output": "DISO Cancer <> DISO Cancer du pancréas <> ANAT pancréas"
}
```

## Evaluation

Evaluation was performed on the test split by comparing the predicted entity set against the ground truth annotations using exact (type, entity) matching.

| Metric    | Score  |
| --------- | ------ |
| Precision | 0.6883 |
| Recall    | 0.7143 |
| F1 Score  | 0.7011 |
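
For reference, this exact (type, entity) matching can be scored with a small helper like the one below (an illustrative sketch, not the original evaluation script):

```python
def prf1(pred: set[tuple[str, str]], gold: set[tuple[str, str]]) -> tuple[float, float, float]:
    """Exact matching: a prediction counts only if both the tag and the entity text match."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf1(
    {("DISO", "Cancer"), ("ANAT", "pancréas")},
    {("DISO", "Cancer"), ("DISO", "Cancer du pancréas"), ("ANAT", "pancréas")},
)
# p = 1.0, r ≈ 0.667, f ≈ 0.8
```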


## Other formats

This model is also available in the following formats:

- **16bit**  
  → [yqnis/mistral-7b-quaero](https://huggingface.co/yqnis/mistral-7b-quaero)

- **GGUF Q5_K_M**  
  → [yqnis/mistral-7b-quaero-gguf](https://huggingface.co/yqnis/mistral-7b-quaero-gguf)


This Mistral model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.