---
library_name: transformers
tags:
- unsloth
- trl
- sft
---
# Llama 3.1 8B fine-tuned on Quaero for Named Entity Recognition (Generative)
This model is a 16-bit merged version of [unsloth/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct), fine-tuned on the [Quaero French medical dataset](https://quaerofrenchmed.limsi.fr/) using a generative approach to Named Entity Recognition (NER).
## Task
The model was trained to extract entities from French biomedical sentences (MEDLINE titles) using a structured, prompt-based format. It recognizes the following entity types:
| Tag | Description |
| ------ | ----------------------------------------------------------- |
| `DISO` | **Diseases** or health-related conditions |
| `ANAT` | **Anatomical parts** (organs, tissues, body regions, etc.) |
| `PROC` | **Medical or surgical procedures** |
| `DEVI` | **Medical devices or instruments** |
| `CHEM` | **Chemical substances or medications** |
| `LIVB` | **Living beings** (e.g. humans, animals, bacteria, viruses) |
| `GEOG` | **Geographical locations** (e.g. countries, regions) |
| `OBJC` | **Physical objects** not covered by other categories |
| `PHEN` | **Biological processes** (e.g. inflammation, mutation) |
| `PHYS` | **Physiological functions** (e.g. respiration, vision) |
I use `<>` as a separator, and the output format is:
```
TAG_1 entity_1 <> TAG_2 entity_2 <> ... <> TAG_n entity_n
```
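
To make the format concrete, here is a minimal sketch of how a generated string could be parsed back into `(tag, entity)` pairs. The function name `parse_entities` and the choice to silently skip chunks with an unknown tag or an empty entity are assumptions for illustration, not part of the original training or evaluation code.

```python
from typing import List, Tuple

SEPARATOR = "<>"
# The ten tags listed in the table above.
TAGS = {"DISO", "ANAT", "PROC", "DEVI", "CHEM", "LIVB", "GEOG", "OBJC", "PHEN", "PHYS"}


def parse_entities(output: str) -> List[Tuple[str, str]]:
    """Split a generated string into (tag, entity) pairs.

    Hypothetical helper: chunks with an unknown tag or no entity text
    are dropped rather than raising an error.
    """
    pairs = []
    for chunk in output.split(SEPARATOR):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The tag is the first whitespace-delimited token; the rest is the entity.
        tag, _, entity = chunk.partition(" ")
        entity = entity.strip()
        if tag in TAGS and entity:
            pairs.append((tag, entity))
    return pairs


print(parse_entities("CHEM prazosine <> LIVB patients"))
```

Skipping malformed chunks keeps the parser tolerant of occasional generation errors, at the cost of hiding them; a stricter variant could raise instead.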
## Dataset
The original dataset is the Quaero French Medical Corpus, which I converted to JSON for generative, instruction-style training. Each record pairs an input sentence with its target output string:
```json
{
"input": "Etude de l'efficacité et de la tolérance de la prazosine à libération prolongée chez des patients hypertendus et diabétiques non insulinodépendants.",
"output": "DISO tolérance <> CHEM prazosine <> LIVB patients <> DISO hypertendus <> DISO diabétiques non insulinodépendants"
}
```
The Quaero French Medical Corpus features **overlapping entity spans**, including nested structures. For instance:
```json
{
"input": "Cancer du pancréas",
"output": "DISO Cancer <> DISO Cancer du pancréas <> ANAT pancréas"
}
```
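
As an illustration of how overlapping spans can be flattened into this format, the sketch below turns offset-based annotations into a JSON record. The `(start, end, tag)` span representation, the `to_record` name, and the ordering rule (sort by start offset, then end offset) are assumptions; the ordering rule happens to reproduce the example above, but the actual conversion script may differ.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical annotation shape: (start_offset, end_offset, tag).
Span = Tuple[int, int, str]


def to_record(text: str, spans: List[Span]) -> Dict[str, str]:
    """Build an {"input", "output"} record from character-offset spans.

    Overlapping and nested spans are simply emitted as separate entries.
    """
    # Assumed ordering: by start offset, shorter span first on ties.
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    parts = [f"{tag} {text[start:end]}" for start, end, tag in ordered]
    return {"input": text, "output": " <> ".join(parts)}


record = to_record(
    "Cancer du pancréas",
    [(10, 18, "ANAT"), (0, 18, "DISO"), (0, 6, "DISO")],
)
print(json.dumps(record, ensure_ascii=False))
```

Because the output keeps only the tag and surface text, character offsets are lost; two mentions with the same tag and text in one sentence become indistinguishable.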
## Evaluation
Evaluation was performed on the test split by comparing the set of predicted entities against the ground-truth annotations. A prediction counts as correct only when both its type and its entity text match exactly.
| Metric | Score |
| --------- | ------ |
| Precision | 0.6827 |
| Recall | 0.7121 |
| F1 Score | 0.6971 |
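
For reference, exact `(type, entity)` set matching can be sketched as follows. This is not the original evaluation script: the function name `micro_prf` and the choice to micro-average counts over all sentences are assumptions, since the averaging scheme is not specified here.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, str]  # (tag, entity text)


def micro_prf(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-sentence entity sets."""
    # A true positive is a pair present in both sets for the same sentence.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [{("DISO", "Cancer"), ("ANAT", "pancréas")}]
pred = [{("DISO", "Cancer"), ("CHEM", "pancréas")}]
print(micro_prf(gold, pred))
```

In this toy case the second prediction has the right text but the wrong tag, so it counts as both a false positive and a missed gold entity.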
## Other formats
This model is also available in the following formats:
- **LoRA Adapter**
→ [yqnis/llama3-8b-quaero-lora](https://huggingface.co/yqnis/llama3-8b-quaero-lora)
- **GGUF Q8_0**
→ [yqnis/llama3-8b-quaero-gguf](https://huggingface.co/yqnis/llama3-8b-quaero-gguf)
This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.