---
library_name: transformers
license: apache-2.0
base_model: legmlai/legml-v1.0-base
tags:
- llama-factory
- full
- generated_from_trainer
model-index:
- name: legml-v1.0-instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: gpqa-fr
      type: le-leadboard/gpqa-fr
      config: le-leadboard/gpqa-fr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc
      value: 14.56
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval-fr
      type: le-leadboard/IFEval-fr
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 13.55
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMMLU-fr
      type: le-leadboard/MMMLU-fr
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.57
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: bbh-fr
      type: le-leadboard/bbh-fr
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: acc
      value: 38.71
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: musr-fr
      type: le-leadboard/musr-fr
      config: le-leadboard/musr-fr
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 4.41
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH_LVL5_fr
      type: le-leadboard/MATH_LVL5_fr
      config: le-leadboard/MATH_LVL5_fr
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 34.44
      name: accuracy
datasets:
- legmlai/finefrench-v1
- legmlai/openhermes-fr
language:
- fr
---

# legml-v1.0-instruct: French Excellence in Instruction Tuning

<div align="center">

<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/639c5c448a34ed9a404a956b/d0-xNWyRNOzlrCwOZD3Qf.png" alt="legml.ai" width="120"/>

**Pure AI training AI: a 100% French corpus, carefully selected and quality-controlled**

**Curated by [legml.ai](https://legml.ai) – Leader in AI Data Curation & Quality Assurance**

</div>

---

## 1 • Overview

**`legmlai/legml-v1.0-instruct`** is the *instruction-tuned* version of **legml-v1.0-base** (Qwen-3 · 8B).
It was fine-tuned on **[Open-Hermes-FR](https://huggingface.co/datasets/legmlai/openhermes-fr)**, a corpus of **799,875** instruction/response pairs entirely in French, built by translating the original OpenHermes and then distilling the responses.

The project is designed and maintained by **[Mohamad Alhajar](https://www.linkedin.com/in/mohamad-alhajar/)**.

> **🙏 Thanks to [Nebius](https://nebius.ai/)** for the GPU sponsorship: **24 × H100 80 GB** made this training run possible.

---

## 2 • Specifications

| Parameter | Value |
|-----------|-------|
| **Base** | `legmlai/legml-v1.0-base` (Qwen-3 · 8B) |
| **Model size** | ≈ 16 GB of weights (fp16/bf16) |
| **Instruction dataset** | Open-Hermes-FR – 799,875 pairs, 100% French |
| **Method** | Multi-turn SFT + light DPO |
| **License** | Apache-2.0 |

---

## 3 • About Open-Hermes-FR

- **Origin**: GPT-4o translation into French, followed by response generation and automatic filtering.
- **Size**: ~800k examples with a `prompt` / `accepted_completion` schema plus quality flags (see the loading sketch below).
- **License**: ODC-BY 1.0 (free to use, attribution required).
- **Goal**: provide a consistent, rich foundation for aligning French-speaking LLMs (dialogue, reasoning, QA).
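
To make the schema concrete, here is a minimal loading sketch with the 🤗 `datasets` library. It assumes a `train` split and the `prompt` / `accepted_completion` columns described above; the exact set of quality-flag columns may differ, so the sketch simply prints what is present.

```python
# Minimal sketch: peek at Open-Hermes-FR with the `datasets` library.
# Assumes a `train` split and the `prompt` / `accepted_completion` columns
# described above; the quality-flag columns are inspected rather than hard-coded.
from datasets import load_dataset

ds = load_dataset("legmlai/openhermes-fr", split="train")

print(ds.num_rows)       # ~800k examples
print(ds.column_names)   # prompt, accepted_completion, quality flags, ...

example = ds[0]
print(example["prompt"][:200])               # French instruction
print(example["accepted_completion"][:200])  # French response
```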

---

## 4 • Chat usage example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "legmlai/legml-v1.0-instruct"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system",
     "content": "Tu es un assistant francophone rigoureux et bienveillant."},
    {"role": "user",
     "content": "Explique-moi la relativité restreinte en trois points."},
]

# Build the prompt with the model's chat template, then tokenize it.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(device)

out = model.generate(
    **inputs,
    do_sample=True,          # enable sampling so temperature/top_p take effect
    temperature=0.4,
    top_p=0.9,
    max_new_tokens=512,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens (skip the prompt).
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
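
For quick experiments, the high-level `pipeline` API gives the same flow with less boilerplate. This is a sketch assuming a recent `transformers` release whose text-generation pipeline accepts chat-style message lists; the example question is illustrative.

```python
# Sketch: same chat flow through the high-level pipeline API.
# Requires a recent transformers release that accepts chat-style message lists.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="legmlai/legml-v1.0-instruct",
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "Tu es un assistant francophone rigoureux et bienveillant."},
    {"role": "user", "content": "Résume la Révolution française en deux phrases."},
]

out = chat(messages, max_new_tokens=256, do_sample=True, temperature=0.4, top_p=0.9)
print(out[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```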

---

## 5 • Recommended hyper-parameters

| Scenario | Temperature | top-p | max_new_tokens |
| --------------------- | ----------- | ----- | -------------- |
| Factual answer | 0.3 – 0.5 | 0.9 | 128 – 256 |
| Detailed explanation | 0.4 – 0.6 | 0.9 | 512 – 768 |
| Creative writing | 0.7 – 0.9 | 0.95 | ≥ 512 |
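
The table above can be wired into reusable `GenerationConfig` presets, as in the sketch below; the preset names and the midpoint values chosen here are illustrative, not official defaults.

```python
# Sketch: the recommended settings above as reusable GenerationConfig presets.
# Preset names and the exact midpoint values are illustrative choices.
from transformers import GenerationConfig

PRESETS = {
    "factual":  GenerationConfig(do_sample=True, temperature=0.4, top_p=0.9,  max_new_tokens=256),
    "detailed": GenerationConfig(do_sample=True, temperature=0.5, top_p=0.9,  max_new_tokens=768),
    "creative": GenerationConfig(do_sample=True, temperature=0.8, top_p=0.95, max_new_tokens=512),
}

# Reusing `model`, `tok` and `inputs` from the chat example above:
# out = model.generate(**inputs, generation_config=PRESETS["detailed"])
```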

---

## 6 • Known limitations

1. **Knowledge of events after April 2025** is limited; always double-check recent facts.
2. **Competition-level mathematical reasoning** still has room for improvement.
3. **Bias**: some traces of the source datasets and of GPT-4o remain.

---

## 7 • Citation

```
@misc{legml2025_instruct,
  title        = {legml-v1.0-instruct: French Instruction-Tuned LLM},
  author       = {Mohamad Alhajar},
  howpublished = {https://huggingface.co/legmlai/legml-v1.0-instruct},
  year         = {2025}
}
```

---

© 2025 – [legml.ai](https://legml.ai) • Apache-2.0