---
library_name: transformers
license: apache-2.0
base_model: legmlai/legml-v1.0-base
tags:
- llama-factory
- full
- generated_from_trainer
model-index:
- name: legml-v1.0-instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: gpqa-fr
      type: ai2_arc
      config: le-leadboard/gpqa-fr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc
      value: 14.56
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval-fr
      type: le-leadboard/IFEval-fr
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 13.55
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMMLU-fr
      type: le-leadboard/MMMLU-fr
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.57
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: bbh-fr
      type: le-leadboard/bbh-fr
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: acc
      value: 38.71
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: musr-fr
      type: le-leadboard/musr-fr
      config: le-leadboard/musr-fr
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 4.41
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH_LVL5_fr
      type: le-leadboard/MATH_LVL5_fr
      config: le-leadboard/MATH_LVL5_fr
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 34.44
      name: accuracy
datasets:
- legmlai/finefrench-v1
- legmlai/openhermes-fr
language:
- fr
---
# legml-v1.0-instruct: French Excellence in Instruction Tuning

	<div align="center">

	<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/639c5c448a34ed9a404a956b/d0-xNWyRNOzlrCwOZD3Qf.png" alt="legml.ai" width="120"/>

Pure AI that trains AI: a 100% French-language corpus, curated and quality-controlled
	Curated by [legml.ai](https://legml.ai) – Leader in AI Data Curation & Quality Assurance

	![Model](https://img.shields.io/badge/Model-legml_v1.0_instruct-blue)
	![Language](https://img.shields.io/badge/Language-Français-red)
	![License](https://img.shields.io/badge/License-Apache--2.0-green)
	![Dataset](https://img.shields.io/badge/Dataset-OpenHermes_FR-orange)
	![Pairs](https://img.shields.io/badge/Pairs-~800k-critical)
	![GPU](https://img.shields.io/badge/Training_Resources-24x_H100-purple)

	</div>

	---

## 1 • Overview

`legmlai/legml-v1.0-instruct` is the instruction-tuned variant of legml-v1.0-base (Qwen-3 · 8 B).
It was fine-tuned on [Open-Hermes-FR](https://huggingface.co/datasets/legmlai/openhermes-fr), a corpus of 799,875 instruction/response pairs written exclusively in French, produced by translating and then distilling the original OpenHermes dataset.

Project designed and maintained by [Mohamad Alhajar](https://www.linkedin.com/in/mohamad-alhajar/).
> 🙏 Thanks to [Nebius](https://nebius.ai/) for sponsoring the GPUs: 24 × H100 80 GB, which made this training run possible.

	---

## 2 • Specifications

| Parameter | Value |
|-----------|-------|
| Base | `legmlai/legml-v1.0-base` (Qwen-3 · 8 B) |
| Model size | ≈ 16 GB in fp16 or bf16 (2 bytes per parameter); ≈ 8 GB would require 8-bit quantization |
| Instruction dataset | Open-Hermes-FR: 799,875 pairs, 100% French |
| Method | Multi-turn SFT + light DPO |
| License | Apache-2.0 |
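
The model-size figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below illustrates this, assuming the nominal 8 B parameter count from the table (the exact count for this checkpoint is not stated here) and counting weights only, not activations or the KV cache.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# 8e9 is the nominal "8 B" figure; the exact parameter count is an assumption.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp16", "bf16", "int8"):
    print(f"{dtype}: ~{weight_memory_gb(8e9, dtype):.0f} GB")
```

fp16 and bf16 both use 16 bits per value, so they give the same footprint; only a narrower format such as 8-bit halves it.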

	---

## 3 • About Open-Hermes-FR

- Origin: GPT-4o translation into French, followed by response generation and automatic filtering.
- Size: ~800k examples, with a `prompt` / `accepted_completion` schema (plus quality flags).
- License: ODC-BY 1.0 (open, attribution required).
- Purpose: provide a consistent, rich foundation for aligning French-language LLMs (dialogue, reasoning, QA).
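
As an illustration of the `prompt` / `accepted_completion` schema, the sketch below converts one record into the chat-message format used by chat templates. The helper name `to_chat_messages` and the sample record are hypothetical, written here for illustration only; the quality-flag columns are ignored because their names are not given in this card.

```python
from typing import Dict, List


def to_chat_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one Open-Hermes-FR style record into a user/assistant message pair."""
    return [
        {"role": "user", "content": record["prompt"]},
        {"role": "assistant", "content": record["accepted_completion"]},
    ]


# Hypothetical record for illustration (not taken from the dataset).
example = {
    "prompt": "Quelle est la capitale de la France ?",
    "accepted_completion": "La capitale de la France est Paris.",
}

for message in to_chat_messages(example):
    print(f"{message['role']}: {message['content']}")
```

With the `datasets` library, real records could presumably be obtained via `load_dataset("legmlai/openhermes-fr")`; the available split names are not documented here, so check the dataset page before relying on one.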

	---

## 4 • Chat usage example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "legmlai/legml-v1.0-instruct"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place the weights on the available device(s)
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)

messages = [
    {"role": "system",
     "content": "Tu es un assistant francophone rigoureux et bienveillant."},
    {"role": "user",
     "content": "Explique-moi la relativité restreinte en trois points."},
]

# Render the conversation with the chat template and open the assistant turn.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Send the inputs to wherever device_map placed the model.
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    do_sample=True,  # make sure sampling is on so temperature/top_p take effect
    temperature=0.4,
    top_p=0.9,
    max_new_tokens=512,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

	---

## 5 • Recommended hyperparameters

| Scenario | Temperature | top-p | max_new_tokens |
| -------- | ----------- | ----- | -------------- |
| Factual answer | 0.3 – 0.5 | 0.9 | 128 – 256 |
| Detailed explanation | 0.4 – 0.6 | 0.9 | 512 – 768 |
| Creative writing | 0.7 – 0.9 | 0.95 | ≥ 512 |
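
One way to apply these settings is to keep them as named presets and pass them to `generate`. The sketch below is only an illustration: the preset names, the helper `generation_kwargs`, and the specific value picked inside each range are assumptions, not part of the model's API.

```python
# One value picked from inside each recommended range (the choice is an assumption).
GENERATION_PRESETS = {
    "factual": {"temperature": 0.4, "top_p": 0.9, "max_new_tokens": 256},
    "detailed": {"temperature": 0.5, "top_p": 0.9, "max_new_tokens": 768},
    "creative": {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 512},
}


def generation_kwargs(scenario: str) -> dict:
    """Return keyword arguments for `model.generate` for a given scenario."""
    # do_sample=True so that temperature and top_p are actually used.
    return {"do_sample": True, **GENERATION_PRESETS[scenario]}


print(generation_kwargs("factual"))
```

With a loaded model and tokenized inputs, this would be used as `model.generate(**inputs, **generation_kwargs("factual"))`.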

	---

## 6 • Known limitations

1. Limited knowledge of events after April 2025: always verify recent facts.
2. Competition-level mathematical reasoning still has room for improvement.
3. Bias: some traces of the source datasets and of GPT-4o remain.

	---

	## 7 • Citation

```
@misc{legml2025_instruct,
  title        = {legml-v1.0-instruct: French Instruction-Tuned LLM},
  author       = {Mohamad Alhajar},
  howpublished = {https://huggingface.co/legmlai/legml-v1.0-instruct},
  year         = {2025}
}
```

	---

	© 2025 – [legml.ai](https://legml.ai) • Apache-2.0