---
datasets:
- louisbrulenaudet/Romulus-cpt-fr
license: llama3
language:
- fr
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- law
- droit
- unsloth
- trl
- transformers
- sft
- llama
---
[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)
# QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF
This is a quantized version of [louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1](https://huggingface.co/louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1) created using llama.cpp.
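A minimal sketch of loading one of these GGUF files with the llama-cpp-python bindings; the exact quantization filename and the legal reference in the prompt are assumptions, so substitute whichever file you downloaded:

```python
# Sketch only: assumes `pip install llama-cpp-python` and a downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="Romulus-cpt-Llama-3.1-8B-v0.1.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,  # matches the max_seq_length used during training
)

# The base model was pre-trained on "### Référence : / ### Contenu :" prompts
# (see the training script below); the reference here is purely illustrative.
out = llm("### Référence :\nCode civil, art. 1103\n### Contenu :\n", max_tokens=256)
print(out["choices"][0]["text"])
```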
# Original Model Card
<img src="assets/thumbnail.webp">
# Romulus, continually pre-trained models for French law.
Romulus is a series of models continually pre-trained on French legal corpora, intended to serve as the basis for fine-tuning on labeled data. Please note that these models have not been aligned and will not produce usable text as they stand; they will need to be fine-tuned for the desired tasks in order to produce satisfactory results.
The training corpus comprises 34,864,949 tokens, counted with the meta-llama/Meta-Llama-3.1-8B tokenizer.
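A minimal sketch of how that count can be reproduced with the transformers tokenizer (the `texte` column name comes from the training script below; a row-by-row loop is shown for clarity and will be slow on a corpus this size):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
dataset = load_dataset("louisbrulenaudet/Romulus-cpt-fr", split="train")

# Sum token counts over the corpus; batch the tokenizer calls for speed in practice.
total_tokens = sum(len(tokenizer(text)["input_ids"]) for text in dataset["texte"])
print(f"{total_tokens:,} tokens")
```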
## Hyperparameters
The following table outlines the key hyperparameters used for training Romulus.
| **Parameter** | **Description** | **Value** |
|----------------------------------|-----------------------------------------------------------------|-----------------------------|
| `max_seq_length` | Maximum sequence length for the model | 4096 |
| `load_in_4bit` | Whether to load the model in 4-bit precision | False |
| `model_name` | Pre-trained model name from Hugging Face | meta-llama/Meta-Llama-3.1-8B|
| `r` | Rank of the LoRA adapter | 128 |
| `lora_alpha` | Alpha value for the LoRA module | 32 |
| `lora_dropout` | Dropout rate for LoRA layers | 0 |
| `bias` | Bias type for LoRA adapters | none |
| `use_gradient_checkpointing` | Whether to use gradient checkpointing | unsloth |
| `train_batch_size` | Per device training batch size | 8 |
| `gradient_accumulation_steps` | Number of gradient accumulation steps | 8 |
| `warmup_ratio` | Warmup steps as a fraction of total steps | 0.1 |
| `num_train_epochs` | Number of training epochs | 1 |
| `learning_rate` | Learning rate for the model | 5e-5 |
| `embedding_learning_rate` | Learning rate for embeddings | 1e-5 |
| `optim` | Optimizer used for training | adamw_8bit |
| `weight_decay` | Weight decay to prevent overfitting | 0.01 |
| `lr_scheduler_type` | Type of learning rate scheduler | linear |
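With `train_batch_size` 8 and `gradient_accumulation_steps` 8, the effective batch size is 8 × 8 = 64 sequences per optimizer step, i.e. up to 64 × 4096 = 262,144 tokens at the configured `max_seq_length`.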
# Training script
Romulus was trained with Unsloth on an NVIDIA H100 Azure East US instance, provided by the Microsoft for Startups program, using the following script:
```python
# -*- coding: utf-8 -*-
from datasets import load_dataset
from unsloth import (
FastLanguageModel,
is_bfloat16_supported,
UnslothTrainer,
UnslothTrainingArguments,
)
max_seq_length = 4096
dtype = None
load_in_4bit = False
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="meta-llama/Meta-Llama-3.1-8B",
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=load_in_4bit,
token="hf_token",
)
model = FastLanguageModel.get_peft_model(
model,
r=128,
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
"embed_tokens",
"lm_head",
],
lora_alpha=32,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=3407,
use_rslora=True,
loftq_config=None,
)
prompt = """### Référence :
{}
### Contenu :
{}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
"""
Format input examples into prompts for a language model.
This function takes a dictionary of examples containing titles and texts,
combines them into formatted prompts, and appends an end-of-sequence token.
Parameters
----------
examples : dict
A dictionary containing two keys:
- 'title': A list of titles.
- 'text': A list of corresponding text content.
Returns
-------
dict
A dictionary with a single key 'text', containing a list of formatted prompts.
Notes
-----
- The function assumes the existence of a global `prompt` variable, which is a
formatting string used to combine the title and text.
- The function also assumes the existence of a global `EOS_TOKEN` variable,
which is appended to the end of each formatted prompt.
- The input lists 'title' and 'text' are expected to have the same length.
Examples
--------
>>> examples = {
... 'title': ['Title 1', 'Title 2'],
... 'text': ['Content 1', 'Content 2']
... }
>>> formatting_cpt_prompts_func(examples)
{'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
"""
refs = examples["ref"]
texts = examples["texte"]
outputs = []
for ref, text in zip(refs, texts):
text = prompt.format(ref, text) + EOS_TOKEN
outputs.append(text)
return {
"text": outputs,
}
cpt_dataset = load_dataset(
"louisbrulenaudet/Romulus-cpt-fr",
split="train",
token="hf_token",
)
cpt_dataset = cpt_dataset.map(
formatting_prompts_func,
batched=True,
)
trainer = UnslothTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=cpt_dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
args=UnslothTrainingArguments(
per_device_train_batch_size=8,
gradient_accumulation_steps=8,
warmup_ratio=0.1,
num_train_epochs=1,
learning_rate=5e-5,
embedding_learning_rate=1e-5,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
report_to="wandb",
save_steps=350,
run_name="romulus-cpt",
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
),
)
trainer_stats = trainer.train()
```
<img src="assets/loss.png">
## Citing & Authors
If you use this code in your research, please use the following BibTeX entry.
```BibTeX
@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {Romulus, continually pre-trained models for French law},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
}
```
## Feedback
If you have any feedback, please reach out at [[email protected]](mailto:[email protected]).